Imagine a tool that not only crafts captivating narratives but also brings them to life through striking visual illustrations. Welcome to our latest venture, the Image Story Generator project. This project explores the collaboration between state-of-the-art text-generation LLMs, Meta's Llama-2-7b-chat-hf and OpenAI's gpt-3.5-turbo, and cutting-edge Stable Diffusion image-generation models, stabilityai/stable-diffusion-2-1 and runwayml/stable-diffusion-v1-5. The project was part of the GenAI Hackathon organized by Intel in association with Cognizance'24, IIT Roorkee, and was built by a team of two. A similar article about this project, written by my teammate, can be found on Medium.
The objective of our project is to deliver a creative, engaging story from a short description of the plot or topic given by the user as a prompt.
Large models with billions of parameters require a huge amount of computation. We are grateful to Intel for providing free access to the Intel Developer Cloud for the hackathon, which enabled us to run our programs on the latest Intel processors and GPUs. Intel has built strong infrastructure in the Intel Developer Cloud for running multi-gigabyte Hugging Face models, and tools like oneAPI, Intel processors, and Intel-optimized deep learning libraries were crucial to the success of our project.
Github: https://github.com/eklavyaK/GenAI_Hackathon
LLMs are powerful tools used across many industries today, and several powerful open-source LLMs are free to use even for commercial purposes. The two LLMs used in this project are listed below.
1. gpt-3.5-turbo:
It is not available publicly but can be accessed using an OpenAI API key. Inference is fast because the query runs directly on OpenAI's servers. A query can generate at most 4,096 output tokens, with a maximum of 16,385 tokens in the context window (input + output). To access this model through the API, go to the OpenAI website, create a new OpenAI account, and generate a new API key by verifying your phone number. New users receive some limited free credits.
2. Llama-2–7B-chat-hf:
It is an open-source model, available for commercial use, that can be downloaded from Hugging Face; this was the main reason we switched from gpt-3.5-turbo. Its performance is on par with gpt-3.5-turbo, and it can be downloaded for free. Provided by Meta, the model has 7B parameters, is optimized for dialogue use cases, and has been converted to the Hugging Face Transformers format.
The two widely used open-source stable diffusion models in this project are stabilityai/stable-diffusion-2-1 and runwayml/stable-diffusion-v1-5.
1. stable-diffusion-2-1
This model is fine-tuned from stable-diffusion-2 (768-v-ema.ckpt) with an additional 55k steps on the same dataset (with punsafe = 0.1), and then fine-tuned for another 155k extra steps with punsafe = 0.98.
2. stable-diffusion-v1-5
Its checkpoint was initialized with the weights of the stable-diffusion-v1-2 checkpoint and subsequently fine-tuned for 595k steps at resolution 512×512 on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
Generating Images with Stable Diffusion:
Make sure the following libraries are installed:
pillow:
pip install pillow
torch:
pip install torch torchvision torchaudio
transformers and diffusers:
pip install transformers "diffusers[torch]"
Let's run a basic inference with our stable diffusion model. Here we use image-to-image transformation based on a prompt: an image is generated from an initial base image and a prompt, with the base image transformed according to the prompt. The colors of the base image dominate the colors of most of the final image, so if you want the output to have a particular look, you can simply feed in a base image similar to the final image you want.
Here we're trying to generate an image of a village scene with farmers:
Prompt: ‘A cartoon style image of two village farmers planting seeds in their fields’
The base image used is:
import torch
from PIL import Image
from pathlib import Path
import transformers
from diffusers import StableDiffusionImg2ImgPipeline, DPMSolverMultistepScheduler

class Img2ImgModel:
def __init__(
self,
model_id_or_path: str,
model_dir = "",
device: str = "cpu", # use "cuda" or "xpu" if available to make inference much faster
torch_dtype: torch.dtype = torch.bfloat16,
optimize: bool = True,
warmup: bool = False,
scheduler: bool = True,
) -> None:
self.device = device
self.data_type = torch_dtype
self.scheduler = scheduler
self.generator = torch.Generator()
self.pipeline = self._load_pipeline(model_id_or_path, torch_dtype, model_dir)
def _load_pipeline(
self, model_id_or_path: str, torch_dtype: torch.dtype, model_dir
) -> StableDiffusionImg2ImgPipeline:
model_path = Path(f"{model_dir}/{model_id_or_path}")
if model_path.exists():
load_path = model_path
else:
print("Using the default path for models...")
load_path = model_id_or_path
pipeline = StableDiffusionImg2ImgPipeline.from_pretrained( # loading the pipeline
load_path, # if model is not already installed it'll be installed first while loading the pipeline
torch_dtype = torch_dtype,
use_safetensors = True,
variant = "fp16",
)
if self.scheduler:
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
pipeline.scheduler.config
)
if not model_path.exists():
try:
print(f"Attempting to save the model to {model_path}...")
pipeline.save_pretrained(f"{model_path}")
print("Model saved.")
except Exception as e:
print(f"An error occurred while saving the model: {e}. Proceeding without saving.")
pipeline = pipeline.to(self.device)
return pipeline
def generate_images(
self,
prompt: str,
image_path: str,
num_inference_steps: int = 200,
strength: float = 0.75,
guidance_scale: float = 7.5,
batch_size: int = 1,
):
init_image = Image.open(image_path)
try:
image = self.pipeline(
prompt = prompt,
image = init_image,
strength = strength,
guidance_scale = guidance_scale,
num_inference_steps=num_inference_steps,
).images
return image[0]
except Exception: # fall back to the base image if generation fails
return init_image
image_path = '3.jpg'
model_id = 'runwayml/stable-diffusion-v1-5'
model = Img2ImgModel(model_id, device = 'cpu')
prompt = "A cartoon style image of two village farmers planting seeds in their fields"
img = model.generate_images(prompt = prompt, image_path = image_path)
img.show()
The following image is generated:
Notice the colors and their positions: the same colors appear in roughly the same locations as in the base image.
1. gpt-3.5-turbo
Accessing gpt-3.5-turbo with an API key is shown in the code below.
Prompt: ‘write a 100 word article on AI’
import openai as OpenAI

OpenAI.api_key = "" # Enter your API key here
def get_completion(prompt, model = "gpt-3.5-turbo"):
messages = [{"role": "user", "content": prompt}]
response = ""
try:
response = OpenAI.chat.completions.create(model = model, messages = messages, temperature = 0,)
except Exception:
print("\n\nOpenAI LLM is not responding...")
return prompt
return response.choices[0].message.content
prompt = "write a 100 word article on AI"
print(get_completion(prompt))
The generated text:
Artificial Intelligence (AI) is revolutionizing the way we live and work. From self-driving cars to virtual assistants, AI is becoming increasingly integrated into our daily lives. This technology uses algorithms and data to simulate human intelligence, enabling machines to learn, reason, and make decisions like humans.
AI has the potential to greatly improve efficiency and productivity in various industries, such as healthcare, finance, and transportation. It can analyze vast amounts of data in seconds, identify patterns, and make predictions that can help businesses make informed decisions.
However, there are also concerns about the ethical implications of AI, such as privacy issues and job displacement. As AI continues to advance, it is important for society to carefully consider the implications and ensure that this technology is used responsibly for the benefit of all.
2. Llama-2–7B-chat-hf
Let’s install langchain:
pip install langchain
The inference with the same input was run on Meta's Llama 2 using the code below:
Prompt: ‘write a 100 word article on AI’
import sys
import torch
import torch.nn as nn
from pathlib import Path
from langchain.chains import LLMChain
from langchain_community.llms import huggingface_pipeline
from langchain_core.prompts import PromptTemplate
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = 'meta-llama/Llama-2-7b-chat-hf'
hf_aut = "" # enter your huggingface authentication token
## Template generation for the llm.........................
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""
def Template(instruction, new_system_prompt = DEFAULT_SYSTEM_PROMPT ):
SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
template = B_INST + SYSTEM_PROMPT + instruction + E_INST
return template
system_prompt = "You're an expert article writer"
instruction = "write an article on the following topic:\n\n {text}"
template_prompt = Template(instruction, system_prompt)
## class definition to work with llm.....................
class LLM:
def __init__(
self,
model_path = model_path,
hf_aut = hf_aut,
torch_dtype = torch.float16,
top_k = 50,
max_tokens = 3000,
device_map = 'cpu', # use cuda if available
temperature = 0.1,
optimize = True,
) -> None:
self.device_map = device_map
self.torch_dtype = torch_dtype
self.hf_aut = hf_aut
self.model_path = model_path
self.generator = torch.Generator()
self.pipeline = self._load_pipeline(model_path = model_path,
torch_dtype = torch_dtype,
hf_aut = hf_aut,
device_map = device_map,
top_k = top_k,
max_tokens = max_tokens)
self.llm = huggingface_pipeline.HuggingFacePipeline(pipeline = self.pipeline, model_kwargs = {'temperature': temperature}) # a workable pipeline is created
def _load_pipeline(
self, model_path, torch_dtype, hf_aut, device_map, top_k, max_tokens
):
tokenizer = AutoTokenizer.from_pretrained(model_path,
token = hf_aut,)
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map = device_map,
torch_dtype = torch_dtype,
token = hf_aut,
)
pipe = transformers.pipeline( # loading the llm pipeline
task = "text-generation",
model = model,
device_map = device_map,
tokenizer = tokenizer,
return_full_text = True,
max_new_tokens = max_tokens,
do_sample = True,
top_k = top_k,
num_return_sequences = 1,
eos_token_id = tokenizer.eos_token_id
)
return pipe
def clean_output(self, text):
text = text['text']
split = text.split(E_INST)
if len(split) == 0: return " "
elif len(split) == 1: return split[0]
else:
split = split[1:]
cur = ""
for i in split: cur += i
return cur
def generate(self, template, text):
prompt = PromptTemplate(template = template, input_variables = ["text"])
model = LLMChain(prompt = prompt, llm = self.llm) #wrapping up the llm with the template
return self.clean_output(model.invoke(text))
def get_prompt(self, text):
return self.generate(template_prompt, text)
## loading the llm......................
try:
llama = LLM(model_path, hf_aut, top_k = 50, max_tokens = 3000, temperature = 0.1)
## running inference.....................
prompt = "write a 100 word article on AI"
print(llama.get_prompt(prompt))
except Exception:
print("Difficulty loading LLM...\nCan't generate the story now")
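Because the pipeline above is created with return_full_text = True, the raw output repeats the entire prompt; the clean_output method keeps only the completion that follows the closing [/INST] tag. Here is a standalone sketch of that cleanup step (the sample string is illustrative):

```python
E_INST = "[/INST]"

# Standalone version of the cleanup: the pipeline returns the full prompt
# plus the completion, so keep only the text after the closing [/INST] tag.
def clean_output(result):
    text = result["text"]
    split = text.split(E_INST)
    if len(split) <= 1:
        return text  # no [/INST] tag found; return as-is
    return "".join(split[1:])

raw = {"text": "[INST]<<SYS>>\nsystem prompt\n<</SYS>>\n\nwrite something[/INST] Here is the article."}
print(clean_output(raw))
```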
The following article is generated:
Artificial intelligence (AI) is transforming the world as we know it. From virtual assistants to self-driving cars, AI is being used in a wide range of applications. With the ability to learn and adapt, AI is making tasks easier and more efficient. However, there are also concerns about the impact of AI on jobs and society as a whole. As AI continues to advance, it is important to consider the potential benefits and risks and to ensure that its development and use is ethical and responsible.
4.1 Naive approach
At first we started in a very simple manner, generating the story images without base images, i.e. using text-to-image stable diffusion. The process flow diagram for it is:
1. User gives a prompt about the plot of the story and the characters involved.
model_cache = {}

def generate_story(): ## driver function
out = widgets.Output()
model_ids = [
"stabilityai/stable-diffusion-2-1",
"CompVis/stable-diffusion-v1-4",
]
model_dropdown = widgets.Dropdown(options = model_ids, value = model_ids[0], description = "Select Model:",)
prompt_text = widgets.Text(value="", placeholder = "Enter the plot", description = "Story plot:", layout = widgets.Layout(width = "600px"))
layout = widgets.Layout(margin = "10px")
button1 = widgets.Button(description = "Generate Story", button_style = "primary")
button2 = widgets.Button(description = "Clear Story", button_style = "primary")
model_dropdown.layout.width = "50%"
prompt_text.layout.width = "600px"
button1.layout.margin = "0 0 0 100px"
button1.layout.width = "150px"
button2.layout.margin = "0 0 0 300px"
button2.layout.width = "120px"
top_row = widgets.HBox([model_dropdown])
bottom_row = widgets.HBox([prompt_text])
top_box = widgets.VBox([top_row, bottom_row])
user_input_widgets = widgets.HBox([top_box], layout = layout)
bottom_box = widgets.HBox([button1, button2], layout = layout)
display(user_input_widgets)
display(bottom_box)
display(out)
def generate_image(button):
clear_output(wait = True)
print("Creating a new story...")
story = get_story_from_plot(prompt_text.value)
print(f"Story : {story}")
partial_stories = split_paragraphs(story)
for i, parts in enumerate(partial_stories):
print(f"Part {i} : {parts}")
with out:
button.button_style = "warning"
selected_model_index = model_ids.index(model_dropdown.value)
model_id = model_ids[selected_model_index]
model_key = (model_id, "xpu") # key must match the device used below
if model_key not in model_cache:
model_cache[model_key] = Text2ImgModel(model_id, device = "xpu")
model = model_cache[model_key]
prompt = get_prompt(parts)
if not prompt:
prompt = " "
try:
start_time = time.time()
image = model.generate_images(
prompt,
num_inference_steps = 200,
)
display_plot(parts, image)
except KeyboardInterrupt:
print("\nUser interrupted image generation...")
except Exception as e:
print(f"An error occurred: {e}")
finally:
button.button_style = "primary"
def end_story(button):
with out:
clear_output(wait = True)
print("Creating a new story...")
button1.on_click(generate_image)
button2.on_click(end_story)
generate_story()
Running the generate_story function displays a simple GUI where the user can enter a description of the story and click the Generate Story button.
2. A creative story of about 1000–1500 words is generated using an LLM (OpenAI's gpt-3.5-turbo, accessed via API key), as described in the code below.
def get_completion(prompt, text, model = "gpt-3.5-turbo"):
messages = [{"role": "user", "content": prompt}]
response = ""
try:
response = OpenAI.chat.completions.create(model = model, messages = messages, temperature = 0,)
except Exception:
print("\n\nOpenAI LLM is not responding. Proper story can't be generated...")
return text
return response.choices[0].message.content

def get_story_from_plot(plot): ## getting story from gpt-3.5-turbo
message = f"""
Write a creative story for a class of small children based on the plot provided in text delimited by triple backticks in \
3000 words.
Text:
```{plot}```
"""
return get_completion(message, plot)
3. The story is split into different paragraphs of around 150–200 words.
import re

def split_paragraphs(story_text, max_words_per_paragraph=150): ## split paragraph function
if story_text is None:
story_text = "a man"
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', story_text.strip())
current_paragraph = ''
paragraphs = []
for sentence in sentences:
current_paragraph += sentence.strip() + ' '
if len(current_paragraph.split()) > max_words_per_paragraph:
sentences_in_paragraph = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', current_paragraph.strip())
current_paragraph = ' '.join(sentences_in_paragraph[:-1]).strip()
paragraphs.append(current_paragraph)
current_paragraph = sentence.strip() + ' '
if current_paragraph:
paragraphs.append(current_paragraph)
return paragraphs
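To see how split_paragraphs chunks a story, here is a small self-contained run (the function is copied from above; the toy story is illustrative):

```python
import re

# Copy of split_paragraphs from above.
def split_paragraphs(story_text, max_words_per_paragraph=150):
    if story_text is None:
        story_text = "a man"
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', story_text.strip())
    current_paragraph = ''
    paragraphs = []
    for sentence in sentences:
        current_paragraph += sentence.strip() + ' '
        if len(current_paragraph.split()) > max_words_per_paragraph:
            sentences_in_paragraph = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', current_paragraph.strip())
            current_paragraph = ' '.join(sentences_in_paragraph[:-1]).strip()
            paragraphs.append(current_paragraph)
            current_paragraph = sentence.strip() + ' '
    if current_paragraph:
        paragraphs.append(current_paragraph)
    return paragraphs

# A toy "story": 12 identical 10-word sentences (120 words total),
# split with a small limit so the chunking is easy to see.
story = ' '.join(["The fox ran over the hill to find some food."] * 12)
parts = split_paragraphs(story, max_words_per_paragraph=30)
for p in parts:
    print(len(p.split()), "words")  # four paragraphs of 30 words each
```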
4. Through some prompt engineering, each paragraph is sent to the LLM to generate a short, simple prompt defining the characters and surrounding scenario of that paragraph. The LLM is instructed to follow some rules when generating these prompts, making it easier for the stable diffusion model to generate images from small, simple prompts describing that part of the story.
def get_prompt(input):
## prompt engineering for getting the prompt for image generation for a scene
message = f"""
A part of a story will be provided to you and you have to generate a simple prompt that describes the \
scenario in that part of the story such that the part of the story can be explained in an image generated by the prompt generated by you.
Here are some rules you have to follow while generating the prompts:
1. The prompt must be strictly less than 70 words.
2. Don't include special characters other than comma and hyphen and dot.
3. You just have to describe the scenario not write the whole story.
4. Always include "A colored cartoon type sketch of," at the start of every prompt.
5. The very important one, write in crisp and very simple english, don't use complicated words.
6. Separate the different traits of the scenario with commas.
7. If you can't understand the story or text, just write whatever you think the situation could be in the text.
Here are some examples on how to generate the prompt:
Example story paragraph:
Once upon a time, in a not-so-distant future, there lived a man named Alex. Alex was an adventurous soul who dreamed of exploring the great \
unknown: outer space. From a young age, he would gaze up at the stars with wonder, imagining what it would be like to journey among them.
Expected text from you is:
A colored cartoon type sketch of, a man looking up in the sky at night, sky has stars & moon.
Example story paragraph:
As the days turned into weeks, Maya forged friendships with the creatures of the jungle. She shared moments of laughter with mischievous monkeys, \
and learned the ancient wisdom of wise old elephants. Together, they explored hidden caves and winding rivers, each new discovery fueling Maya's sense of wonder.
Expected text from you is:
A colored cartoon type sketch of, A girl laughing with monkeys, old elephants, hidden caves, winding rivers.
Example story paragraph:
In the heart of a bustling metropolis, where skyscrapers kissed the sky and streets hummed with the \
rhythm of life, there existed a city like no other. Its streets were a labyrinth of winding alleys and bustling boulevards, \
lined with towering buildings that reached for the clouds.
Expected prompt generated from you is:
A colored cartoon type sketch of, a metropolitan city, high skyscrapers, streets, sky with clouds.
Example story paragraph:
what's up
Since the paragraph is vague to understand, you can assume that a person is saying what's up to another person, for this the expected \
text generated by you is:
A colored cartoon type sketch of, two person speaking.
Further rules:
Please don't generate more than 70 words, this is a must.
Please note that in all the above examples the generated prompts were less than 20 words; you must also generate prompts strictly less than 70 words.
I just want the prompt from you not the explanation of why you generated that prompt.
Now the actual story paragraph for which the prompt is to be generated is the text delimited by triple backticks
Text:
```{input}```
"""
return get_completion(message, input)
5. The LLM-generated prompt is sent to the Stable Diffusion model to generate an image for each part of the story separately, as described in section 3.
Challenges encountered in this approach:
1. The major challenge is that coherence of colors and characters is not maintained across the story's scenes. The traits of a character shown in the images shouldn't change through the story, but if we generate an image of a man with stable diffusion twice, the man looks different in each image. This happens with every image and spoils the story. The issue is depicted below.
2. The images generated by stable diffusion are not satisfactory even after 150 inference steps.
3. The prompt engineering could have been better.
4.2 Improved approach
It is very difficult to maintain perfect coherence between images, but we can still improve coherence using image-to-image transformation with stable diffusion.
In image-to-image transformation, the output image is developed mostly from the colors of the base image fed to the model. Suppose a scene of the story takes place in a village: we use a base image of a village scene, and stable diffusion uses it to generate a good output for that scene. If another scene of the story also takes place in a village, we reuse the same base image, so the outputs for both scenes have a similar-looking background. In this way we achieve coherence, to some extent, for scenes that share the same background or surroundings.
We've used 7 base images for general backgrounds:
- city
- village
- forest
- mountain
- sea
- room
- space
In general, a scene of the story can take place at two times of day for each background, except the last one (space):
- day
- night
So in total we get 13 combinations (6 × 2 + 1).
We collected a good base image for each of these 13 combinations, to be used while generating the image for a scene in the story.
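One simple way to organize the saved base images is a lookup table keyed by the background number that the LLM returns (the folder and file names below are hypothetical; use whatever names you saved the images under):

```python
from pathlib import Path

# Hypothetical mapping from the LLM's 1-13 background number to a base image.
BASE_IMAGES = {
    1: "city_night.jpg",      2: "city_day.jpg",
    3: "forest_night.jpg",    4: "forest_day.jpg",
    5: "sea_night.jpg",       6: "sea_day.jpg",
    7: "room_night.jpg",      8: "room_day.jpg",
    9: "village_night.jpg",   10: "village_day.jpg",
    11: "mountain_night.jpg", 12: "mountain_day.jpg",
    13: "space.jpg",          # space has no day/night variant
}

def base_image_path(bg_number, folder="base_images"):
    # Fall back to City_day (2) if the model returns anything unexpected.
    name = BASE_IMAGES.get(bg_number, BASE_IMAGES[2])
    return str(Path(folder) / name)

print(base_image_path(10))  # the village daytime base image
print(base_image_path(99))  # unexpected number falls back to city_day
```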
The process flow is depicted below:
The steps of the overall workflow are as follows:
1. User gives a prompt about the plot of the story and the characters involved.
model_cache = {}

def generate_story(): ## driver function
out = widgets.Output()
model_ids = [
"stabilityai/stable-diffusion-2-1",
"CompVis/stable-diffusion-v1-4",
]
model_dropdown = widgets.Dropdown(options = model_ids, value = model_ids[0], description = "Select Model:",)
prompt_text = widgets.Text(value="", placeholder = "Enter the plot", description = "Story plot:", layout = widgets.Layout(width = "600px"))
layout = widgets.Layout(margin = "10px")
button1 = widgets.Button(description = "Generate Story", button_style = "primary")
button2 = widgets.Button(description = "Clear Story", button_style = "primary")
model_dropdown.layout.width = "50%"
prompt_text.layout.width = "600px"
button1.layout.margin = "0 0 0 100px"
button1.layout.width = "150px"
button2.layout.margin = "0 0 0 300px"
button2.layout.width = "120px"
top_row = widgets.HBox([model_dropdown])
bottom_row = widgets.HBox([prompt_text])
top_box = widgets.VBox([top_row, bottom_row])
user_input_widgets = widgets.HBox([top_box], layout = layout)
bottom_box = widgets.HBox([button1, button2], layout = layout)
display(user_input_widgets)
display(bottom_box)
display(out)
def generate_image(button): ## generating a new story
clear_output(wait = True)
print("Creating a new story...")
story = get_story_from_plot(prompt_text.value)
print(f"Story : {story}")
partial_stories = split_paragraphs(story)
for i, parts in enumerate(partial_stories):
print(f"Part {i} : {parts}")
with out:
button.button_style = "warning"
selected_model_index = model_ids.index(model_dropdown.value)
model_id = model_ids[selected_model_index]
model_key = (model_id, "xpu") # key must match the device used below
if model_key not in model_cache:
model_cache[model_key] = Text2ImgModel(model_id, device = "xpu")
model = model_cache[model_key]
prompt = get_prompt(parts)
if not prompt:
prompt = " "
try:
start_time = time.time()
image = model.generate_images(
prompt,
num_inference_steps = 200,
)
display_plot(parts, image)
except KeyboardInterrupt:
print("\nUser interrupted image generation...")
except Exception as e:
print(f"An error occurred: {e}")
finally:
button.button_style = "primary"
def end_story(button):
with out:
clear_output(wait = True)
print("Creating a new story...")
button1.on_click(generate_image)
button2.on_click(end_story)
generate_story()
2. A creative story of about 1000–1500 words is generated using an LLM (OpenAI's gpt-3.5-turbo).
def get_completion(prompt, text, model = "gpt-3.5-turbo"):
messages = [{"role": "user", "content": prompt}]
response = ""
try:
response = OpenAI.chat.completions.create(model = model, messages = messages, temperature = 0,)
except Exception:
print("\n\nOpenAI LLM is not responding. Proper story can't be generated...")
return text
return response.choices[0].message.content

def get_story_from_plot(plot): ## getting story response
message = f"""
Write a creative story for a class of small children based on the plot provided in text delimited by triple backticks in \
3000 words.
Text:
```{plot}```
"""
return get_completion(message, plot)
3. The story is split into different paragraphs of around 150–200 words.
import re

def split_paragraphs(story_text, max_words_per_paragraph=150): ## split paragraph function
if story_text is None:
story_text = "a man"
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', story_text.strip())
current_paragraph = ''
paragraphs = []
for sentence in sentences:
current_paragraph += sentence.strip() + ' '
if len(current_paragraph.split()) > max_words_per_paragraph:
sentences_in_paragraph = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', current_paragraph.strip())
current_paragraph = ' '.join(sentences_in_paragraph[:-1]).strip()
paragraphs.append(current_paragraph)
current_paragraph = sentence.strip() + ' '
if current_paragraph:
paragraphs.append(current_paragraph)
return paragraphs
4. Each paragraph is sent to the LLM to generate a short, simple prompt defining the characters and surrounding scenario of that paragraph. The LLM is instructed to follow some rules when generating these prompts, making it easier for the stable diffusion model to generate images from small, simple prompts describing that part of the story.
def get_prompt(input):
## prompt engineering for getting the prompt for image illustration of the scene
message = f"""
A part of a story will be provided to you and you have to generate a simple prompt that describes the \
scenario in that part of the story such that the part of the story can be explained in an image generated by the prompt generated by you.
Here are some rules you have to follow while generating the prompts:
1. The prompt must be strictly less than 70 words.
2. Don't include special characters other than comma and hyphen and dot.
3. You just have to describe the scenario not write the whole story.
4. Always include "A colored cartoon type sketch of," at the start of every prompt.
5. The very important one, write in crisp and very simple english, don't use complicated words.
6. Separate the different traits of the scenario with commas.
7. If you can't understand the story or text, just write whatever you think the situation could be in the text.
Here are some examples on how to generate the prompt:
Example story paragraph:
Once upon a time, in a not-so-distant future, there lived a man named Alex. Alex was an adventurous soul who dreamed of exploring the great \
unknown: outer space. From a young age, he would gaze up at the stars with wonder, imagining what it would be like to journey among them.
Expected text from you is:
A colored cartoon type sketch of, a man looking up in the sky at night, sky has stars & moon.
Example story paragraph:
As the days turned into weeks, Maya forged friendships with the creatures of the jungle. She shared moments of laughter with mischievous monkeys, \
and learned the ancient wisdom of wise old elephants. Together, they explored hidden caves and winding rivers, each new discovery fueling Maya's sense of wonder.
Expected text from you is:
A colored cartoon type sketch of, A girl laughing with monkeys, old elephants, hidden caves, winding rivers.
Example story paragraph:
In the heart of a bustling metropolis, where skyscrapers kissed the sky and streets hummed with the \
rhythm of life, there existed a city like no other. Its streets were a labyrinth of winding alleys and bustling boulevards, \
lined with towering buildings that reached for the clouds.
Expected prompt generated from you is:
A colored cartoon type sketch of, a metropolitan city, high skyscrapers, streets, sky with clouds.
Example story paragraph:
what's up
Since the paragraph is vague to understand, you can assume that a person is saying what's up to another person, for this the expected \
text generated by you is:
A colored cartoon type sketch of, two person speaking.
Further rules:
Please don't generate more than 70 words, this is a must.
Please note that in all the above examples the generated prompts were less than 20 words; you must also generate prompts strictly less than 70 words.
I just want the prompt from you not the explanation of why you generated that prompt.
Now the actual story paragraph for which the prompt is to be generated is the text delimited by triple backticks
Text:
```{input}```
"""
return get_completion(message, input)
5. Each paragraph is sent to the LLM again to return an integer (between 1 and 13). The number returned is used to retrieve one of the 13 base images already saved in a folder; each base image represents a location/surrounding at either daytime or nighttime. The LLM infers from the paragraph the most probable and suitable location and time of day for the events described, and returns the appropriate number, which is then used to retrieve that base image. This base image influences the colors of the image generated for that paragraph.
def get_bgnumber(para):
## prompt engineering for getting the background image
message = f"""
You will be given a part of a story; you have to read that part and infer the probable and most suitable location/surrounding in which it's taking place.
You can give the output only from the options provided below. You have to give the corresponding integer value from 1 to 13, depending on the location/surrounding you have inferred:
City_night ----> 1
City_day ----> 2
Forest_night ----> 3
Forest_day ----> 4
Sea_night ----> 5
Sea_day ----> 6
Room_night ----> 7
Room_day ----> 8
Village_night ----> 9
Village_day ----> 10
Mountain_night ----> 11
Mountain_day ----> 12
Sky_Space_Universe ----> 13
You have to strictly follow the given rules:
1. Always give the output as an integer between 1 and 13, with no extra string or anything, just an integer number between 1 and 13.
2. If you can infer the location but not whether it's day or night, by default choose the day option, which is an even number, \
e.g. 2, 4, 6, 8, 10, and 12; all these options have day in them, so choose among these based on the location you have inferred.
3. If the location/surrounding isn't obvious from the text provided then by default give output as 2.
Here are some examples on how to generate the required option:
Example story paragraph:
In the heart of a bustling metropolis, where skyscrapers kissed the sky and streets hummed with the \
rhythm of life, there existed a city like no other. Its streets were a labyrinth of winding alleys and bustling boulevards, \
lined with towering buildings that reached for the clouds.
From the above paragraph we can infer that the surrounding is a city, and since it isn't obvious whether \
it's day or night, you have to select day by default. The final surrounding inferred is "City_day", so the expected output generated from you is:
2
Example story paragraph:
As the days turned into weeks, Maya forged friendships with the creatures of the jungle. She shared moments of laughter with mischievous monkeys, \
and learned the ancient wisdom of wise old elephants. Together, they explored hidden caves and winding rivers, each new discovery fueling Maya's sense of wonder.
From the above paragraph we can infer that the surrounding is a forest/jungle, and since it isn't obvious whether \
it's day or night, you have to select day by default. The final surrounding inferred is "Forest_day", so the expected output generated from you is:
4
Example story paragraph:
Once upon a time, in a not-so-distant future, there lived a man named Alex. Alex was an adventurous soul who dreamed of exploring the great \
unknown: outer space. From a young age, he would gaze up at the stars with wonder, imagining what it would be like to journey among them.
From the above paragraph we can infer that the surrounding is Space/Sky/Universe, so the expected output from you is:
13
Example story paragraph:
Under the shimmering canopy of a star-studded night sky, a young boy named Ethan ventures to the edge of the sea. The moon casts a soft, silvery \
glow upon the restless waves, inviting him into their mysterious depths. With bare feet sinking into cool, wet sand, Ethan hesitates for a moment \
before plunging into the salty embrace of the ocean. The water is surprisingly warm, and he feels a surge of exhilaration as he dives beneath the \
surface. Bioluminescent creatures sparkle like underwater stars, illuminating his path as he swims further out. With each stroke, he feels a \
sense of freedom and wonder, lost in the magic of the nocturnal sea. Laughter echoes in the darkness as Ethan dances with the waves, \
creating memories that will linger long after the night fades into dawn.
From the above paragraph we can infer that the surrounding is the sea/ocean and the time is night. The inferred surrounding is "Sea_night", so the expected output from you is:
5
Example story paragraph:
what's up
Since the location/surrounding is difficult to infer from the above text, you can assume that a person is saying what's up to another person; for this the expected \
number generated by you is:
2
Now the actual story paragraph for which the number is to be generated is the text delimited by triple backticks
Text:
```{para}```
"""
return get_completion(message, "10")
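In practice the model occasionally wraps the number in extra text, so it is safer to parse and clamp the reply before indexing into the base-image folder. A minimal sketch, assuming the base images are stored as `./base/1.png` through `./base/13.png` (the folder layout and filenames here are our assumptions, not necessarily the repository's):

```python
import os
import re

def parse_bg_number(raw, lo=1, hi=13, default=2):
    """Extract the first integer from the LLM's reply and clamp it to the
    valid range; fall back to City_day (2) when the reply is unusable."""
    match = re.search(r"\d+", str(raw))
    if not match:
        return default
    number = int(match.group())
    return number if lo <= number <= hi else default

def base_image_path(raw, folder="./base"):
    """Map a (possibly noisy) LLM reply to the path of a saved base image."""
    return os.path.join(folder, f"{parse_bg_number(raw)}.png")
```

With this guard, a reply like "The answer is 5." still resolves to base image 5, and anything unparseable falls back to the City_day default.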
6. The retrieved base image and the generated prompt are sent to the image-to-image Stable Diffusion model to generate an image for each part of the story separately, as described in section 3.
4.3 Using Llama-2–7B-chat-hf
OpenAI limits the number of queries per minute, and we had limited credits. Most of the time it stopped responding, throwing a "your quota is over" error, which we didn't want to persist in our Streamlit app. Hence we decided to use an open-source model. Meta's Llama-2-7B-chat-hf seemed to be the best option given the availability of our resources. Since the Llama 2 model is not as accurate as gpt-3.5-turbo, we changed the rules of the prompt for getting the base image slightly. The steps of the overall workflow are as follows:
- User gives a prompt about the plot of the story and characters involved.
- A creative story is generated using Llama2 in about 1000–1500 words.
- The story is split into different paragraphs of around 150–200 words.
- Each paragraph is sent to the LLM model to generate a simple, straightforward and small prompt defining the characters and surrounding scenario of that paragraph. The LLM model is prompted to follow some rules while generating these prompts so as to make it easier for the Stable Diffusion model to generate images from simple and small prompts describing the part of the story in that paragraph.
- Each paragraph is sent to the LLM model again to guess the background scene (city, village, sea, space, forest, room, etc.) depicted in the paragraph; by default we take the scene as village. We search the LLM's output for the scene keyword. We then make another query to the LLM about when (day or night) the scene is taking place; by default we take the time as day. Combining these two responses, a base image for the current scene is selected.
- The retrieved base image and generated prompt are sent to the image-to-image Stable Diffusion model to generate the final image illustration for each scene of the story separately, as described in section 3.
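The story-splitting step can be sketched as follows. This is a minimal version of the `split_paragraphs` helper (its name matches the one used in the Streamlit app, but the exact word limit and chunking rule shown here are our assumptions):

```python
def split_paragraphs(story, max_words=200):
    """Group the story's newline-separated paragraphs into chunks of at
    most ~max_words words, so each chunk can be illustrated separately."""
    chunks, current, count = [], [], 0
    for para in story.split("\n"):
        if para.strip() == "":
            continue  # skip blank separator lines
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each returned chunk then goes through prompt generation and base-image selection independently.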
Prompt engineering in this case:
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""
base = "./base"
model_dir = "./sd_models"
model_path = 'meta-llama/Llama-2-7b-chat-hf'
hf_aut = "" # put your hugging face authentication token
scene = {"city" : 1, "town" : 1, "urban" : 1, "village" : 3, "rural" : 3,
"forest" : 5, "jungle" : 5, "mountain" : 7, "hill" : 7, "sea" : 9, "ocean" : 9, "aqua" : 9,
"room" : 11, "house" : 11, "space" : 13, "universe" : 13, "cosmos" : 13}
model_ids = [
"runwayml/stable-diffusion-v1-5",
"stabilityai/stable-diffusion-2-1",
]
## Template formation for story generation
system_prompt = "You write long, wonderful and creative stories on a provided topic. \
A short description of the story will be provided to you. \
You have to generate a good story based on the provided description."
instruction = "Write a story on the following topic in 2500 words:\n\n {text}"
template_story = Template(instruction, system_prompt)
system_prompt = """You're an expert prompt generator. The prompt generated by you will be fed to an image generation model. A part of a story will be provided to you, and you have to generate a simple prompt that describes the scenario in that part of the story, such that the part of the story can be illustrated by an image generated from your prompt.
Here are some rules you have to follow while generating the prompts:
1. The prompt must be strictly less than 70 words.
2. Don't include special characters other than comma, hyphen and dot.
3. You just have to describe the scenario, not write the whole story.
4. Always include "A colored cartoon type sketch of," at the start of every prompt.
5. The very important one: write in crisp and very simple English, don't use complicated words.
6. Separate the different traits of the scenario with commas.
7. If you can't understand the story or text, just write whatever you think the situation in the text could be.
Here are some examples on how to generate the prompt:
Example of a story paragraph:
Once upon a time, in a not-so-distant future, there lived a man named Alex. Alex was an adventurous soul who dreamed of exploring the great \
unknown: outer space. From a young age, he would gaze up at the stars with wonder, imagining what it would be like to journey among them.
Expected text from you is:
A colored cartoon type sketch of, a man looking up in the sky at night, sky has stars & moon.
Example story paragraph:
As the days turned into weeks, Maya forged friendships with the creatures of the jungle. She shared moments of laughter with mischievous monkeys, \
and learned the ancient wisdom of wise old elephants. Together, they explored hidden caves and winding rivers, each new discovery fueling Maya's sense of wonder.
Expected prompt from you is:
A colored cartoon type sketch of, A girl laughing with monkeys, old elephants, hidden caves, winding rivers.
Example of another story paragraph:
In the heart of a bustling metropolis, where skyscrapers kissed the sky and streets hummed with the \
rhythm of life, there existed a city like no other. Its streets were a labyrinth of winding alleys and bustling boulevards, \
lined with towering buildings that reached for the clouds.
Expected prompt generated from you is:
A colored cartoon type sketch of, a metropolitan city, high skyscrapers, streets, sky with clouds.
Example story paragraph:
what's up
Since the paragraph is vague to understand, you can assume that a person is saying what's up to another person, for this the expected \
prompt generated by you is:
A colored cartoon type sketch of, two people speaking.
Further rules:
Please don't generate more than 70 words, this is a must.
Please note that in all the above examples the generated prompts were less than 20 words; you must also generate the prompts strictly less than 70 words.
I just want the prompt from you not the explanation of why you generated that prompt.
"""
instruction = "Generate a prompt less than 70 words for the story paragraph:\n\n {text}"
template_prompt = Template(instruction, system_prompt)
## prompt engineering for getting the scene
system_prompt = """You're an expert background scenario recognizer. You will be given a scene of a story; you have to read that scene and recognize the most suitable background or surrounding scenario where the scene is taking place.
You have to give the output ONLY from the options provided below.
1. city
2. village
3. forest
4. mountain
5. sea
6. room
7. space
You should output the backgrounds from these seven options only. You must not output any other background which is not listed above.
If you can't understand the background in the story just output the background as village by default.
Here are some examples on how to generate the required option:
Example story paragraph:
In the heart of a bustling metropolis, where skyscrapers kissed the sky and streets hummed with the \
rhythm of life, there existed a city like no other. Its streets were a labyrinth of winding alleys and bustling boulevards, \
lined with towering buildings that reached for the clouds.
Expected guess:
From the above paragraph we can guess that the surrounding is a city because metropolis and skyscrapers are mentioned. \
Hence the text generated by you is:
From the story I infer the background scene as city
Example story paragraph:
As the days turned into weeks, Maya forged friendships with the creatures of the jungle. She shared moments of laughter with mischievous monkeys, \
and learned the ancient wisdom of wise old elephants. Together, they explored hidden caves and winding rivers, each new discovery fueling Maya's sense of wonder.
Expected guess:
From the above paragraph we can guess that the surrounding is a forest/jungle because of the monkeys, rivers and caves. \
Hence the text generated by you is:
From the story I infer the background scene as forest
Example story paragraph:
Hey man how are you? All good!
Expected guess:
Since the location/surrounding can't be guessed because there is no hint present in the paragraph, you have to provide the default option, which is village.\
Hence the text generated by you is:
From the story I infer the background scene as village
"""
instruction = "Guess the background scene for the story paragraph:\n\n {text}"
template_scene = Template(instruction, system_prompt)
## prompt engineering for getting the time of the day (night or day)
system_prompt = """You're an expert at guessing whether an event is occurring in the day or at night based on the description of the event.\
You will be given a scene of a story which you have to read. After reading the scene you have to tell whether it is taking place in the day or at night.
In case you're not able to guess between night or day, then by default output day.
You must output only one of the following options:
1. day
2. night
You should not output times like: afternoon, morning, evening etc. You have to output only either day or night.
Here are some examples on how to generate the required option:
Example story paragraph:
Once upon a sun-kissed morning, in the heart of a serene village nestled amidst rolling hills and lush greenery, a bustling day began to unfold. \
The village, with its quaint cottages and winding pathways, exuded an aura of tranquility under the clear blue sky.
Expected guess: day
Example story paragraph:
In the afternoon, as the village stirred with activity, Sarah joined her neighbors in the bustling marketplace.\
Amidst stalls laden with fresh produce and the lively chatter of vendors and customers alike, she exchanged greetings and stories with familiar faces,\
weaving the fabric of community that bound them together.
Expected guess: day
Example story paragraph:
Under the velvet embrace of the starlit night sky, a mysterious tale unfolds in the shadowed corners of a forgotten town. \
The moon, a solitary sentinel, casts its silvery glow upon the cobblestone streets, illuminating secrets hidden in the darkness.
Expected guess: night
Example story paragraph:
As the sun dipped below the horizon, painting the sky with hues of crimson and gold, the sleepy town of Willowbrook stirred to life once more. \
In the tranquil streets lined with quaint cottages and flickering lanterns, a tale of love and longing began to unfold.
Expected guess: night
Example story paragraph:
Hey man! what's up.
Expected guess: Since we're not able to guess anything, by default you should output day
"""
instruction = "Guess whether the scene in the following story paragraph takes place in day or night:\n\n {text}"
template_time = Template(instruction, system_prompt)
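The `Template` helper used above is not shown in this excerpt; presumably it wraps the instruction and system prompt into Llama-2's chat format using the `[INST]` and `<<SYS>>` markers defined earlier. A hedged sketch of what it might look like:

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def Template(instruction, system_prompt):
    # Llama-2 chat format: [INST] <<SYS>> {system} <</SYS>> {instruction} [/INST]
    # The "{text}" placeholder inside `instruction` is left intact so that
    # LangChain's PromptTemplate can fill it in later.
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{instruction} {E_INST}"
```

Any function that produces this bracketed layout would behave equivalently; the key point is that the system prompt is embedded inside the `<<SYS>>` block of a single `[INST]` turn.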
## class to load the model and perform text generation operations on it
class LLM:
def __init__(
self,
model_path = model_path,
hf_aut = hf_aut,
torch_dtype = torch.float16,
top_k = 50,
max_tokens = 3000,
device_map = 'xpu',
temperature = 0.1,
optimize = True,
) -> None:
self.device_map = device_map
self.torch_dtype = torch_dtype
self.hf_aut = hf_aut
self.model_path = model_path
self.generator = torch.Generator()
self.pipeline = self._load_pipeline(model_path = model_path,
torch_dtype = torch_dtype,
hf_aut = hf_aut,
device_map = device_map,
top_k = top_k,
max_tokens = max_tokens)
self.llm = huggingface_pipeline.HuggingFacePipeline(pipeline = self.pipeline, model_kwargs = {'temperature': temperature})
def _load_pipeline(
self, model_path, torch_dtype, hf_aut, device_map, top_k, max_tokens
):
tokenizer = AutoTokenizer.from_pretrained(model_path,
token = hf_aut,)
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map = device_map,
torch_dtype = torch_dtype,
token = hf_aut,
)
pipe = transformers.pipeline( # loading the pipeline
task = "text-generation",
model = model,
device_map = device_map,
tokenizer = tokenizer,
return_full_text = True,
max_new_tokens = max_tokens,
do_sample = True,
top_k = top_k,
num_return_sequences = 1,
eos_token_id = tokenizer.eos_token_id
)
return pipe
def clean_output(self, text):
print("Returning the clean output...")
text = text['text']
split = text.split(E_INST)
if len(split) == 0: return " "
elif len(split) == 1: return split[0]
else:
split = split[1:]
cur = ""
for i in split: cur += i
return cur
def generate(self, template, text):
print("Wrapping the model and making template...")
prompt = PromptTemplate(template = template, input_variables = ["text"])
model = LLMChain(prompt = prompt, llm = self.llm)
print("Wrapping complete✅")
return self.clean_output(model.invoke(text))
def check_time(self, text): # searching for the time
if "night" in text: return 0
return 1
def check_scene(self, text): # searching for the scene
for key, value in scene.items():
if key in text: return value
return 3
def get_base(self, text): # combining the outputs of the check_scene() and check_time() we get the base image number
output = self.generate(template_scene, text)
output = output.lower()
sc = self.check_scene(output)
if sc == 13: return sc
output = self.generate(template_time, text)
return sc + self.check_time(output)
def get_story(self, text):
return self.generate(template_story, text)
def get_prompt(self, text):
return self.generate(template_prompt, text)
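The base-image numbering used by `get_base` can be illustrated standalone: each scene keyword maps to an odd number, night keeps that odd value, day adds one, and space (13) has a single variant. A self-contained sketch of the same encoding, mirroring the keyword table and defaults above:

```python
SCENE = {"city": 1, "town": 1, "urban": 1, "village": 3, "rural": 3,
         "forest": 5, "jungle": 5, "mountain": 7, "hill": 7,
         "sea": 9, "ocean": 9, "aqua": 9, "room": 11, "house": 11,
         "space": 13, "universe": 13, "cosmos": 13}

def base_number(scene_reply, time_reply):
    """Combine the scene and time replies into a base-image number.
    Defaults mirror the prompts: village (3) for the scene, day (+1) for the time."""
    sc = next((v for k, v in SCENE.items() if k in scene_reply.lower()), 3)
    if sc == 13:          # space/universe has no day/night variant
        return sc
    return sc + (0 if "night" in time_reply.lower() else 1)
```

So a reply mentioning "city" during the day resolves to 2, "forest" at night to 5, and anything unrecognized falls back to village by day (4).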
try:
llama = LLM(model_path, hf_aut, top_k = 50, max_tokens = 3000, temperature = 0.1, device_map = 'cpu')
print("llama loaded✅")
except Exception as e:
print(f"Difficulty loading LLM: {e}\nCan't generate the story now")
4.4 Streamlit Application:
## streamlit application
import streamlit as st
st.set_page_config(page_title = "app",
page_icon = '🤖',
layout = 'centered',
initial_sidebar_state = 'collapsed')
st.markdown("<h1 style='text-align: center;'>🤖 Image Story GenAI 🏡</h1>", unsafe_allow_html = True)
st.markdown("<h5 style='text-align: justify;'> \
This is a simple story generator. Just enter a brief description of your story below and wait for some time. \
A creative story with image illustrations will be generated.<br>This project uses the Meta Llama-2-7B-chat-hf LLM from Hugging Face \
for text generation and Stable Diffusion models (listed in the selectbox below) for image generation. \
</h5>", unsafe_allow_html = True)
st.write("")
label1 = 'Select the Diffusion Model'
sd_model = st.selectbox(
label1,
(model_ids[0],
model_ids[1])
)
st.write("")
label2 = "Story Description"
story_description = st.text_input(label2)
m = st.markdown("""
<style>
div.stButton > button:first-child {
background-color: rgb(30, 30, 120);
width: 150px
}
</style>""", unsafe_allow_html=True)
col1, col2 = st.columns([3.4, 1])
with col1:
submit_g = st.button('Generate Story')
with col2:
submit_c = st.button('Clear Story')
def display_image(img, width = 600):
st.image(img, width = width)
def display_text(text):
st.write(
f"<div style='text-align: justify;'>"
f"{text}"
"</div>",
unsafe_allow_html=True
)
model_cache = {}
model_cache[(sd_model, "cpu")] = Img2ImgModel(sd_model, device = "cpu") # preload with the same key used in the loop below so the model isn't loaded twice
if submit_g:
display_text("Generating story for you...😊")
display_text("Please be patient. It takes some time to run large models...")
print("Generating story...")
story = llama.get_story(story_description)
print("Story generated✅")
partial_stories = split_paragraphs(story)
for i, parts in enumerate(partial_stories):
model_key = (sd_model, "cpu")
print("checking presence of stable diffusion model....")
if model_key not in model_cache:
model_cache[model_key] = Img2ImgModel(sd_model, device = "cpu")
print("stable diffusion model loaded✅")
prompt = parts
model = model_cache[model_key]
if not prompt:
prompt = "a village man"
display_text(prompt)
try:
start_time = time.time()
print("Generating image for the prompt...")
image = model.generate_images(prompt = prompt,)
print("Image successfully generated✅")
display_image(image)
except:
display_text("An Error Occurred...☠️")
if submit_c:
st.write("Cleared...✅")
We deployed the app on Intel Developer Cloud. The app had very high response times on the CPU. We don't yet have access to a GPU instance, which would achieve much better inference speeds and would let us deploy the app publicly.
4.5 Example of a story generation using Llama-2–7B-chat-hf and stable-diffusion-v1–5:
prompt: ‘A astronaut travelling through space in his spacecraft’
Scene 1: