Generative AI is now mainstream. Many of us use tools like DALL·E and Midjourney to create images for work. While building an application on top of these models, I ran into a limitation:

How do I ask these models to generate an image of a specific entity in a certain setting? For example: How do I generate a picture of my puppy sitting on a stool eating pasta? Not any puppy, MY puppy!

After some research, I came across a great solution that combines Stable Diffusion and DreamBooth. Stable Diffusion is an open-source machine learning model that generates images from text prompts.

DreamBooth is a technique for fine-tuning models like Stable Diffusion, teaching them to associate a specific word with an entity. This fine-tuning is possible because Stable Diffusion is open source, giving us access to its architecture and weights.

Enough theory, let’s practice. I will now use DreamBooth to train a Stable Diffusion model to associate pictures of my puppy with the word hiccupmypup. Fire up your Google Colab notebooks!

  1. Install dependencies

    !pip install diffusers transformers scipy ftfy accelerate
    !pip install bitsandbytes


  2. Clone the Hugging Face diffusers repository. We use the diffusers library both to fine-tune the model and to generate images, and the repository contains the DreamBooth training script we will run.

    !git clone https://github.com/huggingface/diffusers
    !pip install diffusers/.
    !pip install -r diffusers/examples/dreambooth/requirements_sd3.txt



  3. Create a training_data folder to store the images used to fine-tune the model. Additionally, create a folder called trained_model where the fine-tuned model will be written.

    Upload pictures of your subject to the training_data folder. In my case, these are pictures of my puppy Hiccup (a minimal Colab upload sketch follows the commands below).



    !mkdir trained_model
    !mkdir training_data
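
    If you are working in Google Colab, one way to get your pictures into training_data is the notebook's file-upload helper. This is a minimal sketch, assuming a Colab runtime (the folder name training_data matches the step above):

    import shutil
    from google.colab import files  # only available inside a Colab runtime

    # Opens a file picker; uploaded files land in the current working directory
    uploaded = files.upload()

    # Move each uploaded image into the training_data folder created above
    for filename in uploaded:
        shutil.move(filename, f"training_data/{filename}")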


  4. We will be using the stabilityai/stable-diffusion-3-medium-diffusers model as the base model in this exercise. This model is hosted on Hugging Face as a gated model, so we need to do the following to use it:

    • Create an account on Hugging Face

    • Request access to this gated model (link)

    • Create a user access token and save a copy of it locally (link)

    Once the above steps are done, run the code below. You will be prompted for a token; paste the user access token you generated (a non-interactive alternative is sketched after the code).

    from accelerate.utils import write_basic_config
    # Write a default accelerate config for single-GPU training
    write_basic_config()
    
    from huggingface_hub import login
    # Prompts for your Hugging Face user access token
    login()
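
    If you prefer not to paste the token interactively each run, huggingface_hub's login() also accepts the token as an argument. This is a minimal sketch, assuming you have stored your token in an environment variable named HF_TOKEN (a name chosen here for illustration):

    import os
    from huggingface_hub import login

    # Assumes the access token was exported as HF_TOKEN beforehand
    login(token=os.environ["HF_TOKEN"])
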
  5. Time to begin fine-tuning. Let’s take a look at some of the parameters we have used:

    • instance_prompt: a text prompt describing the training images that includes the unique identifier (here, hiccupmypup) the model will learn to associate with the subject.

    • max_train_steps & learning_rate: parameters you will need to tune to get the best results for your input images.
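
    DreamBooth fine-tuning is memory-hungry, which is why the command below uses --gradient_checkpointing and --use_8bit_adam. Before launching, it can help to confirm which GPU your Colab runtime was assigned. A minimal check, assuming a CUDA GPU is attached:

    import torch

    # Report the GPU assigned to this runtime and its total memory
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")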

    !accelerate launch diffusers/examples/dreambooth/train_dreambooth_sd3.py \
          --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" \
          --instance_data_dir="training_data" \
          --instance_prompt="Photo of a dog named hiccupmypup" \
          --output_dir="trained_model" \
          --resolution=512 \
          --train_batch_size=1 \
          --gradient_accumulation_steps=1 \
          --learning_rate=5e-7 \
          --lr_scheduler="constant" \
          --lr_warmup_steps=0 \
          --max_train_steps=1200 \
          --gradient_checkpointing \
          --use_8bit_adam
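
    Once the command finishes, you can sanity-check that the fine-tuned weights were actually written to trained_model. This is just a directory listing; the exact files and subfolders depend on the version of the training script:

    import os

    # Expect model_index.json plus component subfolders (exact layout varies by script version)
    for name in sorted(os.listdir("trained_model")):
        print(name)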


  6. Now that the training is done, a new model will have been generated in the trained_model folder. Let’s use this model to generate fun images!

    • Pass a prompt as a parameter to the pipeline to get the desired output

    • Try playing with the num_inference_steps and guidance_scale parameters (see the sketch after the code below)


    from diffusers import DiffusionPipeline
    import torch
    import time
    import random
    
    
    # Set seed for reproducibility
    seed = 0
    torch.manual_seed(seed)
    random.seed(seed)
    
    # Load the fine-tuned model from the trained_model folder onto the GPU
    pipeline = DiffusionPipeline.from_pretrained("trained_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
    images = pipeline("A dog named hiccupmypup sitting on a stool eating pasta", num_inference_steps=28, guidance_scale=7.5).images
    
    # Save each generated image with an index and a timestamp in the filename
    for idx, image in enumerate(images, start=1):
        image.save("hiccup_" + str(idx) + "_" + str(round(time.time() * 1000)) + ".png")
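
    For example, a quick sweep over guidance_scale makes it easy to compare how strongly the prompt is enforced. This is a minimal sketch reusing the pipeline loaded above; the specific values are just illustrative:

    # Sweep guidance_scale and save one image per value for comparison
    prompt = "A dog named hiccupmypup sitting on a stool eating pasta"
    for scale in (3.5, 7.5, 10.0):
        image = pipeline(prompt, num_inference_steps=28, guidance_scale=scale).images[0]
        image.save(f"hiccup_scale_{scale}.png")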


Here are some images of Hiccup I generated:

Hopefully, in the future we will not need to fine-tune models in order to generate pictures of specific entities. NVIDIA recently released a paper that promises just that: link

Please reach out to us at info@betacrew.io for help building cutting-edge GenAI applications.

Written by Ronil Mehta

Building BetaCrew Labs | Always curious, always learning | Meta & Bloomberg Alumnus

Let's Do The Impossible Together

Ready to tackle your toughest challenges? Book an exploration call with us today and discover how we can solve your problems.

  • Fractional CTO

  • Product Lab

  • Software Implementation

  • Technology Consulting

  • Operational Excellence

  • Software Solutions

  • Modernizing Systems
