Generative AI is now mainstream. Many of us use tools like DALL·E and Midjourney to create images for work. While building an application on top of these models, I ran into a limitation:

How do I ask these models to generate an image of a specific entity in a certain setting? For example: How do I generate a picture of my puppy sitting on a stool eating pasta? Not any puppy, MY puppy!

After some research, I came across a great solution that combines Stable Diffusion and DreamBooth. Stable Diffusion is an open-source machine learning model that generates images from text prompts.

DreamBooth is a technique for fine-tuning models like Stable Diffusion, teaching them to associate a specific word with an entity. This fine-tuning is possible because Stable Diffusion is open source, giving us access to its architecture and weights.

Enough theory, let’s practice. I will now use DreamBooth to train a Stable Diffusion model to associate pictures of my puppy with the word hiccupmypup. Fire up your Google Colab notebooks!

  1. Install dependencies

    !pip install diffusers transformers scipy ftfy accelerate
    !pip install bitsandbytes


  2. Clone the Hugging Face diffusers repository. We use the diffusers library both to fine-tune the model and to generate images, and the repository contains the DreamBooth training script we will run.

    !git clone https://github.com/huggingface/diffusers
    !pip install diffusers/.
    !pip install -r diffusers/examples/dreambooth/requirements_sd3.txt



  3. Create a training_data folder to store the images used to fine-tune the model. Additionally, create a folder called trained_model where the fine-tuned model will be written.

    Upload pictures of your subject to the training_data folder. In my case, these are pictures of my puppy Hiccup (a minimal Colab upload sketch follows the commands below).



    !mkdir trained_model
    !mkdir training_data
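
    If you are working in Google Colab, one way to get your pictures into training_data is the notebook's file-upload helper. This is a minimal sketch, assuming a Colab runtime (the folder name training_data matches the step above):

    import shutil
    from google.colab import files  # only available inside a Colab runtime

    # Opens a file picker; uploaded files land in the current working directory
    uploaded = files.upload()

    # Move each uploaded image into the training_data folder created above
    for filename in uploaded:
        shutil.move(filename, f"training_data/{filename}")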


  4. We will be using the stabilityai/stable-diffusion-3-medium-diffusers model as the base model in this exercise. This model is hosted on Hugging Face as a gated model, so we need to do the following to use it:

    • Create an account on Hugging Face

    • Request access to this gated model (link)

    • Create a user access token and save a copy of it locally (link)

    Once the above steps are done, run the code below. You will be prompted for a token; paste the user access token you generated (a non-interactive alternative is sketched after the code).

    from accelerate.utils import write_basic_config
    # Write a default accelerate config for single-GPU training
    write_basic_config()
    
    from huggingface_hub import login
    # Prompts for your Hugging Face user access token
    login()
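
    If you prefer not to paste the token interactively each run, huggingface_hub's login() also accepts the token as an argument. This is a minimal sketch, assuming you have stored your token in an environment variable named HF_TOKEN (a name chosen here for illustration):

    import os
    from huggingface_hub import login

    # Assumes the access token was exported as HF_TOKEN beforehand
    login(token=os.environ["HF_TOKEN"])
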
  5. Time to begin fine-tuning. Let’s take a look at some of the parameters we have used:

    • instance_prompt: a text prompt describing the training images that includes the unique identifier (here, hiccupmypup) the model will learn to associate with the subject.

    • max_train_steps & learning_rate: parameters you will need to tune to get the best results for your input images.
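
    DreamBooth fine-tuning is memory-hungry, which is why the command below uses --gradient_checkpointing and --use_8bit_adam. Before launching, it can help to confirm which GPU your Colab runtime was assigned. A minimal check, assuming a CUDA GPU is attached:

    import torch

    # Report the GPU assigned to this runtime and its total memory
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")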

    !accelerate launch diffusers/examples/dreambooth/train_dreambooth_sd3.py \
          --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" \
          --instance_data_dir="training_data" \
          --instance_prompt="Photo of a dog named hiccupmypup" \
          --output_dir="trained_model" \
          --resolution=512 \
          --train_batch_size=1 \
          --gradient_accumulation_steps=1 \
          --learning_rate=5e-7 \
          --lr_scheduler="constant" \
          --lr_warmup_steps=0 \
          --max_train_steps=1200 \
          --gradient_checkpointing \
          --use_8bit_adam
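
    Once the command finishes, you can sanity-check that the fine-tuned weights were actually written to trained_model. This is just a directory listing; the exact files and subfolders depend on the version of the training script:

    import os

    # Expect model_index.json plus component subfolders (exact layout varies by script version)
    for name in sorted(os.listdir("trained_model")):
        print(name)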


  6. Now that the training is done, a new model will have been generated in the trained_model folder. Let’s use this model to generate fun images!

    • Pass a prompt as a parameter to the pipeline to get the desired output

    • Try playing with the num_inference_steps and guidance_scale parameters (see the sketch after the code below)


    from diffusers import DiffusionPipeline
    import torch
    import time
    import random
    
    
    # Set seed for reproducibility
    seed = 0
    torch.manual_seed(seed)
    random.seed(seed)
    
    # Load the fine-tuned model from the trained_model folder onto the GPU
    pipeline = DiffusionPipeline.from_pretrained("trained_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
    images = pipeline("A dog named hiccupmypup sitting on a stool eating pasta", num_inference_steps=28, guidance_scale=7.5).images
    
    # Save each generated image with an index and a timestamp in the filename
    for idx, image in enumerate(images, start=1):
        image.save("hiccup_" + str(idx) + "_" + str(round(time.time() * 1000)) + ".png")
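
    For example, a quick sweep over guidance_scale makes it easy to compare how strongly the prompt is enforced. This is a minimal sketch reusing the pipeline loaded above; the specific values are just illustrative:

    # Sweep guidance_scale and save one image per value for comparison
    prompt = "A dog named hiccupmypup sitting on a stool eating pasta"
    for scale in (3.5, 7.5, 10.0):
        image = pipeline(prompt, num_inference_steps=28, guidance_scale=scale).images[0]
        image.save(f"hiccup_scale_{scale}.png")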


Here are some images of Hiccup I generated:

Hopefully, in the future we will not need to fine-tune models in order to generate pictures of specific entities. NVIDIA recently released a paper that promises just that: link

Please reach out to us at info@betacrew.io for help building cutting-edge GenAI applications.

Written by Ronil Mehta

Building BetaCrew Labs | Always curious, always learning | Meta & Bloomberg Alumnus

Let's Do The Impossible Together

Ready to tackle your toughest challenges? Book an exploration call with us today and discover how we can solve your problems.

  • Fractional CTO

  • Product Lab

  • Software Implementation

  • Technology Consulting

  • Operational Excellence

  • Software Solutions

  • Modernizing Systems
