Vinith Misra March 5, 2023


Note: this light post was drafted in Fall 2022, before the arrival of ControlNet, depth-guided Stable Diffusion, and tuning encoders. In hindsight, the takeaways seem naively understated.

[Two generated images of my daughter with Astro]

This is my daughter with my dog, Astro. They’ve never actually met, since Astro passed away in 2016, but this is exactly how I’d imagine them together (aside from the seven fingers on her hand).

These images came out of some experiments with finetuning Stable Diffusion on tiny image datasets (i.e., Dreambooth), and specifically with finetuning on multiple subjects.

<aside> 💡 TLDR #1: Trained evaluator models can bring a quantitative lens to bear on some of the typically subjective questions posed in this space.

</aside>

<aside> 💡 TLDR #2: Finetuning diffusion models remains more of an art than a science, and the limits of what is possible (e.g. multiple subjects) vary significantly across image varieties and choices of subject.

</aside>

<aside> 💡 TLDR #3: We’ll be working through the product consequences of inpainting, controlled generation, and finetuning for quite some time.

</aside>

Background on single-subject finetuning (Dreambooth)

The secret sauce behind most diffusion-based image-generation products in the wild right now, and behind most experiments from the residents of r/stablediffusion, is a single very cute project called Dreambooth. The name is cute, the technique is cute, and even the examples they choose are cute.


This is really just finetuning alongside some prompt engineering: if I pair pictures of my dog "Astro" with the label "photo of dog Astro" and finetune SD on this dataset, the finetuned model should learn to associate the term "Astro" with Astro's visual appearance. As always, there are a host of complications and nuances in avoiding overfitting and underfitting (the paper has some good nuggets, and this is a decent practical round-up of advice).
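To make that concrete, here's a minimal sketch of the core finetuning loop using Hugging Face's diffusers library. This is illustrative rather than the exact Dreambooth recipe (it omits the paper's prior-preservation loss, among other tricks); the model ID, the hyperparameters, and the `dataloader` of subject photos are all assumptions.

```python
# Minimal sketch of Dreambooth-style finetuning with Hugging Face diffusers.
# Assumes `dataloader` yields batches of subject photos as tensors of shape
# (B, 3, 512, 512), normalized to [-1, 1]. Omits prior preservation and
# other tricks from the paper; hyperparameters are illustrative, not tuned.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # only the UNet is trained here
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

# Pairing the subject photos with this one fixed caption is the whole trick.
prompt_ids = tokenizer(
    "photo of dog Astro",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids

for pixel_values in dataloader:
    # Encode images to latents, add noise at a random timestep, and train
    # the UNet to predict that noise given the caption.
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (latents.shape[0],)
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    text_embeds = text_encoder(prompt_ids.repeat(latents.shape[0], 1))[0]

    noise_pred = unet(
        noisy_latents, timesteps, encoder_hidden_states=text_embeds
    ).sample
    loss = F.mse_loss(noise_pred, noise)  # standard epsilon-prediction loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```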

It remains striking how few images of a subject one needs, and how little compute one can make do with - a handful of examples and a single consumer-grade GPU are enough to produce compelling outputs.
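And once finetuned, generation is just a prompt away. Assuming the tuned weights were saved out as a diffusers pipeline (the local path below is hypothetical):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the finetuned weights (hypothetical path) and prompt with the
# learned subject token.
pipe = StableDiffusionPipeline.from_pretrained(
    "./astro-dreambooth", torch_dtype=torch.float16
).to("cuda")
image = pipe("photo of dog Astro playing with a little girl").images[0]
image.save("astro.png")
```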

Evaluation and tuning

Evaluation doesn’t sound as exciting as a new architecture, but in reality it’s the axle around which the entire ML R&D process turns. Consider, for instance, the design space we’re operating in here: