CLIP Guided Stable Diffusion (outdated, new guide coming soon)

maxine
4 min read · Sep 16, 2022


EDIT: I’ve overhauled the entire codebase! It’s different now, but everything should be self-explanatory. Guides for parameters that existed in the old version should still be accurate. (Note: there is no prompt weighting as of now; I’m in the process of re-writing the CLIP code to accommodate it at the encoding level.) The rest of this article will remain unchanged.

I’ve created a new notebook! Building off of Jonathan Whitaker’s “Grokking Stable Diffusion,” I bring you… Doohickey 🙃, an almost total beginners’ guide (assuming you know how to navigate websites with average proficiency).

an image generated with the notebook that I’m using for my profile picture! It was made using the CLIP ViT-H-14 model for classifier guidance

None of the public notebooks that let you use Stable Diffusion really called to me, so I made my own, fully featured with CLIP text/image guidance (including the new SOTA ViT-H/14 and B/14 models from LAION, https://laion.ai/blog/large-openclip/), Textual Inversion (https://arxiv.org/abs/2208.01618), attention slicing for memory-efficient sampling, Perlin/image inits, LPIPS guidance for the inits, and many more features to come.
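Attention slicing, for instance, is the same memory trade-off the diffusers library exposes on its stock pipeline. Doohickey uses its own generation loop, but a minimal sketch of the idea with the off-the-shelf pipeline (assuming a reasonably recent diffusers version) looks like this:

```
from diffusers import StableDiffusionPipeline
import torch

# Stock pipeline, not Doohickey's custom loop; use_auth_token needs the
# huggingface token described later in this guide.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_auth_token=True,
).to("cuda")

# Compute attention in smaller chunks to save VRAM, at a small speed cost
pipe.enable_attention_slicing()
```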

You can use it here (https://colab.research.google.com/github/aicrumb/doohickey/blob/main/Doohickey_Diffusion.ipynb) on a free GPU instance from Google Colab.

this is another image from the notebook, very quickly thrown together, because I read somewhere that the more pictures you have in your Medium article the better the reading retention

USAGE GUIDE!

The first three paragraphs are about signing up to huggingface, if you already have a huggingface account with a token that has either read or write access, skip these.

Guide time! For a simple start, if you aren’t familiar with Colab or IPython notebooks, go here for the welcome page: https://colab.research.google.com/?utm_source=scs-index

If you are familiar (or don’t care), open the notebook itself from the link above. The first cell just installs libraries and logs into huggingface. You will need an account on https://huggingface.co/ and you will need to agree to the Stable Diffusion terms at https://huggingface.co/CompVis/stable-diffusion-v1-4 .

After that, go to your settings at https://huggingface.co/settings/tokens and create a token with either the “write” or “read” role. This is the token you will use to log into the notebook.

Then you can just hit the play button to the left of the first cell in the notebook, and a GUI will open to log you in.
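If you want to see roughly what that first cell is doing (or run it outside Colab), it boils down to something like this sketch; the exact packages and versions pinned in the notebook may differ:

```
# Rough equivalent of the first cell: install dependencies and log in.
# !pip install diffusers transformers huggingface_hub

from huggingface_hub import notebook_login

# Opens the same login widget; paste the "read" or "write" token from your settings page
notebook_login()
```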

“Import libraries” and “Set up generation loop” don’t need any changes; you can hit the play button on those too after logging in.

The fifth cell deals with Textual Inversion. It’s not required to change anything here, but if you have a pretrained textual-inversion concept on the huggingface hub, you can load it into this notebook by putting the user id and concept name inside the “specific_concepts” list. For example, if your concept is in the sd-concepts-library, the list might look something like

["sd-concepts-library/my-concept"]

If you don’t know what textual inversion is, there’s a notebook that will introduce you and let you train a concept of your own: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb
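Under the hood, loading a concept mostly amounts to downloading its learned embedding and registering a new token with the tokenizer and text encoder. Here’s a rough sketch of that pattern (not the notebook’s exact code, and "sd-concepts-library/my-concept" is a placeholder repo id):

```
import torch
from huggingface_hub import hf_hub_download
from transformers import CLIPTextModel, CLIPTokenizer

# Stable Diffusion v1-4's text encoder is CLIP ViT-L/14
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Download the learned embedding, typically a dict like {"<my-concept>": tensor}
embed_path = hf_hub_download("sd-concepts-library/my-concept", "learned_embeds.bin")
learned_embeds = torch.load(embed_path, map_location="cpu")

for token, embedding in learned_embeds.items():
    tokenizer.add_tokens(token)                           # register the placeholder token
    text_encoder.resize_token_embeddings(len(tokenizer))  # make room for it
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embedding
```

After that, the placeholder token (e.g. "&lt;my-concept&gt;") can be used directly in a prompt.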

There are some filler cells with tips and tricks, but after those there’s a giant block titled “Generate”. This is where your prompt goes, where you set the size of the image to be generated, and where you enable CLIP Guidance. CLIP Guidance can increase the quality of your image a little, and a well-known example of CLIP Guided Stable Diffusion is Midjourney (if Emad’s AMA answers are true). It is not required, though, as it slows down generation by around 5x.
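For the curious, the gist of CLIP guidance is a classifier-guidance-style gradient step inside the sampling loop: approximate an image from the current latents, score it against the prompt with CLIP, and nudge the latents toward higher similarity. A very simplified sketch, where decode_to_image, clip_preprocess, and the sigma-squared scaling are placeholders/conventions rather than the notebook’s actual code:

```
import torch

def clip_guidance_step(latents, sigma, text_embed, clip_model, clip_preprocess,
                       decode_to_image, guidance_scale=100.0):
    # Track gradients with respect to the current latents
    latents = latents.detach().requires_grad_(True)
    image = decode_to_image(latents)                   # approx. decoded image in [0, 1]
    image_embed = clip_model.encode_image(clip_preprocess(image))
    image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
    # Maximize cosine similarity between the image and the prompt embedding
    loss = -(image_embed * text_embed).sum()
    grad = torch.autograd.grad(loss, latents)[0]
    # Step the latents against the loss gradient (the scaling here is a common
    # convention, not necessarily what the notebook uses)
    return latents.detach() - guidance_scale * sigma**2 * grad
```

That extra CLIP forward and backward pass on every step is where the roughly 5x slowdown comes from.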

lo! but what could that be but another reading retention picture!

For beginners: change “prompt” to the text you want to turn into an image, then hit the play button next to this cell too. If there’s a checkmark next to “classifier_guidance,” uncheck it; it just makes things slow.

For non-beginners: every parameter is explained in a little detail in the notebook. There’s init image support (I’m not sure it works how it’s supposed to; if you find a problem, submit a PR or Issue at https://github.com/aicrumb/doohickey). If you’re running locally and/or using a GPU that supports BFloat16, change the dtype variable to torch.bfloat16 for up to a 3x speed increase. The GitHub repo also has details for the parameters regarding the new H/14 CLIP model.
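For the dtype change, a hedged one-liner (the variable name dtype matches the notebook; the capability check is my addition):

```
import torch

# Use BFloat16 where the GPU supports it (e.g. Ampere or newer), otherwise fall back
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
```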

That’s it! This was more of a blog post detailing how to use the tool rather than how it works. If you have questions about specific details in the notebook, either reply to this or send me a message. 👋👋

(I could’ve had this out earlier but I don’t have a very fast machine! I’m just using the free colab tier to develop. If anyone wants to sponsor me 🙃)
