An Insight into Stable Diffusion

Jan 18, 2023, by Muhamed Risvan

Data Science

Stable Diffusion is a latent text-to-image diffusion model capable of generating photorealistic images from text input. Given a text prompt, it generates an image as output. It was developed by engineers and researchers from LAION, Stability AI, and CompVis.

Stable Diffusion is trained on images from a subset of the LAION-5B database, currently the largest freely accessible multimodal dataset.


  • This technique can be used to create and alter images in response to text cues.
  • It is a text-to-image diffusion model with strong language comprehension and a high degree of photorealism.
  • It is a Latent Diffusion Model which combines an autoencoder with a diffusion model trained in the autoencoder’s latent space.
  • Images are transformed into latent representations with the use of an encoder.
  • A CLIP ViT-L/14 text encoder is used to encode text prompts.
  • The optimizer used to train this model is AdamW.


LAION-5B is a large-scale research dataset of 5.85B CLIP-filtered image-text pairs. 2.3B samples are in English, 2.2B samples come from more than 100 other languages, and 1.3B samples have texts that cannot be assigned to a particular language.

Models like FLORENCE, ALIGN, and BASIC need billions of image-text pairs to perform competitively, but until LAION-5B there was no freely accessible dataset of this size. LAION-5B is considerably larger than the datasets used to train models like CLIP and DALL-E.

The columns included in the dataset are:

  • URL: The URL of the image
  • TEXT: The caption (alt text) associated with the image
  • WIDTH: Width of the image
  • HEIGHT: Height of the image
  • LANGUAGE: Language of the text
  • SIMILARITY: Cosine similarity between the text and image embeddings
  • WATERMARK: Probability of being a watermarked image
  • UNSAFE: Probability that the image is unsafe (NSFW)
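
To make the role of these columns concrete, here is a minimal sketch of how metadata rows could be filtered before training. The rows, URLs, and thresholds below are all made up for illustration; they are not the values actually used to build the Stable Diffusion training subset.

```python
# Toy rows shaped like the LAION-5B metadata columns listed above.
# All values and URLs are invented for illustration.
rows = [
    {"URL": "https://example.com/a.jpg", "TEXT": "a red bicycle", "WIDTH": 1024, "HEIGHT": 768,
     "LANGUAGE": "en", "SIMILARITY": 0.34, "WATERMARK": 0.05, "UNSAFE": 0.01},
    {"URL": "https://example.com/b.jpg", "TEXT": "small thumbnail", "WIDTH": 300, "HEIGHT": 200,
     "LANGUAGE": "en", "SIMILARITY": 0.33, "WATERMARK": 0.02, "UNSAFE": 0.01},
    {"URL": "https://example.com/c.jpg", "TEXT": "stock photo of a city", "WIDTH": 800, "HEIGHT": 800,
     "LANGUAGE": "en", "SIMILARITY": 0.31, "WATERMARK": 0.90, "UNSAFE": 0.01},
    {"URL": "https://example.com/d.jpg", "TEXT": "click here", "WIDTH": 600, "HEIGHT": 900,
     "LANGUAGE": "en", "SIMILARITY": 0.20, "WATERMARK": 0.03, "UNSAFE": 0.01},
    {"URL": "https://example.com/e.jpg", "TEXT": "a dog on a beach", "WIDTH": 512, "HEIGHT": 512,
     "LANGUAGE": "en", "SIMILARITY": 0.30, "WATERMARK": 0.10, "UNSAFE": 0.02},
]


def keep(row, min_side=512, min_similarity=0.28, max_watermark=0.5, max_unsafe=0.1):
    """Return True if a row passes all (hypothetical) quality thresholds."""
    return (
        min(row["WIDTH"], row["HEIGHT"]) >= min_side   # large enough to train on
        and row["SIMILARITY"] >= min_similarity        # caption matches the image
        and row["WATERMARK"] <= max_watermark          # probably not watermarked
        and row["UNSAFE"] <= max_unsafe                # probably safe
    )


selected = [row["URL"] for row in rows if keep(row)]
print(selected)
```

In this toy sample only rows `a` and `e` pass: `b` is too small, `c` is likely watermarked, and `d` has a caption that does not match its image well.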

How does the model work?

  • Latent Diffusion, a specific kind of diffusion model, was proposed in the paper “High-Resolution Image Synthesis with Latent Diffusion Models”.
  • By operating in a lower-dimensional latent space rather than the actual pixel space, latent diffusion can reduce memory and computational complexity.
  • For instance, Stable Diffusion’s autoencoder has a reduction factor of 8.
  • In other words, an image of shape (3, 512, 512) is mapped to a 64 x 64 latent (with 4 channels in Stable Diffusion), which has 8 x 8 = 64 times fewer spatial positions and needs correspondingly less memory. This is the reason why 16GB Colab GPUs can generate 512 x 512 images so quickly.
  • There are three main components in latent diffusion: an autoencoder (VAE), a U-Net, and a text encoder.
  • All the elements needed to set up a full diffusion pipeline are included in the pre-trained model, each in its own folder.
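
The arithmetic behind the reduction factor can be checked with a few lines of Python. This is only a sketch: the helper names `latent_hw` and `spatial_reduction` are hypothetical, and the code looks at the spatial dimensions only, leaving the number of latent channels aside.

```python
def latent_hw(height, width, factor=8):
    """Spatial size of the latent for an image of the given pixel size."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the reduction factor")
    return height // factor, width // factor


def spatial_reduction(height, width, factor=8):
    """How many times fewer spatial positions the latent has than the image."""
    latent_h, latent_w = latent_hw(height, width, factor)
    return (height * width) // (latent_h * latent_w)


print(latent_hw(512, 512))          # (64, 64)
print(spatial_reduction(512, 512))  # 64
```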

Text encoder: The text encoder turns the input prompt, such as “An astronaut riding a horse”, into an embedding space that the U-Net can understand. It is typically a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.

While other diffusion models may employ different encoders such as BERT, Stable Diffusion makes use of CLIP. Stable Diffusion does not train the text encoder during training and simply uses CLIP’s already trained text encoder (CLIPTextModel).
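
The shape of what the text encoder produces can be illustrated with a toy stand-in. The sketch below is not CLIP: `toy_tokenize` is a hypothetical whitespace tokenizer (CLIP uses byte-pair encoding) and `toy_embed` is a random lookup table rather than a transformer. It only shows that a prompt is padded to a fixed length of 77 tokens and that each token ends up as a 768-dimensional vector, as in the CLIP ViT-L/14 text encoder.

```python
import random

MAX_LENGTH = 77   # fixed sequence length of CLIP's text encoder
EMBED_DIM = 768   # hidden size of the ViT-L/14 text encoder


def toy_tokenize(prompt, max_length=MAX_LENGTH, pad_id=0):
    """Hypothetical whitespace tokenizer; NOT CLIP's real byte-pair encoding."""
    vocab = {}
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in prompt.lower().split()]
    ids = ids[:max_length]                          # truncate long prompts
    return ids + [pad_id] * (max_length - len(ids)) # pad short prompts


def toy_embed(token_ids, dim=EMBED_DIM, seed=0):
    """Map each token id to a fixed random vector (stand-in for the transformer)."""
    rng = random.Random(seed)
    table = {}
    embeddings = []
    for token_id in token_ids:
        if token_id not in table:
            table[token_id] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        embeddings.append(table[token_id])
    return embeddings


ids = toy_tokenize("An astronaut riding a horse")
embeddings = toy_embed(ids)
print(len(embeddings), len(embeddings[0]))  # 77 768
```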

U-Net: The U-Net is the model that predicts the noise in a latent image representation. It has an encoder part and a decoder part, both composed of ResNet blocks. The encoder compresses an image representation into a lower-resolution representation, and the decoder decodes it back to the original, higher-resolution representation, which is supposedly less noisy. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation.

Shortcut connections are typically added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder to prevent the U-Net from losing crucial information while downsampling. The Stable Diffusion U-Net can also condition its output on text embeddings through cross-attention layers. These layers are usually added between ResNet blocks in both the encoder and decoder parts of the U-Net.
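
How a predicted noise residual leads to a denoised representation can be sketched with the noise-prediction formulation common in DDPM-style models. The formula is an assumption here, since the text above does not spell it out: a noisy sample is modelled as x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, so a predicted ε can be inverted to an estimate of x_0. The function names and numbers are illustrative only.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Forward relation: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar):
    """Invert the forward relation using the predicted noise residual."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]


x0 = [0.5, -1.0, 0.25]    # toy "clean" latent values
eps = [0.1, -0.3, 0.8]    # toy noise
alpha_bar = 0.6           # fraction of signal variance kept at this step

x_t = add_noise(x0, eps, alpha_bar)
# With a perfect noise prediction, the clean values are recovered exactly
# (up to floating-point error); a real U-Net only approximates eps.
recovered = predict_x0(x_t, eps, alpha_bar)
```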

Autoencoder: The VAE is the module we’ll use to decode latent representations into real images. It has two parts, an encoder and a decoder. The encoder converts an image into a low-dimensional latent representation, which serves as the input to the U-Net. The decoder, in turn, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to obtain the latent representations (latents) of the images for the forward diffusion process, which adds more and more noise at each step. During inference, the VAE decoder transforms the denoised latents produced by the reverse diffusion process back into images. Inference therefore requires only the VAE decoder.
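
The idea that the forward process adds more and more noise at each step can be made concrete with a noise schedule. The sketch below uses a linear schedule of per-step noise variances (betas) and tracks the cumulative fraction of signal that survives. The start and end values are illustrative assumptions, not necessarily the schedule used to train Stable Diffusion.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (illustrative values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all steps s up to t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)

# Fraction of the original signal variance left at a few steps:
# close to 1 at the start, close to 0 (almost pure noise) at the end.
for t in (0, 250, 500, 999):
    print(t, alpha_bars[t])
```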

Model Architecture

  • The Stable Diffusion model takes both a latent seed and a text prompt as inputs.
  • The latent seed is used to generate random latent image representations of size 64×64, whereas the text prompt is transformed into text embeddings of size 77×768 via CLIP’s text encoder.
  • The random latent image representations are then iteratively denoised by the U-Net while being conditioned on the text embeddings.
  • A scheduler algorithm computes a denoised latent image representation from the noise residual produced by the U-Net.
  • The denoising process is repeated to retrieve progressively better latent image representations.
  • Finally, the decoder component of the variational autoencoder decodes the latent image representation into an image.
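
The control flow of these steps can be sketched as a toy loop in plain Python. Everything here is a stand-in under stated assumptions: `predict_noise` is an oracle that is simply handed the clean latent (a real U-Net would predict the noise from the noisy latents, the timestep, and the text embeddings), the schedule in `make_alpha_bars` is invented, and `scheduler_step` follows a deterministic DDIM-style update, which is one of several possible schedulers. The text prompt and the VAE decoder are left out to keep the loop itself in focus.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Toy schedule: index 0 is clean (alpha_bar = 1), higher indices are noisier."""
    return [1.0 - t / (num_steps + 1) for t in range(num_steps + 1)]


def predict_noise(latents, target, alpha_bar_t):
    """Stand-in for the U-Net: an oracle that knows the clean latent `target`."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(z - a * x) / b for z, x in zip(latents, target)]


def scheduler_step(latents, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM-style update from step t to step t - 1."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0_pred = [(z - b * e) / a for z, e in zip(latents, eps)]  # denoised estimate
    a_prev, b_prev = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [a_prev * x + b_prev * e for x, e in zip(x0_pred, eps)]


def denoise(target, num_steps=10, seed=0):
    """Start from random latents and denoise them step by step."""
    rng = random.Random(seed)                         # plays the role of the latent seed
    latents = [rng.gauss(0.0, 1.0) for _ in target]   # random starting latents
    alpha_bars = make_alpha_bars(num_steps)
    for t in range(num_steps, 0, -1):                 # from noisiest step down to 1
        eps = predict_noise(latents, target, alpha_bars[t])
        latents = scheduler_step(latents, eps, alpha_bars[t], alpha_bars[t - 1])
    return latents


result = denoise([0.5, -1.0, 0.25, 2.0])
```

Because the noise predictor here is an oracle, the loop lands on the clean latent regardless of the seed. With a learned U-Net, the seed and the prompt together determine which image comes out.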


  • An insect robot preparing a delicious meal.

  • A photo of an astronaut riding a horse.


  • Educational and creative tools.
  • Creation of works of art and their use in creative processes like design.
  • Research on generative models.
  • Safe deployment of models that have the potential to generate harmful content.


The Stable Diffusion model supports several of the same operations as DALL-E. Given a text description of a desired image, it can generate a high-quality picture that matches the description. It can also generate a realistic-looking image from a simple sketch plus a textual description of the desired image.

The model also has limitations:

  • It does not achieve perfect photorealism.
  • Faces and people, in general, may not be generated properly.
  • It was trained mainly with English captions and does not work as well in other languages.
  • It is unable to render legible text.
  • It does not perform well on more challenging compositional tasks, such as creating an image of “A red cube on top of a blue sphere”.

If you have any projects regarding the above, contact us.

Disclaimer: The opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Dexlock.
