Diffusion models can be formulated as Hierarchical Variational Auto-Encoders (HVAEs). The critical technique for optimizing variational auto-encoders is the evidence lower bound (ELBo), which lets us optimize an otherwise intractable likelihood.
Evidence Lower Bound (ELBo) and Variational Auto-encoder (VAE)
Given a variable x and its latent variable z,
p(x)=∫p(x,z)dz
However, the integration over all latent variables is intractable, so we introduce an encoder parametrized by ϕ, qϕ(z∣x), to derive the evidence lower bound,

\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))
In the training procedure, given a sample x, we obtain an encoding z by sampling from qϕ(z∣x), and reconstruct x′ by sampling from the decoder pθ(x∣z). The first term in the ELBo is a reconstruction term, and the second term is a prior matching term. We usually assume z follows a simple distribution such as a standard Gaussian. In actual optimization, we use Monte Carlo samples to estimate the expectations.
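As a minimal PyTorch sketch of one such training step (the encoder and decoder modules and their signatures are illustrative assumptions, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    """One Monte Carlo estimate of the negative ELBo.
    `encoder` maps x -> (mu, log_var); `decoder` maps z -> x_recon.
    Both are hypothetical modules used only for illustration."""
    mu, log_var = encoder(x)
    # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    x_recon = decoder(z)
    # Reconstruction term (Gaussian likelihood up to a constant)
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Prior matching term: KL(N(mu, sigma^2 I) || N(0, I)) in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl  # minimizing this maximizes the ELBo
```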
Hierarchical VAE (HVAE)
One problem with VAEs is that the prior distribution is usually very simple. To model more complex distributions, we can introduce T latent variables z1,…,zT stacked hierarchically.
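In the Markovian case, where each latent depends only on the one before it, the generative model and the encoder factorize as

p(x, z_{1:T}) = p(z_T)\, p_\theta(x \mid z_1) \prod_{t=2}^{T} p_\theta(z_{t-1} \mid z_t)

q_\phi(z_{1:T} \mid x) = q_\phi(z_1 \mid x) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1})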
The most intuitive way to understand diffusion models is as a process that gradually noises a data sample into Gaussian noise, with the reverse process generating new data samples from Gaussian noise. Mathematically, the simplest formulation of diffusion models is as a special HVAE. In particular, instead of the usual dimensionality reduction in an HVAE, each of z1,…,zT has the same dimension as x. Therefore, we denote x by x0, and z1,…,zT by x1,…,xT. Further, we assume the encoder follows a fixed Gaussian distribution,
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t)\mathbf{I})
where αt is a hyper-parameter.
Noising Encoder
We will show that after multiple encoding steps, the encoded sample is simply a rescaled version of the original sample plus Gaussian noise. Using the reparametrization trick, we can write
x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_0

where \epsilon_0 \sim \mathcal{N}(\epsilon_0; \mathbf{0}, \mathbf{I})
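A minimal sketch of this reparametrized sampling step (shapes and schedule handling are illustrative):

```python
import torch

def noise_one_step(x_prev: torch.Tensor, alpha_t: float) -> torch.Tensor:
    """Sample x_t from q(x_t | x_{t-1}) via the reparametrization trick."""
    eps = torch.randn_like(x_prev)  # eps ~ N(0, I)
    return (alpha_t ** 0.5) * x_prev + ((1 - alpha_t) ** 0.5) * eps
```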
By applying the encoder repeatedly and merging the resulting Gaussians, one can show that

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

that is, q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}).
Writing out the ELBo for this hierarchy and rearranging, we get

\mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] - \mathbb{E}_{q(x_{T-1} \mid x_0)}[D_{\mathrm{KL}}(q(x_T \mid x_{T-1}) \,\|\, p(x_T))] - \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}[D_{\mathrm{KL}}(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1}))]

Now the first term is the reconstruction term, which can be approximated and optimized using a Monte Carlo estimate. The second term is the prior matching term; since our encoder and prior distribution are fixed, there are no parameters to optimize here. The new third term is the consistency term. With smarter term collection and cancellation, we can decompose the ELBo differently so that each term is an expectation over only one random variable,

\mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] - D_{\mathrm{KL}}(q(x_T \mid x_0) \,\|\, p(x_T)) - \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}[D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))]
With the new derivation, we can understand the third term as a denoising matching term: pθ(xt−1∣xt) learns to undo the noise added at step t by approximating the ground-truth denoising distribution q(xt−1∣xt,x0), which has access to the data. We will now rewrite q(xt−1∣xt,x0) to derive different training objectives for diffusion models.
1st Training Objective: Reconstruct Data Sample
We’ve already derived q(xt∣x0), which is
q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}), \qquad \text{equivalently} \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0
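A sketch of noising a sample directly to step t under this closed form (the schedule tensor `alphas` is an assumed input holding α1,…,αT):

```python
import torch

def noise_to_step_t(x0: torch.Tensor, alphas: torch.Tensor, t: int):
    """Sample x_t from q(x_t | x_0) in a single shot.
    alpha_bar_t is the running product of the first t alphas."""
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    return x_t, eps  # eps is kept: it becomes the DDPM training target
```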
By Bayes' rule, q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}, and using the above fact with a fair amount of algebra, we can show that

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\mathbf{I})

where

\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}, \qquad \sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}
Thus, matching the mean of pθ(xt−1∣xt) to μq(xt,x0), the training objective becomes predicting the original data sample from its noisified version after t steps.
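A sketch of this objective (the network `x0_model` and its `(x_t, t)` signature are hypothetical placeholders):

```python
import torch
import torch.nn.functional as F

def x0_prediction_loss(x0_model, x0, alphas):
    """Train a network to recover x_0 from its noised version x_t."""
    T = alphas.shape[0]
    t = torch.randint(1, T + 1, (1,)).item()  # uniform random time step
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    x0_pred = x0_model(x_t, t)  # hypothetical network predicting x_0
    return F.mse_loss(x0_pred, x0)
```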
2nd Training Objective: Predict the Noise
The most popular interpretation, and the training objective in DDPM (Denoising Diffusion Probabilistic Models), however, is to predict the noise added at each step. To derive this equivalent training objective, we rewrite μq(xt,x0) by substituting in

x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0}{\sqrt{\bar{\alpha}_t}}

which yields

\mu_q(x_t, \epsilon_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}\sqrt{\alpha_t}}\, \epsilon_0
Thus, the training objective becomes predicting the noise ϵ0 added at time step t.
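A sketch of the simplified DDPM objective under the same assumptions (`eps_model` is a hypothetical noise-prediction network):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas):
    """Simplified DDPM objective: predict the noise that corrupted x_0."""
    T = alphas.shape[0]
    t = torch.randint(1, T + 1, (1,)).item()
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)  # the ground-truth noise to predict
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    eps_pred = eps_model(x_t, t)  # hypothetical noise-prediction network
    return F.mse_loss(eps_pred, eps)
```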
3rd Training Objective: Predict the Score Function
In my opinion, the most profound understanding of diffusion models is that they are learning a score function. Roughly speaking, the score function is the gradient of the log data density, so given any point, it points toward denser regions of the data distribution. To derive this formulation, we make use of Tweedie's formula. Given z∼N(z;μz,Σz),
\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)
Applying this formula, we have that
\mathbb{E}[\mu_{x_t} \mid x_t] = x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p(x_t)
We've derived that

x_t \sim \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I})

which gives us \mathbb{E}[\mu_{x_t} \mid x_t] = \sqrt{\bar{\alpha}_t}\, x_0. Thus, we can rewrite x0 as

x_0 = \frac{x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p(x_t)}{\sqrt{\bar{\alpha}_t}}
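Comparing this with the earlier rewrite x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0)/\sqrt{\bar{\alpha}_t} shows the score and the noise differ only by a scaling: \nabla_{x_t} \log p(x_t) = -\epsilon_0 / \sqrt{1 - \bar{\alpha}_t}. A minimal sketch of converting a trained noise predictor into a score estimate (names are illustrative):

```python
import torch

def score_from_eps(eps_model, x_t: torch.Tensor, t: int, alpha_bar_t):
    """Score estimate from a noise predictor:
    score(x_t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    eps_pred = eps_model(x_t, t)
    return -eps_pred / (1.0 - alpha_bar_t) ** 0.5
```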
Now everything is great. We can train a neural network to predict the original sample from noised samples, to predict the noise and denoise, or to predict the score function. But what if we want to guide the generation given some prior information, say we want to generate a cat but we trained on all kinds of animals? The two main guidance methods are classifier guidance and classifier-free guidance.
Classifier Guidance
More concretely, given some condition y, we want to generate samples according to p(x0∣y). Let's decompose the corresponding score using Bayes' rule,

\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)
The first term is the usual unconditional score, and the second term is an adversarial gradient of a classifier p(y∣xt). Thus, in practice, we can train a classifier p(y∣xt) on noised samples to guide the generation process.
To gain fine-grained control over the guidance, we can introduce a hyperparameter γ to scale the adversarial gradient,
\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \gamma \nabla_{x_t} \log p(y \mid x_t)
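As a rough sketch of how this enters sampling (the classifier `classifier_log_prob` returning log p(y∣x_t) and the other names are assumptions; schedule details are omitted):

```python
import torch

def classifier_guided_score(eps_model, classifier_log_prob,
                            x_t, t, y, alpha_bar_t, gamma):
    """Unconditional score plus a scaled classifier gradient
    grad_{x_t} log p(y | x_t), per the equation above."""
    x_t = x_t.detach().requires_grad_(True)
    log_p_y = classifier_log_prob(x_t, t, y)  # hypothetical log p(y | x_t)
    grad_log_p_y = torch.autograd.grad(log_p_y.sum(), x_t)[0]
    uncond_score = -eps_model(x_t, t) / (1.0 - alpha_bar_t) ** 0.5
    return uncond_score + gamma * grad_log_p_y
```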
Classifier-free Guidance
Nevertheless, it is costly, and sometimes impossible, to train a classifier for each new condition. To derive classifier-free guidance, we rewrite the scaled conditional score using \nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t),

\nabla_{x_t} \log p(x_t) + \gamma \nabla_{x_t} \log p(y \mid x_t) = \gamma \nabla_{x_t} \log p(x_t \mid y) + (1 - \gamma) \nabla_{x_t} \log p(x_t)
The first term is the conditional score, and the second term is the unconditional score. In practice, we can train a single neural network that always takes in conditioning information, passing an empty (null) condition for the unconditional score.
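A minimal sketch of classifier-free guidance at sampling time (the conditional network and `null_token` are illustrative assumptions):

```python
def cfg_score(eps_model, x_t, t, y, null_token, alpha_bar_t, gamma):
    """Combine conditional and unconditional predictions as
    gamma * conditional + (1 - gamma) * unconditional."""
    eps_cond = eps_model(x_t, t, y)             # conditioned on y
    eps_uncond = eps_model(x_t, t, null_token)  # "empty" condition
    eps_guided = gamma * eps_cond + (1.0 - gamma) * eps_uncond
    return -eps_guided / (1.0 - alpha_bar_t) ** 0.5  # noise -> score
```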
Toy Implementation
For a toy implementation, you can check this colab notebook.