
Diffusion models

February 25, 2023

This post gives a gentle introduction to diffusion models, a family of state-of-the-art generative models.

Theory

A diffusion model can be formulated as a Hierarchical Variational Auto-Encoder (HVAE). To optimize variational auto-encoders, the critical technique is the evidence lower bound (ELBO), which replaces an otherwise intractable likelihood with a tractable objective.

Evidence Lower Bound (ELBO) and Variational Auto-Encoder (VAE)

Given a variable $x$ and its latent variable $z$,

$$p(x) = \int p(x, z)\, dz$$

However, the integration over all latent variables is intractable, so we introduce an encoder parametrized by $\phi$, $q_\phi(z|x)$, to derive the evidence lower bound,

$$
\begin{aligned}
\log p(x) &= \log \int p(x, z)\, dz \\
&= \log \int \frac{p(x, z)\, q_{\phi}(z|x)}{q_{\phi}(z|x)}\, dz \\
&= \log \mathbb{E}_{q_{\phi}(z|x)} \Big[ \frac{p(x, z)}{q_{\phi}(z|x)} \Big] \\
&\geq \mathbb{E}_{q_{\phi}(z|x)} \Big[ \log \frac{p(x, z)}{q_{\phi}(z|x)} \Big] \\
&= \mathbb{E}_{q_{\phi}(z|x)} \Big[ \log \frac{p(x|z)\, p(z)}{q_{\phi}(z|x)} \Big] \\
&= \mathbb{E}_{q_{\phi}(z|x)} \big[ \log p(x|z) \big] + \mathbb{E}_{q_{\phi}(z|x)} \Big[ \log \frac{p(z)}{q_{\phi}(z|x)} \Big] \\
&= \mathbb{E}_{q_{\phi}(z|x)} \big[ \log p(x|z) \big] - D_{\mathrm{KL}}\big(q_{\phi}(z|x)\,\|\,p(z)\big)
\end{aligned}
$$

where the inequality is Jensen's inequality.

For the variational auto-encoder, we also introduce a decoder parametrized by $\theta$, $p_\theta(x|z)$, so the training objective becomes:

$$\log p(x) \geq \mathbb{E}_{q_{\phi}(z|x)}\big[\log p_\theta(x|z)\big] - D_{\mathrm{KL}}\big(q_{\phi}(z|x)\,\|\,p(z)\big)$$

In the training procedure, given a sample $x$, we obtain the encoding $z$ by sampling from $q_{\phi}(z|x)$ and reconstruct an $x'$ by sampling from $p_\theta(x|z)$. The first term in the ELBO is a reconstruction term, and the second term is a prior-matching term. We usually assume $z$ follows a simple distribution like a standard Gaussian. In actual optimization, we use Monte Carlo samples to estimate the expectations.
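To make this concrete, here is a minimal sketch of a VAE loss in PyTorch under common assumptions: a Gaussian encoder, a standard Gaussian prior, and a squared-error reconstruction term. The architecture and dimensions are placeholders, not from this post.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    # Minimal sketch: Gaussian encoder q_phi(z|x), decoder p_theta(x|z).
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def loss(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Reconstruction term, one-sample Monte Carlo estimate.
        rec = ((self.dec(z) - x) ** 2).sum(dim=-1)
        # Prior-matching term: KL(q_phi(z|x) || N(0, I)) in closed form.
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(dim=-1)
        return (rec + kl).mean()  # negative ELBO, up to constants
```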

Hierarchical VAE (HVAE)

One problem with VAEs is that the prior distribution is usually very simple. To model more complex distributions, we can introduce $T$ latent variables $z_1, \dots, z_T$.

$$
\begin{aligned}
p_{\theta}(x, z_{1:T}) &= p(z_T)\, p_{\theta}(x|z_1) \prod_{t=2}^{T} p_{\theta}(z_{t-1} | z_t) \\
q_{\phi}(z_{1:T}|x) &= q_{\phi}(z_1|x) \prod_{t=2}^{T} q_{\phi}(z_t | z_{t-1})
\end{aligned}
$$

We can obtain a similar ELBO with Jensen's inequality,

$$\log p(x) \geq \mathbb{E}_{q_{\phi}(z_{1:T}|x)} \Big[ \log \frac{p_{\theta}(x, z_{1:T})}{q_{\phi}(z_{1:T}|x)} \Big]$$

Diffusion Model as an HVAE

The most intuitive way to understand diffusion models is as a process that gradually noises a data sample into Gaussian noise, with a reverse process that generates new data samples from Gaussian noise. Mathematically, the easiest formulation of a diffusion model is as a special HVAE. In particular, instead of the usual dimensionality reduction in an HVAE, each of $z_1, \dots, z_T$ has the same dimension as $x$. Therefore, we denote $x$ by $x_0$ instead, and $z_1, \dots, z_T$ by $x_1, \dots, x_T$. Further, we assume the encoder follows a fixed Gaussian distribution,

$$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, (1-\alpha_t) I\big)$$

where $\alpha_t$ is a hyperparameter.

Noising Encoder

We will show that after multiple encoding steps, the encoded sample is simply a scaled version of the original sample plus Gaussian noise. Using the reparametrization trick, we can write

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon$$

where $\epsilon \sim \mathcal N(\epsilon; 0, I)$ is a fresh Gaussian sample at each step.

By applying the encoder repeatedly, one can show that

$$
\begin{aligned}
x_t &= \sqrt{\prod_{i=1}^{t} \alpha_i}\; x_0 + \sqrt{1 - \prod_{i=1}^{t} \alpha_i}\; \epsilon_0 \\
&:= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_0 \\
&\sim \mathcal{N}\big(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t) I\big)
\end{aligned}
$$

where $\epsilon_0 \sim \mathcal N(\epsilon_0; 0, I)$ collects the merged Gaussian noises.
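This closed form means we can noise a sample to any timestep in one shot. Below is a minimal sketch, assuming a linear schedule for $1-\alpha_t$; the schedule choice is an assumption, not something this post prescribes.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear schedule for 1 - alpha_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, eps=None):
    # Sample x_t ~ q(x_t | x_0) directly via the closed form above.
    if eps is None:
        eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps
```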

ELBO for Diffusion Models

Now we rewrite the ELBO to decompose it into more intuitive terms,

$$
\begin{aligned}
\log p(x) &\geq \mathbb{E}_{q(x_{1:T}|x_0)} \Big[ \log \frac{p_{\theta}(x_0, x_{1:T})}{q(x_{1:T}|x_0)} \Big] \\
&= \mathbb{E}_{q(x_1|x_0)} \big[ \log p_{\theta}(x_0| x_1) \big] - \mathbb{E}_{q(x_{T-1}|x_0)} \big[ D_{\mathrm{KL}}\big(q(x_T| x_{T-1}) \,\|\, p(x_T) \big) \big] \\
&\quad - \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1},x_{t+1}| x_0)} \big[ D_{\mathrm{KL}}\big(q(x_t| x_{t-1}) \,\|\, p_{\theta}(x_t| x_{t+1}) \big) \big]
\end{aligned}
$$

Now the first term is the reconstruction term, which can be approximated and optimized with a Monte Carlo estimate. The second term is the prior-matching term, but since our encoder and prior distribution are fixed, there are no parameters to optimize here. The new third term is the consistency term, which pushes the learned denoising step to agree with the forward noising step at every intermediate timestep; however, each of its terms takes an expectation over two random variables, which makes Monte Carlo estimates higher-variance. With smarter term collection and cancellation, we can decompose the ELBO differently so that each term is an expectation over only one random variable.

$$
\begin{aligned}
\log p(x) &\geq \mathbb{E}_{q(x_1|x_0)} \big[ \log p_{\theta}(x_0| x_1) \big] - D_{\mathrm{KL}}\big(q(x_T| x_0) \,\|\, p(x_T) \big) \\
&\quad - \sum_{t=2}^{T} \mathbb{E}_{q(x_{t}|x_0)} \big[ D_{\mathrm{KL}}\big(q(x_{t-1}| x_t, x_0) \,\|\, p_{\theta}(x_{t-1}| x_t) \big) \big]
\end{aligned}
$$

With the new derivation, we can understand the third term as a denoising matching term, where $p_{\theta}(x_{t-1}| x_t)$ approximates the ground-truth denoising distribution $q(x_{t-1}| x_t, x_0)$, which has access to the original data. We will now rewrite $q(x_{t-1}| x_t, x_0)$ to derive different training objectives for diffusion models.

1st Training Objective: Reconstruct Data Sample

We've already derived $q(x_t | x_0)$: in reparametrized form,

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_0, \qquad q(x_t | x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t) I\big)
$$

By Bayes' rule, using the above fact and a fair amount of algebra, we can show that

$$
\begin{aligned}
q(x_{t-1}| x_t, x_0) &= \frac{q(x_{t}| x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&= \frac{q(x_{t}| x_{t-1})\, q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&= \mathcal{N}\Big(x_{t-1};\; \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t)\, x_0}{1-\bar{\alpha}_t},\; \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, I \Big) \\
&:= \mathcal{N}\big(x_{t-1};\, \mu_q(x_t, x_0),\, \sigma_q^2(t)\, I\big)
\end{aligned}
$$

where the second line uses the Markov property of the forward process.
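As a sketch, the posterior mean and variance are straightforward to compute from the schedule defined earlier (scalar timestep t for simplicity):

```python
def q_posterior(x0, xt, t):
    # Mean and variance of q(x_{t-1} | x_t, x_0); reuses alphas/alpha_bars above.
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (a_t.sqrt() * (1 - ab_prev) * xt
            + ab_prev.sqrt() * (1 - a_t) * x0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return mean, var
```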

Set our decoder to be $p_\theta(x_{t-1}|x_t) = \mathcal N(x_{t-1}; \mu_\theta(x_t, t), \sigma_q^2(t)I)$; then we can use the KL divergence between Gaussians to derive,

$$
\argmin_{\theta} D_{\mathrm{KL}}\big(q(x_{t-1}| x_t, x_0) \,\|\, p_{\theta}(x_{t-1}| x_t) \big) = \argmin_{\theta} \frac{1}{2\sigma_q^2(t)}\big\|\mu_{\theta} - \mu_q\big\|_2^2
$$

Finally, instead of predicting the mean $\mu_\theta(x_t, t)$ directly, we can predict the original sample $\hat x_\theta(x_t, t)$ and set the mean to match the form of $\mu_q(x_t, x_0)$,

$$\mu_\theta(x_t, t) = \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t)\, \hat x_\theta (x_t, t)}{1-\bar{\alpha}_t}$$

Now substitute this parametrization into the KL divergence term,

$$
\argmin_{\theta} D_{\mathrm{KL}}\big(q(x_{t-1}| x_t, x_0) \,\|\, p_{\theta}(x_{t-1}| x_{t}) \big) = \argmin_{\theta} \frac{1}{2\sigma_q^2(t)}\, \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2}\, \big\|\hat x_\theta (x_t, t) - x_0\big\|_2^2
$$

Thus, the training objective becomes predicting the original data sample from a noisified version after $t$ steps.
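A minimal training-step sketch for this objective, dropping the time-dependent weight (a common simplification, and an assumption here); x0_model is a placeholder network predicting $\hat x_\theta(x_t, t)$:

```python
def x0_prediction_loss(x0_model, x0):
    # Noise to a random timestep, then regress the clean sample.
    t = torch.randint(0, T, (x0.shape[0],))
    xt, _ = q_sample(x0, t)
    return ((x0_model(xt, t) - x0) ** 2).mean()
```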

2nd Training Objective: Predict the Noise

The most popular interpretation, and the training objective used in DDPM (Denoising Diffusion Probabilistic Models) [4], is to predict the noise in each denoising step. To derive this equivalent training objective, we rewrite $\mu_q(x_t, x_0)$ by substituting in $x_0 = \frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_0}{\sqrt{\bar\alpha_t}}$.

$$
\begin{aligned}
\mu_q(x_t, x_0) &= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t)\, x_0}{1-\bar{\alpha}_t} \\
&= \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\, \epsilon_0
\end{aligned}
$$

Now instead of predicting the sample $\hat x_\theta(x_t, t)$, we predict the noise in a noised sample at time step $t$, $\hat\epsilon_\theta(x_t, t)$, and set

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\, \hat\epsilon_\theta(x_t, t)$$

Substituting this parametrization into the KL divergence term,

$$
\argmin_{\theta} D_{\mathrm{KL}}\big(q(x_{t-1}| x_t, x_0) \,\|\, p_{\theta}(x_{t-1}| x_{t}) \big) = \argmin_{\theta} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\,\alpha_t}\, \big\|\hat \epsilon_\theta (x_t, t) - \epsilon_0\big\|_2^2
$$

Thus, the training objective becomes predicting the noise added at time step $t$.
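A minimal sketch of this training step with the unweighted loss, as in DDPM's simplified objective; eps_model is a placeholder noise-prediction network:

```python
def ddpm_loss(eps_model, x0):
    # Sample a timestep, noise x_0 with known eps, and regress that eps.
    t = torch.randint(0, T, (x0.shape[0],))
    xt, eps = q_sample(x0, t)
    return ((eps_model(xt, t) - eps) ** 2).mean()
```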

3rd Training Objective: Predict the Score Function

In my opinion, the most profound way to understand diffusion models is that they are training a score function. Roughly speaking, the score function gives the gradient field of the log data density, so at any point it points toward dense regions of the data distribution. To derive this formulation, we make use of Tweedie's formula. Given $z \sim \mathcal N(z; \mu_z, \Sigma_z)$,

$$\mathbb E[\mu_z|z] = z + \Sigma_z\, \nabla_z \log p(z)$$

Applying this formula, we have that

$$\mathbb E[\mu_{x_t}|x_t] = x_t + (1-\bar\alpha_t)\, \nabla_{x_t} \log p(x_t)$$

We've derived that $x_t \sim \mathcal N(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t)I)$, which gives us $\mathbb E[\mu_{x_t}|x_t] = \sqrt{\bar\alpha_t}\, x_0$. Thus, we can rewrite $x_0$ as

$$x_0 = \frac{x_t+(1-\bar\alpha_t)\, \nabla_{x_t}\log p(x_t)}{\sqrt{\bar\alpha_t}}$$

Substitute this back into $\mu_q(x_t, x_0)$,

$$
\begin{aligned}
\mu_q(x_t, x_0) &= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}} (1-\alpha_t)\, x_0}{1-\bar{\alpha}_t} \\
&= \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, \nabla_{x_t} \log p(x_t)
\end{aligned}
$$

Now instead of predicting the sample $\hat x_\theta(x_t, t)$ or the noise $\hat\epsilon_\theta(x_t, t)$, we predict the score function $s_\theta(x_t, t)$ to match $\nabla_{x_t} \log p(x_t)$,

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\, x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\, s_\theta(x_t, t)$$

Substituting this parametrization into the KL divergence term,

$$
\argmin_{\theta} D_{\mathrm{KL}}\big(q(x_{t-1}| x_t, x_0) \,\|\, p_{\theta}(x_{t-1}| x_{t}) \big) = \argmin_{\theta} \frac{1}{2\sigma_q^2(t)}\, \frac{(1-\alpha_t)^2}{\alpha_t}\, \big\|s_\theta (x_t, t) - \nabla_{x_t} \log p(x_t)\big\|_2^2
$$
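Comparing the two expressions we derived for $\mu_q(x_t, x_0)$, the noise-prediction form and the score form, ties the parametrizations together:

$$\nabla_{x_t} \log p(x_t) = -\frac{1}{\sqrt{1-\bar\alpha_t}}\, \epsilon_0$$

so a trained noise predictor implicitly defines a score model via $s_\theta(x_t, t) = -\hat\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$, and vice versa.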

Guidance for Generation

Now everything is great. We can train a neural network to reconstruct the data sample, to predict the noise and denoise, or to predict the score function. But what if we want to guide the generation given some prior information, say we want to generate a cat but we train on all kinds of animals? The two main guidance methods are classifier guidance and classifier-free guidance.

Classifier Guidance

More concretely, given some condition $y$, we want to generate a sample according to $p(x_0 | y)$. Let's decompose this using Bayes' rule,

$$p(x_t |y) = \frac{p(x_t)\, p(y|x_t)}{p(y)}$$

and the score function is

$$
\begin{aligned}
\nabla_{x_t} \log p(x_t|y) &= \nabla_{x_t}\log p(x_t) + \nabla_{x_t} \log p(y|x_t) - \nabla_{x_t} \log p(y) \\
&= \nabla_{x_t}\log p(x_t) + \nabla_{x_t} \log p(y|x_t)
\end{aligned}
$$

since $p(y)$ does not depend on $x_t$.

The first term is the usual unconditional score, and the second term is an adversarial gradient of a classifier $p(y|x_t)$. Thus, in practice, we can train a classifier on noised samples $x_t$ to guide the generation process.

To gain fine-grained control over the guidance, we can introduce a hyperparameter $\gamma$ to scale the adversarial gradient,

$$\nabla_{x_t} \log p(x_t|y) = \nabla_{x_t}\log p(x_t) + \gamma\, \nabla_{x_t} \log p(y|x_t)$$
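A minimal sketch of computing this guided score, assuming a score network and a classifier trained on noised inputs (both placeholder interfaces):

```python
def guided_score(score_model, classifier, xt, t, y, gamma=1.0):
    # Gradient of log p(y | x_t) w.r.t. x_t, from a classifier on noised inputs.
    x_in = xt.detach().requires_grad_(True)
    log_probs = classifier(x_in, t).log_softmax(dim=-1)
    selected = log_probs[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]
    # Unconditional score plus the scaled classifier gradient.
    return score_model(xt, t) + gamma * grad
```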

Classifier-free Guidance

Nevertheless, it's costly, and sometimes impossible, to train a classifier for each new kind of condition. To derive classifier-free guidance, we rewrite the scaled conditional score,

$$
\begin{aligned}
\nabla_{x_t} \log p(x_t|y) &= \nabla_{x_t}\log p(x_t) + \gamma\, \nabla_{x_t} \log p(y|x_t) \\
&= \nabla_{x_t}\log p(x_t) + \gamma \big(\nabla_{x_t} \log p(x_t|y) - \nabla_{x_t} \log p(x_t)\big) \\
&= \gamma\, \nabla_{x_t} \log p(x_t|y) + (1-\gamma)\, \nabla_{x_t} \log p(x_t)
\end{aligned}
$$

The first term is the conditional score, and the second term is the unconditional score. In practice, we can train a single neural network that always takes in conditioning information, passing a null condition to obtain the unconditional score, so both terms come from the same model.
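A minimal sketch at sampling time, assuming a noise-prediction network that accepts an optional condition (None as the null condition is a placeholder convention); the same linear combination applies to noise predictions because the score and the noise are proportional, as noted earlier:

```python
def cfg_eps(eps_model, xt, t, y, gamma=3.0):
    # Combine conditional and unconditional predictions with scale gamma.
    eps_cond = eps_model(xt, t, y)       # conditioned on y
    eps_uncond = eps_model(xt, t, None)  # null condition -> unconditional
    return gamma * eps_cond + (1 - gamma) * eps_uncond
```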

Toy Implementation

For a toy implementation, you can check this colab notebook.

References

[1] Luo, Calvin. “Understanding diffusion models: A unified perspective.” arXiv preprint arXiv:2208.11970 (2022).

[2] Song, Yang, et al. “Score-based generative modeling through stochastic differential equations.” arXiv preprint arXiv:2011.13456 (2020).

[3] Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models.” arXiv preprint arXiv:2010.02502 (2020).

[4] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.

[5] Ho, Jonathan, and Tim Salimans. “Classifier-free diffusion guidance.” arXiv preprint arXiv:2207.12598 (2022).

