Diffusion models can be formulated as Hierarchical Variational Auto-Encoders (HVAEs). The critical technique for optimizing variational auto-encoders is the evidence lower bound (ELBo), which lets us optimize an otherwise intractable likelihood.
Evidence Lower Bound (ELBo) and Variational Auto-encoder (VAE)
Given a variable x and its latent variable z,
p(x)=∫p(x,z)dz
However, the integration over all latent variables is intractable, so we introduce an encoder parametrized by ϕ, qϕ(z∣x), to derive the evidence lower bound,

\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))
In the training procedure, given a sample x, we obtain an encoding z by sampling from qϕ(z∣x), and reconstruct x′ by sampling from the decoder pθ(x∣z). The first term in the ELBo is a reconstruction term, and the second term is a prior matching term. We usually assume z follows a simple distribution such as a standard Gaussian. In actual optimization, we use Monte Carlo samples to estimate the expectations.
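As a minimal PyTorch sketch of one such training step (the encoder and decoder modules and their signatures are illustrative assumptions, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    """One Monte Carlo estimate of the negative ELBo.
    `encoder` maps x -> (mu, log_var); `decoder` maps z -> x_recon.
    Both are hypothetical modules used only for illustration."""
    mu, log_var = encoder(x)
    # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    x_recon = decoder(z)
    # Reconstruction term (Gaussian likelihood up to a constant)
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Prior matching term: KL(N(mu, sigma^2 I) || N(0, I)) in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl  # minimizing this maximizes the ELBo
```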
Hierarchical VAE (HVAE)
One problem with VAEs is that the prior distribution is usually very simple. To model more complex distributions, we can introduce T latent variables z1,…,zT stacked hierarchically.
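In the Markovian case, where each latent depends only on the one before it, the generative model and the encoder factorize as

p(x, z_{1:T}) = p(z_T)\, p_\theta(x \mid z_1) \prod_{t=2}^{T} p_\theta(z_{t-1} \mid z_t)

q_\phi(z_{1:T} \mid x) = q_\phi(z_1 \mid x) \prod_{t=2}^{T} q_\phi(z_t \mid z_{t-1})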
The most intuitive way to understand diffusion models is as a process that gradually noises a data sample into Gaussian noise, with the reverse process generating new data samples from Gaussian noise. Mathematically, the simplest formulation of diffusion models is as a special HVAE. In particular, instead of the usual dimensionality reduction in an HVAE, each of z1,…,zT has the same dimension as x. Therefore, we denote x by x0, and z1,…,zT by x1,…,xT. Further, we assume the encoder follows a fixed Gaussian distribution,
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t)\mathbf{I})
where αt is a hyper-parameter.
Noising Encoder
We will show that after multiple encoding steps, the encoded sample is simply a rescaled version of the original sample plus Gaussian noise. Using the reparametrization trick, we can write
x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_0

where \epsilon_0 \sim \mathcal{N}(\epsilon_0; \mathbf{0}, \mathbf{I})
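A minimal sketch of this reparametrized sampling step (shapes and schedule handling are illustrative):

```python
import torch

def noise_one_step(x_prev: torch.Tensor, alpha_t: float) -> torch.Tensor:
    """Sample x_t from q(x_t | x_{t-1}) via the reparametrization trick."""
    eps = torch.randn_like(x_prev)  # eps ~ N(0, I)
    return (alpha_t ** 0.5) * x_prev + ((1 - alpha_t) ** 0.5) * eps
```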
By applying the encoder repeatedly and merging the resulting Gaussians, one can show that

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

that is, q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}).
Writing out the ELBo for this hierarchy and rearranging, we get

\mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] - \mathbb{E}_{q(x_{T-1} \mid x_0)}[D_{\mathrm{KL}}(q(x_T \mid x_{T-1}) \,\|\, p(x_T))] - \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}[D_{\mathrm{KL}}(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1}))]

Now the first term is the reconstruction term, which can be approximated and optimized using a Monte Carlo estimate. The second term is the prior matching term; since our encoder and prior distribution are fixed, there are no parameters to optimize here. The new third term is the consistency term. With smarter term collection and cancellation, we can decompose the ELBo differently so that each term is an expectation over only one random variable,

\mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] - D_{\mathrm{KL}}(q(x_T \mid x_0) \,\|\, p(x_T)) - \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}[D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))]
With the new derivation, we can understand the third term as a denoising matching term: pθ(xt−1∣xt) learns to undo the noise added at step t by approximating the ground-truth denoising distribution q(xt−1∣xt,x0), which has access to the data. We will now rewrite q(xt−1∣xt,x0) to derive different training objectives for diffusion models.
1st Training Objective: Reconstruct Data Sample
We’ve already derived q(xt∣x0), which is
q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}), \qquad \text{equivalently} \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0
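A sketch of noising a sample directly to step t under this closed form (the schedule tensor `alphas` is an assumed input holding α1,…,αT):

```python
import torch

def noise_to_step_t(x0: torch.Tensor, alphas: torch.Tensor, t: int):
    """Sample x_t from q(x_t | x_0) in a single shot.
    alpha_bar_t is the running product of the first t alphas."""
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    return x_t, eps  # eps is kept: it becomes the DDPM training target
```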
By Bayes' rule, q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}, and using the above fact with a fair amount of algebra, we can show that

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1};\ \mu_q(x_t, x_0),\ \sigma_q^2(t)\mathbf{I})

where

\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})\, x_t + \sqrt{\bar{\alpha}_{t-1}}(1 - \alpha_t)\, x_0}{1 - \bar{\alpha}_t}, \qquad \sigma_q^2(t) = \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}
Thus, matching the mean of pθ(xt−1∣xt) to μq(xt,x0), the training objective becomes predicting the original data sample from its noisified version after t steps.
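A sketch of this objective (the network `x0_model` and its `(x_t, t)` signature are hypothetical placeholders):

```python
import torch
import torch.nn.functional as F

def x0_prediction_loss(x0_model, x0, alphas):
    """Train a network to recover x_0 from its noised version x_t."""
    T = alphas.shape[0]
    t = torch.randint(1, T + 1, (1,)).item()  # uniform random time step
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    x0_pred = x0_model(x_t, t)  # hypothetical network predicting x_0
    return F.mse_loss(x0_pred, x0)
```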
2nd Training Objective: Predict the Noise
The most popular interpretation, and the training objective in DDPM (Denoising Diffusion Probabilistic Models), however, is to predict the noise added at each step. To derive this equivalent training objective, we rewrite μq(xt,x0) by substituting in

x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0}{\sqrt{\bar{\alpha}_t}}

which yields

\mu_q(x_t, \epsilon_0) = \frac{1}{\sqrt{\alpha_t}}\, x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}\sqrt{\alpha_t}}\, \epsilon_0
Thus, the training objective becomes predicting the noise ϵ0 added at time step t.
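A sketch of the simplified DDPM objective under the same assumptions (`eps_model` is a hypothetical noise-prediction network):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas):
    """Simplified DDPM objective: predict the noise that corrupted x_0."""
    T = alphas.shape[0]
    t = torch.randint(1, T + 1, (1,)).item()
    alpha_bar_t = torch.prod(alphas[:t])
    eps = torch.randn_like(x0)  # the ground-truth noise to predict
    x_t = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    eps_pred = eps_model(x_t, t)  # hypothetical noise-prediction network
    return F.mse_loss(eps_pred, eps)
```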
3rd Training Objective: Predict the Score Function
In my opinion, the most profound understanding of diffusion models is that they are learning a score function. Roughly speaking, the score function is the gradient of the log data density, so given any point, it points toward denser regions of the data distribution. To derive this formulation, we make use of Tweedie's formula. Given z∼N(z;μz,Σz),
\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)
Applying this formula, we have that
\mathbb{E}[\mu_{x_t} \mid x_t] = x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p(x_t)
We've derived that

x_t \sim \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I})

which gives us \mathbb{E}[\mu_{x_t} \mid x_t] = \sqrt{\bar{\alpha}_t}\, x_0. Thus, we can rewrite x0 as

x_0 = \frac{x_t + (1 - \bar{\alpha}_t) \nabla_{x_t} \log p(x_t)}{\sqrt{\bar{\alpha}_t}}
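Comparing this with the earlier rewrite x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_0)/\sqrt{\bar{\alpha}_t} shows the score and the noise differ only by a scaling: \nabla_{x_t} \log p(x_t) = -\epsilon_0 / \sqrt{1 - \bar{\alpha}_t}. A minimal sketch of converting a trained noise predictor into a score estimate (names are illustrative):

```python
import torch

def score_from_eps(eps_model, x_t: torch.Tensor, t: int, alpha_bar_t):
    """Score estimate from a noise predictor:
    score(x_t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    eps_pred = eps_model(x_t, t)
    return -eps_pred / (1.0 - alpha_bar_t) ** 0.5
```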
Now everything is great. We can train a neural network to predict the original sample from noised samples, to predict the noise and denoise, or to predict the score function. But what if we want to guide the generation given some prior information, say we want to generate a cat but we trained on all kinds of animals? The two main guidance methods are classifier guidance and classifier-free guidance.
Classifier Guidance
More concretely, given some condition y, we want to generate samples according to p(x0∣y). Let's decompose the corresponding score using Bayes' rule,

\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)
The first term is the usual unconditional score, and the second term is an adversarial gradient of a classifier p(y∣xt). Thus, in practice, we can train a classifier p(y∣xt) on noised samples to guide the generation process.
To gain fine-grained control over the guidance, we can introduce a hyperparameter γ to scale the adversarial gradient,
\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \gamma \nabla_{x_t} \log p(y \mid x_t)
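As a rough sketch of how this enters sampling (the classifier `classifier_log_prob` returning log p(y∣x_t) and the other names are assumptions; schedule details are omitted):

```python
import torch

def classifier_guided_score(eps_model, classifier_log_prob,
                            x_t, t, y, alpha_bar_t, gamma):
    """Unconditional score plus a scaled classifier gradient
    grad_{x_t} log p(y | x_t), per the equation above."""
    x_t = x_t.detach().requires_grad_(True)
    log_p_y = classifier_log_prob(x_t, t, y)  # hypothetical log p(y | x_t)
    grad_log_p_y = torch.autograd.grad(log_p_y.sum(), x_t)[0]
    uncond_score = -eps_model(x_t, t) / (1.0 - alpha_bar_t) ** 0.5
    return uncond_score + gamma * grad_log_p_y
```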
Classifier-free Guidance
Nevertheless, it is costly, and sometimes impossible, to train a classifier for each new condition. To derive classifier-free guidance, we rewrite the scaled conditional score using \nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t),

\nabla_{x_t} \log p(x_t) + \gamma \nabla_{x_t} \log p(y \mid x_t) = \gamma \nabla_{x_t} \log p(x_t \mid y) + (1 - \gamma) \nabla_{x_t} \log p(x_t)
The first term is the conditional score, and the second term is the unconditional score. In practice, we can train a single neural network that always takes in conditioning information, passing an empty (null) condition for the unconditional score.
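A minimal sketch of classifier-free guidance at sampling time (the conditional network and `null_token` are illustrative assumptions):

```python
def cfg_score(eps_model, x_t, t, y, null_token, alpha_bar_t, gamma):
    """Combine conditional and unconditional predictions as
    gamma * conditional + (1 - gamma) * unconditional."""
    eps_cond = eps_model(x_t, t, y)             # conditioned on y
    eps_uncond = eps_model(x_t, t, null_token)  # "empty" condition
    eps_guided = gamma * eps_cond + (1.0 - gamma) * eps_uncond
    return -eps_guided / (1.0 - alpha_bar_t) ** 0.5  # noise -> score
```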
Toy Implementation
For a toy implementation, you can check this colab notebook.