All About Diffusion & Flow Models
On this page
This article offers a comprehensive overview of diffusion models from multiple perspectives. We begin with the foundations—DDPM, DDIM, and Score Matching—and explore their relationships. From there, we introduce the ODE/SDE framework, showing how DDPM can be derived from stochastic differential equations and how this connects to Flow Matching.
We then highlight key model variants such as Stable Diffusion and Movie Gen, discussing their architectures and applications. Finally, we broaden the scope to examine how diffusion models are being adapted beyond image generation, including diffusion policies in reinforcement learning and their emerging role in large language models (LLMs).
1 Preliminary
Before diving into the DDPM algorithm, we’ll first review some key mathematical concepts that will make the content easier to understand. If you’re already familiar with them, feel free to skip this section and return later only if you need a refresher.
1.1 Multivariate Gaussian Distribution
The probability density function of a random vector \(x \in \mathbb{R}^d\) that follows a multivariate Gaussian distribution with mean vector \(\mu \in \mathbb{R}^d\) and covariance matrix \(\Sigma \in \mathbb{R}^{d \times d}\) is given by:
\[ p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2}(x - \mu)^{\top}\Sigma^{-1}(x - \mu) \right) \tag{1}\]
A special case arises when the covariance matrix is the identity, \(\Sigma = \mathbf{I}_{d} \in \mathbb{R}^{d \times d}\). This is known as the isotropic Gaussian. In deep learning practice, it is common to only predict the mean of the Gaussian, denoted \(\mu_{\theta}\), while assuming an isotropic covariance:
\[ p(x) = \frac{1}{(2\pi)^{d/2}} \exp\left( -\tfrac{1}{2}(x - \mu_{\theta})^{\top}(x - \mu_{\theta}) \right) \tag{2}\]
A fundamental property of Gaussian distributions is that the sum of independent Gaussians is itself Gaussian: if \(x \sim \mathcal{N}(\mu_1, \Sigma_1)\) and \(y \sim \mathcal{N}(\mu_2, \Sigma_2)\) are independent, then:
\[ x + y \sim \mathcal{N}(\mu_1 + \mu_2,\ \Sigma_1 + \Sigma_2) \tag{3}\]
As a simple example, consider two independent random Gaussian variables \(\varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, \mathbf{I}_d)\). Define: \[ \mathrm{x}_1 = \sigma_1 \varepsilon_1, \quad \mathrm{x}_2 = \sigma_2 \varepsilon_2 \] Then, since \(\mathrm{x}_1\) and \(\mathrm{x}_2\) are independent, their sum satisfies:
\[ \begin{split} \mathrm{x}_1 + \mathrm{x}_2 &\sim \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)\mathbf{I}_d) \\ \mathrm{x}_1 + \mathrm{x}_2 &= \sqrt{\sigma_1^2 + \sigma_2^2}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I}_d) \end{split} \tag{4}\]
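As a quick sanity check, a few lines of NumPy confirm Equation 4 empirically (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma1, sigma2 = 0.8, 0.6

# Empirical std of x1 + x2 vs. the closed form sqrt(sigma1^2 + sigma2^2).
x1 = sigma1 * rng.standard_normal(100_000)
x2 = sigma2 * rng.standard_normal(100_000)
print((x1 + x2).std())                 # ~1.0
print(np.sqrt(sigma1**2 + sigma2**2))  # 1.0
```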
1.1.1 Linear Gaussian
A linear Gaussian model specifies the conditional distribution of \(\mathbf{y}\) given \(\mathbf{x}\) as: \[ q(\mathbf{y}\mid \mathbf{x}) = \mathcal{N}\big(\mathbf{A}\mathbf{x} + \mathbf{b}, \ \boldsymbol{\Sigma}\big) \tag{5}\]
where the mean of \(\mathbf{y}\) depends linearly on \(\mathbf{x}\). One simple case is: \[ q(\mathbf{y}\mid \mathbf{x}) =\mathcal{N} \big(\alpha \mathbf{x},\beta\mathbf{I}_{d}\big) \]
An important point to note is that when \(\beta\) is large, the posterior distribution \(q(\mathbf{x}\mid \mathbf{y})\) deviates significantly from being Gaussian. However, in the regime where \(\beta \ll 1\), the posterior is well approximated by a Gaussian. This property is essential when implementing the DDPM, where inference relies on approximating posterior distributions during the reverse diffusion process.
1.2 KL-Divergence & Fisher Divergence
The Kullback–Leibler (KL) divergence is a measure of how one probability distribution \(Q\) diverges from a reference distribution \(P\). It is defined as: \[ D_{\text{KL}}(Q \| P) = \int Q(z) \log \frac{Q(z)}{P(z)} dz = \mathbb{E}_{Q}\left[ \log \frac{Q}{P} \right] \tag{6}\]
Key properties:
- \(D_{\text{KL}} \geq 0,\) with equality if and only if \(Q = P\) almost everywhere.
- It is asymmetric: \(D_{\text{KL}}(Q \| P) \neq D_{\text{KL}}(P \| Q)\).
The Fisher divergence provides another way to measure the discrepancy between two distributions \(Q\) and \(P\), focusing on their score functions (the gradients of the log-densities; we introduce the score function in Section 1.4). It is defined as: \[ D_{F}(Q \| P) = \frac{1}{2} \mathbb{E}_{z \sim Q} \Big[ \big\| \nabla_z \log Q(z) - \nabla_z \log P(z) \big\|^2 \Big] \tag{7}\]
For example, for two multivariate Gaussians \(Q = \mathcal{N}(\mu_q, \Sigma_q)\) and \(P = \mathcal{N}(\mu_p, \Sigma_p)\) of the form Equation 1, the KL divergence has a closed form:
\[ D_{\text{KL}}(Q \| P) = \frac{1}{2} \Big( \mathrm{tr}(\Sigma_p^{-1}\Sigma_q) + (\mu_p - \mu_q)^{\top}\Sigma_p^{-1}(\mu_p - \mu_q) - d + \ln \frac{\det \Sigma_p}{\det \Sigma_q} \Big) \tag{8}\]
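A direct transcription of Equation 8 into NumPy (the function name is ours):

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    # KL(Q || P) between two multivariate Gaussians, Equation 8.
    d = mu_q.shape[0]
    sigma_p_inv = np.linalg.inv(sigma_p)
    diff = mu_p - mu_q
    return 0.5 * (
        np.trace(sigma_p_inv @ sigma_q)
        + diff @ sigma_p_inv @ diff
        - d
        + np.log(np.linalg.det(sigma_p) / np.linalg.det(sigma_q))
    )

# Sanity check: the divergence vanishes when Q and P coincide.
mu, sigma = np.zeros(3), np.eye(3)
print(gaussian_kl(mu, sigma, mu, sigma))  # 0.0
```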
1.3 ELBO
In probabilistic modeling and variational inference, we often want to compute the marginal likelihood of observed data \(x\):
\[ p(x) = \int p(x, z)\,dz = \int p(x \mid z)\,p(z)\,dz \tag{9}\]
where:
- \(z\): the latent variable.
- \(p(z)\): the prior distribution over the latent variable (often assumed Gaussian for continuous variables).
- \(p(x \mid z)\): the likelihood of the data point \(x\).
However, directly computing \(p(x)\) is generally intractable, since the integral is high-dimensional (\(z \in \mathbb{R}^{d}\)) and involves nonlinear functions (e.g., neural networks in generative models).
To address this, we introduce a tractable approximate distribution (also known as the variational distribution) \(Q_{\phi}(z \mid x)\) to approximate the true posterior \(P(z \mid x)\). Now, let’s rewrite the log-likelihood and insert \(Q_{\phi}(z \mid x)\) into the equation:
\[ \begin{split} \log P_{\theta}(\mathrm{x}) &= \log \int P_{\theta}(\mathrm{x}, \mathrm{z}) \, d\mathrm{z} \\ &= \log \int P_{\theta}(\mathrm{x, z}) \frac{Q_{\phi}(\mathrm{z} | \mathrm{x})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \, d\mathrm{z} \\ &= \log \mathbb{E}_{\mathrm{z} \sim Q_{\phi}} \left[ \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] \\ &\geq \boxed{\mathbb{E}_{\mathrm{z} \sim Q_{\phi}} \left[ \log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] } \\ &= \mathbb{E}_{\mathrm{z} \sim Q_{\phi}} \left[ \log\frac{P_{\theta}(\mathrm{x} | \mathrm{z}) P(\mathrm{z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] \\ & = \mathbb{E}_{\mathrm{z} \sim Q_{\phi}}[\log P_{\theta}(\mathrm{x} | \mathrm{z})] - D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z})] \end{split} \tag{10}\]
The inequality follows from Jensen’s inequality (\(\log \mathbb{E}[f] \geq \mathbb{E}[\log f]\), since \(\log\) is concave).
The boxed expectation is the Evidence Lower Bound (ELBO):
\[ \text{ELBO} = \underbrace{ \mathbb{E}_{\mathrm{z} \sim Q_{\phi}}[\log P_{\theta}(\mathrm{x} | \mathrm{z})] }_{ \text{Reconstruction term} }- \underbrace{ D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z})] }_{ \text{Regularization term} } \tag{11}\]
- The first term encourages the model to reconstruct the data well.
- The second term regularizes the approximate posterior \(Q_{\phi}(z \mid x)\) to stay close to the prior \(p(z)\).
Maximizing the ELBO therefore makes \(Q_{\phi}(z \mid x)\) approximate the true posterior, while also maximizing the likelihood of the observed data.
Now, let’s derive the ELBO from another perspective: let’s measure how different \(Q_{\phi}(\mathrm{z} | \mathrm{x})\) and \(P(\mathrm{z}|\mathrm{x})\) are using the KL divergence:
\[ \begin{align} D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z} | \mathrm{x})] & = \mathbb{E}_{\mathrm{z} \sim Q_{\phi}(\mathrm{z} | \mathrm{x})} \left[ \log \frac{Q_{\phi}(\mathrm{z} | \mathrm{x})}{P(\mathrm{z} | \mathrm{x})} \right] \\ & = \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log \frac{Q_{\phi}(\mathrm{z} | \mathrm{x})}{P(\mathrm{z} | \mathrm{x})} \, d\mathrm{z} \\ & = - \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log \frac{P(\mathrm{z} | \mathrm{x})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \, d\mathrm{z} \\ & = - \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log \frac{P(\mathrm{z} | \mathrm{x}) P_{\theta}(\mathrm{x})}{Q_{\phi}(\mathrm{z} | \mathrm{x}) P_{\theta}(\mathrm{x})} \, d\mathrm{z} \\ & = - \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x}) P_{\theta}(\mathrm{x})} \, d\mathrm{z} \\ & = - \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \, d\mathrm{z} + \int Q_{\phi}(\mathrm{z} | \mathrm{x}) \log P_{\theta}(\mathrm{x}) \, d\mathrm{z} \\ & = - \boxed{\mathbb{E}_{\mathrm{z} \sim Q_{\phi}(\mathrm{z} | \mathrm{x})}\left[\log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right]} + \log P_{\theta}(\mathrm{x}) \end{align} \tag{12}\]
This leads to: \[ \log P_{\theta}(\mathrm{x}) = \underbrace{ \boxed{\mathbb{E}_{\mathrm{z} \sim Q_{\phi}(\mathrm{z} | \mathrm{x})}\left[\log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] } }_{ ELBO }+ D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z} | \mathrm{x})] \tag{13}\]
The KL divergence is non-negative, so the log-likelihood is greater than or equal to the ELBO. When the variational distribution \(Q_{\phi}(\mathrm{z} | \mathrm{x})\) matches the true posterior \(P(\mathrm{z} | \mathrm{x})\), the ELBO equals the log-likelihood.
So, in summary, the ELBO is defined as: \[ \text{ELBO} = \mathbb{E}_{\mathrm{z} \sim Q_{\phi}(\mathrm{z} | \mathrm{x})}\left[\log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] = \mathbb{E}_{\mathrm{z} \sim Q_{\phi}}[\log P_{\theta}(\mathrm{x} | \mathrm{z})] - D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z})] \tag{14}\]
One of the most well-known applications of the ELBO in deep learning is the Variational AutoEncoder (Kingma and Welling 2022). The VAE is a generative model that combines probabilistic latent-variable modeling with neural networks. It introduces an encoder network to parameterize the variational distribution \(q_{\phi}(z \mid x)\) and a decoder network to model the likelihood \(p_{\theta}(x \mid z)\). Training the VAE corresponds to maximizing the ELBO, which balances two objectives: (1) accurately reconstructing the input data from latent codes, and (2) regularizing the latent distribution to remain close to a simple prior (typically Gaussian). This makes VAEs powerful tools for both representation learning and generative modeling.
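As a minimal sketch of how the ELBO in Equation 11 turns into a VAE training loss, assuming a Gaussian encoder with diagonal covariance and a Bernoulli decoder (the function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    # Negative ELBO: reconstruction term + KL regularization term.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ].
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```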
1.4 Score function & Langevin Dynamics
The score function of a probability distribution \(p(x)\) is defined as the gradient of its log-density with respect to the variable \(x\): \[ s(x) = \nabla_x \log p(x) \tag{15}\]
The score function points toward regions of higher probability mass. In high-dimensional spaces, where the explicit density \(p(x)\) may be intractable to compute, the score function provides a powerful alternative representation: instead of knowing the density itself, we only need to know the direction in which probability increases.
Langevin dynamics originates from statistical physics and describes the motion of particles subject to both deterministic forces and random noise. In the context of sampling from a distribution \(p(x)\), Langevin dynamics provides a stochastic iterative update rule: \[ x_{t+1} = x_t + \frac{\eta}{2} \nabla_x \log p(x_t) + \sqrt{\eta}\varepsilon_t \quad \varepsilon_t \sim \mathcal{N}(0, I) \tag{16}\]
Here:
- \(\eta > 0\) is the step size
- the gradient term drives samples toward high-probability regions,
- the noise term ensures proper exploration of the space.
This stochastic process converges to the target distribution \(p(x)\) under suitable conditions, making it a foundational method for Markov Chain Monte Carlo (MCMC) sampling.
For example, the score of the Gaussian Distribution is: \[ s(x) = \nabla_x \log p(x) = -\Sigma^{-1}(x - \mu) \tag{17}\]
So, we can run Langevin dynamics as follows: \[ x_{t+1} = x_t + \frac{\eta_t}{2}\left(-\frac{x_t-\mu}{\sigma^2}\right) + \sqrt{\eta_t}\,\varepsilon_t \tag{18}\]
```python
import numpy as np

def langevin_dynamics_update(x, step_size, score):
    # One Langevin step (Equation 16): drift along the score, plus noise.
    noise = np.random.randn(*np.shape(x))
    return x + (step_size / 2.0) * score + np.sqrt(step_size) * noise
```

(Two plots showing Langevin dynamics on a 1-D Gaussian distribution.)
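A hypothetical driver loop for the 1-D Gaussian example (`mu`, `sigma`, `eta`, and the step count are illustrative choices):

```python
mu, sigma, eta = 2.0, 0.5, 0.01
x = 0.0  # arbitrary initialization
for _ in range(5000):
    score = -(x - mu) / sigma**2        # Gaussian score, Equation 17
    x = langevin_dynamics_update(x, eta, score)
# After burn-in, x is approximately a sample from N(mu, sigma^2).
```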
Langevin dynamics is very similar to gradient descent, the algorithm we use to update the parameters of a neural network, which is defined as: \[ x_{t+1} = x_t - \eta \,\nabla f(x_t) \] where \(\eta\) is the learning rate. However, there are several differences:
- Gradient descent is deterministic, while Langevin dynamics is stochastic because of the noise term \(\sqrt{ \eta }\, \varepsilon_t\).
- Gradient descent minimizes an explicit function \(f(x)\), while Langevin dynamics simulates a Markov chain whose stationary distribution is \(p(x)\); with a constant \(\eta\), it generates (approximate) samples from that distribution.
2 DDPM
In this section, we introduce the Denoising Diffusion Probabilistic Model (DDPM) (Ho, Jain, and Abbeel 2020). DDPM in one sentence: a generative model that creates realistic data by learning to reverse a step-by-step noising process, gradually denoising random noise into meaningful samples. The central idea of DDPM is to take each training image and corrupt it with a multi-step noise process that transforms it into a sample from a Gaussian distribution. Then a neural network is trained to invert this process. Once the network is trained, it can generate new images starting from Gaussian samples.
We will derive three predictors:
- Image predictor \(\hat{\mathrm{x}}_{\theta}\)
- Mean Predictor \(\hat{\mu}_{\theta}\)
- Noise Predictor \(\hat{\epsilon}_{\theta}\)
It is fine if these are unfamiliar right now; each will be derived below.

2.1 Forward and Backward Diffusion Process
For the forward diffusion process, we gradually add Gaussian noise to the data until it becomes (approximately) pure Gaussian noise. Mathematically, each step can be expressed as:
\[ p(\mathrm{x}_{t} | \mathrm{x}_{t- 1}) = \mathcal{N}(\mathrm{x}_{t}; \sqrt{ 1 - \beta_{t} }\mathrm{x}_{t -1}, \beta_{t} \mathbf{I}_{d}) \]
where \(\{ \beta_{t} \in (0, 1) \}_{t = 1}^{T}\) and \(\beta_{1} \leq \beta_{2} \leq \dots \leq \beta_{T}\). The \(\mathrm{x}_{t}\) can be expressed as: \[ \mathrm{x}_{t} = \sqrt{ 1 - \beta_{t} }\mathrm{x}_{t -1} +\sqrt{ \beta_{t} } \epsilon_{t} \]
There are many different choice of \(\beta\):
- Learned
- Constant
- Linearly or quadratically increased
- Follows a cosine function
One thing to notice is that \(\beta_{t} \ll 1\), which ensures that \(p_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})\) can be well approximated by a Gaussian distribution (see Section 1.1.1). Looking at the expression for \(\mathrm{x}_{t}\), we can see that it depends on \(\mathrm{x}_{t-1}\), which in turn depends on \(\mathrm{x}_{t-2}\), and so on. So \(\mathrm{x}_{t}\) can also be expressed directly in terms of \(\mathrm{x}_{0}\):
\[ \mathrm{x}_{t} = \sqrt{ \alpha_{t} }\mathrm{x}_{0}+ \sqrt{ 1-\alpha_{t} } \epsilon_{t} \quad \text{where}\ \alpha_{t} = \prod_{\tau=1}^{t}(1 - \beta_{\tau}) \]
(Note: this cumulative product is often written \(\bar{\alpha}_{t}\), with \(\alpha_{t} = 1 - \beta_{t}\); we switch to that convention in later sections.)
This is called the forward process, and the whole forward process forms a Markov chain:
\[ p(\mathrm{x}_{0}, \mathrm{x}_{1: T}) = p(\mathrm{x}_{0}) \prod_{t = 1}^{T}p(\mathrm{x}_{t} | \mathrm{x}_{t -1 }) \]
Let’s quickly summarize the forward process and get familiar with the following equations:
\[ \begin{split} \mathrm{x}_{t} & = \sqrt{ 1 - \beta_{t} }\mathrm{x}_{t -1} +\sqrt{ \beta_{t} } \epsilon_{t} \\ \mathrm{x}_{t} & = \sqrt{ \alpha_{t} }\mathrm{x}_{0}+ \sqrt{ 1-\alpha_{t} } \epsilon_{t} \\ \mathrm{x}_{0} & = \frac{\mathrm{x}_{t} - \sqrt{ 1-\alpha_{t} } \epsilon_{t}}{\sqrt{ \alpha_{t} }} \\ \epsilon_t & = \frac{\mathrm{x}_t - \sqrt{\alpha_t} \mathrm{x}_0 }{\sqrt{1 - \alpha_t}} \end{split} \]
From the last three equations, we can conclude that as long as we know two of the three quantities \(\mathrm{x}_{0}, \mathrm{x}_{t}, \epsilon_{t}\), we can recover the third. This is very useful when training the DDPM, as sketched below.
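Here is a sketch of the one-shot forward sampler implied by the second equation (`alphas_bar` holds the cumulative products, called \(\alpha_t\) here and \(\bar{\alpha}_t\) later; all names are ours):

```python
import torch

def q_sample(x0, t, alphas_bar):
    # Sample x_t ~ q(x_t | x_0) in a single shot, returning the noise too.
    shape = (-1, *[1] * (x0.dim() - 1))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].sqrt().view(shape)
    b = (1 - alphas_bar[t]).sqrt().view(shape)
    return a * x0 + b * eps, eps
```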
Backward process: the backward process goes from \(\mathrm{x}_{t}\) to \(\mathrm{x}_{t-1}\), and can be expressed as: \[ p(\mathrm{x}_{t - 1} | \mathrm{ x}_{t}) = \int p(\mathrm{x}_{t - 1} | \mathrm{x}_{t}, \mathrm{x}_{0})p(\mathrm{x}_{0} | \mathrm{x}_{t}) d\mathrm{x}_{0} \]
This is intractable because of the marginalization over \(\mathrm{x}_{0}\). However, if we also condition on \(\mathrm{x}_{0}\), we get: \[ p(\mathrm{x}_{t - 1} | \mathrm{x}_{t}, \mathrm{x}_{0}) = \frac{q(\mathrm{x}_{t} | \mathrm{x}_{t - 1}, \mathrm{x}_{0}) q(\mathrm{x}_{t - 1} | \mathrm{x}_{0})}{q(\mathrm{x}_{t} | \mathrm{x}_{0})} \]
By the Markov property of the forward process, we have: \[ q(\mathrm{x}_{t} | \mathrm{x}_{t - 1}, \mathrm{x}_{0}) = q(\mathrm{x}_{t} | \mathrm{x}_{t - 1}) \] which we know exactly, and we also know \(q(\mathrm{x}_{t} | \mathrm{x}_{0})\). So we know exactly what \(p(\mathrm{x}_{t - 1} | \mathrm{x}_{t}, \mathrm{x}_{0})\) is: \[ \begin{split} p(\mathrm{x}_{t - 1} | \mathrm{x}_{t}, \mathrm{x}_{0}) & = \mathcal{N}(\mathrm{x}_{t -1} | \mu_{t}(\mathrm{x}_{0}, \mathrm{x}_{t}), \sigma_{t}^{2}\mathbf{I}_{d}) \\ \\ \quad \text{where}\ \mu_{t}(\mathrm{x}_{0}, \mathrm{x}_{t}) & = \frac{(1 - \alpha_{t - 1})\sqrt{ 1- \beta _{t}}\mathrm{x}_{t} + \sqrt{ \alpha_{t - 1} }\beta_{t}\mathrm{x}_{0}}{1 -\alpha_{t}} \\ \sigma_{t}^{2} & = \frac{\beta_{t}( 1- \alpha_{t - 1})}{1 -\alpha_{t}} \end{split} \]
However, when we are generating samples, we don’t know \(\mathrm{x}_{0}\). That is why we need to train a neural network to approximate it. Next, we derive the loss function needed to train that network.
2.2 Loss Function
Let’s first derive the loss function of the DDPM. DDPM can be viewed as a hierarchical VAE, so we can derive the loss function using the ELBO (Equation 14). Recall the relation between the log-likelihood and the ELBO:
\[ \begin{align} \log P_{\theta}(\mathrm{x}) & = ELBO + D_{KL}[Q_{\phi}(\mathrm{z} | \mathrm{x}) \| P(\mathrm{z} | \mathrm{x})] \\ & \geq \mathbb{E}_{\mathrm{z} \sim Q_{\phi}(\mathrm{z} | \mathrm{x})}\left[\log \frac{P_{\theta}(\mathrm{x, z})}{Q_{\phi}(\mathrm{z} | \mathrm{x})} \right] \\ & = \mathbb{E}_{\mathrm{x}_{1:T} \sim Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0})}\left[\log \frac{P_{\theta}(\mathrm{x}_{0}, \mathrm{x}_{1:T})}{Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \right] \end{align} \] where \(\mathrm{x}_{1:T}\) are the latent variables.
One nice property of DDPM is that we know the posterior distribution \(Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0})\) exactly: \[ Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0}) = \prod_{t=1}^{T}Q(\mathrm{x}_{t} | \mathrm{x}_{t - 1}) \]
So, the ELBO becomes: \[ \begin{align} ELBO & = \mathbb{E}_{\mathrm{x}_{1:T} \sim Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0})}\left[\log \frac{P_{\theta}(\mathrm{x}_{0}, \mathrm{x}_{1:T})}{Q_{\phi}(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \right] \\ & = \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}}) \prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})P(\mathrm{x}_{T})} {Q(\mathrm{x}_{1} | \mathrm{x}_{0})\prod_{t=2}^{T}Q(\mathrm{x}_{t} | \mathrm{x}_{t - 1})} \right] \\ & = \mathbb{E}_{ Q(\mathrm{x}_{T} | \mathrm{x}_{0})} [\log P(\mathrm{x}_{T})] + \mathbb{E}_{ Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \frac{\prod_{t=2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})}{\prod_{t=2}^{T}Q(\mathrm{x}_{t} | \mathrm{x}_{t - 1})} \right] + \mathbb{E}_{ Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \left[ \log \frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x}_{1})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right] \\ & = \mathbb{E}_{ Q(\mathrm{x}_{T} | \mathrm{x}_{0})} [\log P(\mathrm{x}_{T})] + \sum_{t=2}^{T}\mathbb{E}_{ Q(\mathrm{x}_{t-1}, \mathrm{x_{t}} | \mathrm{x}_{0})}\left[ \log \frac{P_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})}{Q(\mathrm{x}_{t} | \mathrm{x}_{t-1})} \right] + \mathbb{E}_{ Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \left[ \log \frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x}_{1})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right] \end{align} \]
As we can see, to estimate the second term we need to sample two random variables, \(\mathrm{x_{t}}\) and \(\mathrm{x_{t-1}}\), which yields a very noisy, high-variance estimate. So we rewrite the ELBO into a lower-variance form; using Bayes’ rule, we get:
\[ \begin{align} ELBO &= \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \left( P(\mathrm{x}_{T})\frac{\prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {\prod_{t=2}^{T}Q(\mathrm{x}_{t} | \mathrm{x}_{t - 1})} \frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right) \right] \\ &= \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \left( P(\mathrm{x}_{T})\frac{\prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {\prod_{t=2}^{T}Q(\textcolor{green}{\mathrm{x}_{t} | \mathrm{x}_{t - 1}, \mathrm{x}_{0}})} \frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right) \right] \\ &= \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \left( P(\mathrm{x}_{T})\frac{\prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {\prod_{t=2}^{T}Q(\textcolor{green}{\mathrm{x}_{t-1} | \mathrm{x}_{t}, \mathrm{x}_{0}})} \frac{Q(\mathrm{x}_{t-1} | \mathrm{x}_{0})}{Q(\mathrm{x}_{t}|\mathrm{x}_{0})}\frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right) \right] \\ &= \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \left( P(\mathrm{x}_{T})\frac{\prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {\prod_{t=2}^{T}Q(\textcolor{green}{\mathrm{x}_{t-1} | \mathrm{x}_{t}, \mathrm{x}_{0}})} \frac{Q(\mathrm{x}_{1} | \mathrm{x}_{0})}{Q(\mathrm{x}_{T}|\mathrm{x}_{0})}\frac{P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}})}{Q(\mathrm{x}_{1} | \mathrm{x}_{0})} \right) \right] \\ &= \mathbb{E}_{\mathrm{x}_{1:T} \sim Q(\mathrm{x}_{1:T} | \mathrm{x}_{0})} \left[ \log \left( \frac{P(\mathrm{x}_{T})}{Q(\mathrm{x}_{T}|\mathrm{x}_{0})} \frac{\prod_{t = 2}^{T}P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {\prod_{t=2}^{T}Q(\textcolor{green}{\mathrm{x}_{t-1} | \mathrm{x}_{t}, \mathrm{x}_{0}})}P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}}) \right) \right] \\ & = \mathbb{E}_{\mathrm{x}_{T} \sim Q(\mathrm{x}_{T} | \mathrm{x}_{0})} \left[\log \frac{P(\mathrm{x}_{T})}{Q(\mathrm{x}_{T}|\mathrm{x}_{0})} \right] + \sum_{t=2}^{T} \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} |\mathrm{x}_{0} )} \left[\log \frac{P_{\theta}(\mathrm{x_{t - 1} | \mathrm{x}_{t}})} {Q(\textcolor{green}{\mathrm{x}_{t-1} | \mathrm{x}_{t}, \mathrm{x}_{0}})} \right] + \mathbb{E}_{\mathrm{x}_{1} \sim Q(\mathrm{x}_{1} | \mathrm{x}_{0})}[\log P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}}) ] \\ & = -D_{KL}[Q(\mathrm{x}_{T} | \mathrm{x}_{0}) \| P(\mathrm{x}_{T})] \\ & \quad - \sum_{t=2}^{T}\mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} |\mathrm{x}_{0} )} [D_{KL}[Q(\mathrm{x}_{t-1} |\mathrm{x}_{t}, \mathrm{x}_{0}) \| P_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})] ]\\& \quad + \mathbb{E}_{\mathrm{x}_{1} \sim Q(\mathrm{x}_{1} | \mathrm{x}_{0})}[\log P_{\theta}(\mathrm{x}_{0} | \mathrm{x_{1}}) ] \end{align} \]
The first term is the prior matching term; it contains no trainable parameters and is essentially constant, so there is no need to optimize it. The third term is the reconstruction term, which is negligible in practice: the variance schedule makes it almost constant, and its learning signal is weak compared to the denoising terms. Now let’s examine the most complex term. The second term is the consistency term, a KL divergence (Equation 6) between two Gaussian distributions, which has a closed form: \[ \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} |\mathrm{x}_{0} )} [D_{KL}[Q(\mathrm{x}_{t-1} |\mathrm{x}_{t}, \mathrm{x}_{0}) \| P_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})] ] = \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} | \mathrm{x}_{0})} \left [\frac{1}{2\tilde{\sigma}_{t}^{2}}\| \mu_{\theta}(\mathrm{x}_{t}, t) - \tilde{\mu}(\mathrm{x}_{t}, \mathrm{x}_{0})\|^{2} \right] \]
Since we know \(\tilde{\mu}(\mathrm{x}_{t}, \mathrm{x}_{0})\) exactly, we can optimize this term with the following procedure (a code sketch follows the list):
- Sample \(\mathrm{x}_{0}\) and \(t\)
- Sample \(\mathrm{x}_{t} \sim q(\mathrm{x}_{t} \mid \mathrm{x}_{0}) = \mathcal{N}(\sqrt{ \alpha_{t} }\,\mathrm{x}_{0}, (1 - \alpha_{t})\mathbf{I}_{d})\)
- Pass \(\mathrm{x}_{t}\) and \(t\) to the neural network \(\mu_{\theta}\)
- Compute the mean squared error against \(\tilde{\mu}(\mathrm{x}_{t}, \mathrm{x}_{0})\) and update the parameters \(\theta\)
- Repeat
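Here is a minimal sketch of one training step for the mean predictor, reusing `q_sample` from Section 2.1 (`model`, `posterior_mean`, and the indexing convention `alphas_bar[0] = 1` are our assumptions, not fixed by the text):

```python
import torch

def posterior_mean(x0, x_t, t, alphas_bar, betas):
    # Closed-form posterior mean tilde-mu(x_t, x_0) from Section 2.1.
    shape = (-1, *[1] * (x0.dim() - 1))
    a_bar_t = alphas_bar[t].view(shape)
    a_bar_prev = alphas_bar[t - 1].view(shape)   # assumes alphas_bar[0] = 1
    beta_t = betas[t].view(shape)
    return ((1 - a_bar_prev) * (1 - beta_t).sqrt() * x_t
            + a_bar_prev.sqrt() * beta_t * x0) / (1 - a_bar_t)

def train_step_mean(model, x0, alphas_bar, betas, opt):
    t = torch.randint(1, len(betas), (x0.shape[0],))      # sample t
    x_t, _ = q_sample(x0, t, alphas_bar)                  # sample x_t | x_0
    target = posterior_mean(x0, x_t, t, alphas_bar, betas)
    loss = ((model(x_t, t) - target) ** 2).mean()         # MSE on the means
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```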
Now let’s see how we can predict \(\mathrm{x}_{0}\) directly with the neural network, by rewriting the consistency term:
\[ \begin{align} \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} |\mathrm{x}_{0} )} [D_{KL}[Q(\mathrm{x}_{t-1} |\mathrm{x}_{t}, \mathrm{x}_{0}) \| P_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})] ] &= \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} | \mathrm{x}_{0})} \left [\frac{1}{2\tilde{\sigma}_{t}^{2}}\| \mu_{\theta}(\mathrm{x}_{t}, t) - \tilde{\mu}(\mathrm{x}_{t}, \mathrm{x}_{0})\|^{2} \right] \\ & = \frac{1}{2 \tilde{\sigma}_t^2} \cdot \frac{\bar{\alpha}_{t-1} \beta_t^2}{(1-\bar{\alpha}_t)^2} \mathbb{E}_{\mathrm{x}_{t}\sim Q(\mathrm{x}_t|\mathrm{x}_0)} \left[ \| \hat{\mathrm{x}}_\theta(\mathrm{x}_t, t) - \mathrm{x}_0 \|^2 \right] \\ & =\omega_t \mathbb{E}_{\mathrm{x}_{t}\sim Q(\mathrm{x}_t|\mathrm{x}_0)} \left[ \| \hat{\mathrm{x}}_\theta(\mathrm{x}_t, t) - \mathrm{x}_0 \|^2 \right] \end{align} \]
As we can see, the \(\hat{\mathrm{x}}_{\theta}\) objective is equivalent to the \(\mu_{\theta}\) objective up to the constant weight \(\omega_{t}\).
Finally, let’s derive the noise predictor \(\hat{\varepsilon}_{\theta}\): \[ \begin{align} \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} |\mathrm{x}_{0} )} [D_{KL}[Q(\mathrm{x}_{t-1} |\mathrm{x}_{t}, \mathrm{x}_{0}) \| P_{\theta}(\mathrm{x}_{t-1} | \mathrm{x}_{t})] ] &= \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_{t} | \mathrm{x}_{0})} \left [\frac{1}{2\tilde{\sigma}_{t}^{2}}\| \mu_{\theta}(\mathrm{x}_{t}, t) - \tilde{\mu}(\mathrm{x}_{t}, \mathrm{x}_{0})\|^{2} \right] \\ & = \frac{1}{2 \tilde{\sigma}_t^2} \cdot \frac{\beta_t^2}{\alpha_t (1-\bar{\alpha}_t)} \mathbb{E}_{\mathrm{x}_{t} \sim Q(\mathrm{x}_t|\mathrm{x}_0)} \left[ \| \hat{\varepsilon}_\theta(\mathrm{x}_t, t) - \varepsilon_t \|^2 \right] \\ & =\omega_{t}' \mathbb{E}_{\mathrm{x}_{t}\sim Q(\mathrm{x}_t|\mathrm{x}_0)} \left[ \| \hat{\varepsilon}_\theta(\mathrm{x}_t, t) - \varepsilon_t \|^2 \right] \end{align} \] (Here \(\alpha_t = 1 - \beta_t\).)
In summary, from the DDPM objective we have derived three different predictors:
- Mean Predictor
- \(x_{0}\) Predictor
- Noise Predictor


In practice, we can simply drop the weighting term \(\omega_{t}'\) during training and use the noise predictor with a plain MSE loss.
2.3 Sampling from DDPM
Let’s recap the key equations of the diffusion model before turning to sampling.
Forward diffusion process: \[ q(\mathrm{x}_{t} | \mathrm{x}_{t - 1}) =\mathcal{N}( \mathrm{x}_{t}; \sqrt{ 1 - \beta_{t} }\mathrm{x}_{t-1}, \beta_{t}\mathbf{I} ) \]
\[ \mathrm{x}_{t} = \sqrt{ 1 - \beta_{t} }\mathrm{x}_{t-1} + \sqrt{ \beta_{t} }\epsilon_{t}, \quad \text{where} \ \epsilon_{t} \sim \mathcal{N}(0, \mathbf{I}) \]
\[ \mathrm{x}_{t} = \sqrt{ \bar{\alpha}_{t} }\mathrm{x_{0}} + \sqrt{ 1 - \bar{\alpha}_{t} }\epsilon \]
Langevin dynamics: \[ \mathrm{x}_{t} = \mathrm{x}_{t-1} + \frac{\eta}{2} \nabla_{\mathrm{x}} \log p(\mathrm{x}_{t-1}) + \sqrt{\eta}\,\epsilon_{t}, \quad \epsilon_{t} \sim \mathcal{N}(0, \mathbf{I}) \]
Backward diffusion process: \[\begin{align} & p_{\theta}(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \\ & p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \mu_{\theta}(\mathbf{x}_t, t), \Sigma_{\theta}(\mathbf{x}_t, t)\right) \end{align}\]
The backward process above is intractable. One thing to notice, though, is that it becomes tractable when we condition on \(\mathrm{x}_{0}\): \[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1}; \textcolor{blue}{\tilde{\mu}(\mathbf{x}_t, \mathbf{x}_0)}, \, \textcolor{red}{\tilde{\beta}_t \mathbf{I}}\right) \] where: \[ \begin{align} \tilde{\mu}_t & = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t\right) \\ & = \textcolor{cyan}{\frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \,\epsilon_t\right)} \end{align} \]
So, the loss function becomes: \[\begin{align} \mathcal{L}_t^{\text{simple}} & = \mathbb{E}_{t \sim [1,T], \mathbf{x}_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(\mathbf{x}_t, t) \right\|^2 \right] \\ & = \mathbb{E}_{t \sim [1,T], \mathbf{x}_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t,\, t \right) \right\|^2 \right] \end{align}\]
and the full loss is: \[ \mathcal{L} = \mathcal{L}_{t}^{\text{simple}} + C \] where \(C\) is some constant that does not depend on \(\theta\).
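A single training step with the simplified loss might look like this (a sketch reusing the hypothetical `q_sample` from Section 2.1; `model` is the noise predictor):

```python
import torch

def train_step_simple(model, x0, alphas_bar, opt):
    # One step of L_simple: predict the injected noise with plain MSE.
    t = torch.randint(1, len(alphas_bar), (x0.shape[0],))
    x_t, eps = q_sample(x0, t, alphas_bar)   # x_t = sqrt(abar) x0 + sqrt(1-abar) eps
    loss = ((model(x_t, t) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```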
2.4 Time Embedding
```python
import math
import torch

def get_timestep_embedding(timesteps, embedding_dim):
    """
    Build sinusoidal embeddings.
    This matches the implementation in Denoising Diffusion Probabilistic
    Models (originally from Fairseq / tensor2tensor), and differs slightly
    from the description in Section 3.5 of "Attention Is All You Need".
    """
    assert len(timesteps.shape) == 1
    half_dim = embedding_dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
    emb = timesteps.float()[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
    if embedding_dim % 2 == 1:  # zero pad
        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
    return emb
```

2.5 Sampling
After training the noise predictor, we can sample from \(p_{\text{init}} = \mathcal{N}(0, \mathbf{I})\) and convert that sample into one from \(p_{\text{data}}\). This is relatively simple: start from pure noise \(\mathrm{x}_{T}\) and iteratively apply the learned reverse transitions, as sketched below.
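A minimal ancestral-sampling sketch, assuming the trained noise predictor `model` and the schedule tensors from the earlier sketches (names and indexing are ours):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, alphas_bar, betas):
    alphas = 1.0 - betas
    x = torch.randn(shape)                        # x_T ~ p_init = N(0, I)
    for t in range(len(betas) - 1, 0, -1):
        t_b = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(x, t_b)
        # Posterior mean rewritten in terms of the noise predictor.
        coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        sigma = (betas[t] * (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t])).sqrt()
        x = mean + sigma * torch.randn_like(x) if t > 1 else mean
    return x
```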
3 Score Matching
In the previous section, we derived the DDPM and saw how to train it and sample data points. In this section, let’s see how the DDPM relates to score matching (Song et al. 2021). Recall that the score is the gradient of the log-likelihood with respect to a data point, so the score of \(q(\mathrm{x}_{t} | \mathrm{x_{0}})\) is:
\[ \nabla_{x_t} \log q(x_t \mid x_0) = \nabla_{x_t} \left( - \frac{\|x_t - \sqrt{\bar{\alpha}_t} x_0\|^2}{2(1 - \bar{\alpha}_t)} \right) = - \frac{x_t - \sqrt{\bar{\alpha}_t} x_0}{1 - \bar{\alpha}_t} \] We know that \(\varepsilon_t= \frac{1}{\sqrt{1 - \bar{\alpha}_t}}\left( x_t - \sqrt{\bar{\alpha}_t}\, x_0 \right)\); plugging this into the equation, we get:
\[ \nabla_{x_t} \log q(x_t \mid x_0) = -\frac{\varepsilon_t}{\sqrt{ 1 - \bar{\alpha}_t }} \]
As we can see, once we have trained the noise predictor \(\hat{\varepsilon}_{\theta}\), we get the score of \(q(\mathrm{x}_{t} | \mathrm{x_{0}})\), up to the scaling factor \(- \frac{1}{\sqrt{1 - \bar{\alpha}_t}}\), for free!
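In code, this conversion from a noise prediction to a score estimate is a one-liner (a sketch; `alphas_bar` is the tensor of cumulative products \(\bar{\alpha}_t\) as before):

```python
def score_from_eps(eps_hat, t, alphas_bar):
    # Score of q(x_t | x_0) recovered from the noise prediction.
    return -eps_hat / (1 - alphas_bar[t]).sqrt()
```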
This is good, but at sampling time we do not know \(\mathrm{x}_{0}\), so we cannot evaluate this conditional score directly. What we actually need is the marginal score, which we approximate with a neural network:
\[ s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t) \]
\[ \mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \varepsilon} \Big[ \lambda(t)\, \big\| s_\theta(x_t, t) + \tfrac{1}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon \big\|^2 \Big] \]
Sampling then follows the (discretized) Langevin update:
\[ x_{t-\Delta t} = x_t + \Delta t \, s_\theta(x_t, t) + \sqrt{2\Delta t}\,\varepsilon \]
4 Conditioned Generation
So far, the DDPM generates images unconditionally. How can we generate content given some condition \(y\), such as a text prompt?
4.1 Classifier Guidance
4.2 Classifier-Free Guidance
5 Speed Up Diffusion Models
5.1 DDIM
DDIM is deterministic: it removes the stochastic noise injection from the reverse process, which allows sampling with far fewer steps.
5.2 Progressive Distillation
As proposed in Progressive Distillation for Fast Sampling of Diffusion Models (Salimans and Ho 2022).
5.3 Consistency Models
As proposed in (Song et al. 2023).

5.4 Latent Diffusion Model
The Latent Diffusion Model runs the diffusion process in the latent space of a Variational Autoencoder rather than in pixel space.
5.5 Score Matching
\[ \nabla_{x_t} \log q(x_t|x_0) = \nabla_x \left( - \frac{\| x_t - \sqrt{\bar{\alpha}_t} x_0 \|^2}{2(1-\bar{\alpha}_t)} \right) = - \frac{x_t - \sqrt{\bar{\alpha}_t} x_0}{1-\bar{\alpha}_t} \]
\[ \nabla_{x_t} \log q(x_t|x_0) = - \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1-\bar{\alpha}_t} = - \frac{\varepsilon_t}{\sqrt{1-\bar{\alpha}_t}} \]
So, the noise predictor \(\hat{\varepsilon}_{\theta}\) can be interpreted as predicting the score \(\nabla_{x_t} \log q(x_t|x_0)\) up to a scaling factor \(- \frac{1}{\sqrt{1-\bar{\alpha}_t}}\).
According to Tweedie’s formula, we have: \[ \nabla_{x_t} \log q(x_t) = - \frac{x_t - \sqrt{\bar{\alpha}_t}\,\mathbb{E}[x_0 \mid x_t]}{1-\bar{\alpha}_t} \]


This is the idea behind Noise-Conditional Score-Based Models.

So, the solution is Annealed Langevin Dynamics: run Langevin dynamics with a sequence of decreasing noise levels. At the beginning (when \(\sigma_{t}\) is large), the smoothed density lets samples move quickly through low-density regions; as time progresses (and \(\sigma_{t}\) decreases), the samples are gradually refined toward the data distribution.
6 From ODE and SDE view point
OK, we have learned a lot about the DDPM from the probabilistic-model view. Let’s switch gears, discuss how to obtain the DDPM from the stochastic-process view, and derive Flow Matching. In this part we will mainly focus on Flow Matching, which is based on ODEs; score matching can then be understood through the corresponding SDE.
6.1 ODE vs. SDE
Before talking about ODEs and SDEs, let’s first go over some concepts to solidify our understanding. DDPM can be viewed as a discretized version of an SDE, and the SDE can be viewed as the continuous-time version of DDPM.
6.1.1 Vector Field
A vector field is a function that assigns a vector to every point in space. For example, imagine a weather map: at each location, an arrow shows the wind’s direction and strength. That arrow map is a vector field.

\[ F: \mathbb{R}^{n} \to \mathbb{R}^{n} \]
Every ODE is defined by a vector field \(u\) that takes two variables, \(\mathrm{x}\) and \(t\): \[ u: \mathbb{R}^{d} \times [0, 1] \to \mathbb{R}^{d}, \quad (x, t) \mapsto u_{t}(x) \] so for every time \(t\) and location \(\mathrm{x}\), we get a vector \(u_{t}(\mathrm{x}) \in \mathbb{R}^{d}\) pointing in some direction. Imagine a point on the weather map: \(x\) is a location on the map, and \(u_{t}(x)\) tells \(x\) which direction to move next.






Note that the vector field is time-dependent: because of the random start point \(\mathrm{x}_{0}\), different trajectories may arrive at the same location at different times.




\[ \begin{align} \frac{d}{dt}\mathrm{x}_{t} &= u_{t}(\mathrm{x}_{t}) \\ \mathrm{x}_{t=0} &= x_{0} \end{align} \]
So another question we want to ask is: if we start at \(x_{0}\), where are we at time \(t\)? This is answered by the flow, which is the solution map of the ODE:
\[\begin{align} \psi : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d &, \quad (x_0, t) \mapsto \psi_t(x_0) \\ \frac{d}{dt} \psi_t(x_0) & = u_t(\psi_t(x_0)) \\ \psi_0(x_0)& = x_0 \end{align}\]
Our goal is to choose \(u_t\) so that the endpoint satisfies \(\mathrm{x}_{1} \sim p_{\text{data}}\). In general, we cannot solve the ODE in closed form, but we can use numerical integration. One of the simplest and most intuitive methods is the Euler method:
\[ \mathrm{x}_{t + h} = \mathrm{x}_{t} + h u_{t}(\mathrm{x}_{t}) \quad (t = 0, h, 2h, 3h, \dots, 1- h) \]
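A minimal Euler integrator sketch (the callable `u(x, t)` and the step count are our assumptions):

```python
def euler_integrate(u, x0, n_steps=100):
    # Integrate dx/dt = u_t(x) from t = 0 to t = 1 with Euler steps.
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + h * u(x, i * h)   # x_{t+h} = x_t + h * u_t(x_t)
    return x

# Example: the field u_t(x) = -x contracts toward 0, so x_1 ~= x_0 * e^{-1}.
print(euler_integrate(lambda x, t: -x, 1.0))  # ~0.366
```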
Stochastic Differential Equations (SDEs) extend ODEs with stochastic (random) trajectories; the resulting object is known as a stochastic process. The randomness is added through Brownian motion. A Brownian motion \(W = (W_{t})_{0\leq t \leq 1}\) is a stochastic process such that:
- \(W_{0} = 0\)
- Normal increments: \(W_{t} - W_{s} \sim \mathcal{N}(0, (t - s)\mathbf{I}_{d})\) for all \(0 \leq s \leq t\)
- Independent increments
Brownian Motion is also known as Wiener Process: \[ W_{t + h} = W_{t} + \sqrt{ h }\epsilon_{t}, \quad \text{where} \ \epsilon_{t} \sim \mathcal{N}(0, \mathbf{I}_{d}) \]
A classical example of an SDE is the Ornstein–Uhlenbeck (OU) process. The Euler–Maruyama method is the SDE analogue of the Euler method: it advances the state with a drift step plus a scaled Brownian increment, as sketched below.
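A sketch of Euler–Maruyama under these definitions (`u(x, t)` and `sigma(t)` are assumed callables; this is illustrative, not a specific diffusion schedule):

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps=100):
    # Simulate dX_t = u_t(X_t) dt + sigma_t dW_t on [0, 1].
    x, h = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        dW = np.sqrt(h) * np.random.randn(*x.shape)  # Brownian increment
        x = x + h * u(x, i * h) + sigma(i * h) * dW
    return x

# Example: OU-like dynamics with linear drift toward 0 and unit diffusion.
sample = euler_maruyama(lambda x, t: -x, lambda t: 1.0, np.zeros(3))
```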


6.2 Conditional Vector Field & Marginal Vector Field
Given a data point \(\mathrm{z}\), we can construct a conditional vector field such that: \[ \frac{d}{dt}\mathrm{x}_{t} = u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) \] where, by following the ODE defined by \(u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z})\), \(\mathrm{x}_{t}\) ends at the data point \(\mathrm{z}\). However, what we actually want is the marginal vector field: \[ \begin{split} u_{t}^{\text{target}}(\mathrm{x}_{t} ) &= \int u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) p(\mathrm{z} | \mathrm{x}_{t}) \, d\mathrm{z} \\ &= \int u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) \frac{p_{t}(\mathrm{x}_{t}|\mathrm{z})p_{\text{data}}(\mathrm{z})}{p_{t}(\mathrm{x}_{t})} \, d\mathrm{z} \\ \end{split} \]
This satisfies the property we want; it can be derived through the continuity equation.
\[ \begin{split} \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t \sim [0,1], \mathrm{z} \sim p_{data}, \mathrm{x_{t}} \sim p_{t}(\mathrm{x}_{t} | \mathrm{z})} \left[\| u_{t}^{\theta}(\mathrm{x}_{t}) - u_{t}^{\text{target}}(\mathrm{x}_{t}) \|^{2} \right] \end{split} \]
One problem is that \(u_{t}^{\text{target}}(\mathrm{x}_{t})\) is intractable, due to the marginalization over the high-dimensional \(\mathrm{z}\).
Let’s rewrite \(\mathcal{L}_{FM}\) using the fact that \(\|a- b \|^{2} = \|a\|^{2} - 2a^{\top}b + \|b\|^{2}\):
\[ \begin{split} \mathcal{L}_{FM}(\theta) &= \mathbb{E}_{t\sim[0,1],\, z\sim p_{\rm data},\, x_t\sim p_t(x_t|z)} \left[\|u^\theta_t(x_t)-u^{\rm target}_t(x_t)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)\|^2\right] -2\,\mathbb{E}_{t \sim \text{Unif},\, x \sim p_t(\cdot|z)}\left[u^\theta_t(x_t)^{T} u^{\rm target}_t(x_t)\right] +\underbrace{ \mathbb{E}_{t \sim \text{Unif},\, x \sim p_t(\cdot|z)}\left[\|u^{\rm target}_t(x_t)\|^2\right] }_{ C_{1} } \end{split} \]
As we can see, the third term is the constant w.r.t to the \(\theta\), let check the second term: \[ \begin{split} \mathbb{E}_{t \sim \text{Unif},\, x \sim p_t} \!\left[u_t^\theta(x)^{T} u_t^{\text{target}}(x)\right] &\overset{(i)}{=} \int_0^1 \!\!\int p_t(x)\, u_t^\theta(x)^{T} u_t^{\text{target}}(x)\, dx\, dt \\ &\overset{(ii)}{=} \int_0^1 \!\!\int p_t(x)\, u_t^\theta(x)^{T} \left[\int u_t^{\text{target}}(x|z)\, \frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\, dz \right] dx\, dt \\ &\overset{(iii)}{=} \int_0^1 \!\!\int\!\!\int u_t^\theta(x)^{T} u_t^{\text{target}}(x|z)\, p_t(x|z)\, p_{\text{data}}(z)\, dz\, dx\, dt \\ &\overset{(iv)}{=} \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)} \!\left[u_t^\theta(x)^{T} u_t^{\text{target}}(x|z)\right] \end{split} \]
So, we can get that: \[ \begin{split} \mathcal{L}_{FM} & = \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)\|^2\right] -2\,\mathbb{E}_{t \sim \text{Unif},\, x \sim p_t(\cdot|z)}\left[u^\theta_t(x_t)^{T} u^{\rm target}_t(x_t)\right] + C_{1} \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)\|^2\right] - 2 \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)} \left[u_t^\theta(x)^{T} u_t^{\text{target}}(x|z)\right] + C_{1}\\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)\|^2 - 2u_t^\theta(x)^{T} u_t^{\text{target}}(x|z)\right] + C_{1} \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)\|^2 - 2u_t^\theta(x)^{T} u_t^{\text{target}}(x|z) + \|u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z})\|^{2} - \|u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) \|^{2} \right] + C_{1} \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)-u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z})\|^2 - \|u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) \|^{2} \right] + C_{1} \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)-u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z})\|^2 \right] \underbrace{ -\mathbb{E} \left[\|u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z}) \|^{2} \right] }_{ C_{2} } + C_{1} \\ &= \mathcal{L}_{CFM}(\theta) + C_{2} + C_{1} \\ \end{split} \]
As we can see, \(\mathcal{L}_{CFM}\) equals \(\mathcal{L}_{FM}\) up to constants, so minimizing \(\mathcal{L}_{CFM}\) also yields the minimizer of \(\mathcal{L}_{FM}\). \[ \mathcal{L}_{CFM} = \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\left[\|u^\theta_t(x_t)-u_{t}^{\text{target}}(\mathrm{x}_{t} | \mathrm{z})\|^2 \right] \]
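As a concrete (and common, though not the only) choice, take the linear conditional path \(x_t = (1-t)\,x_0 + t\,z\) with \(x_0 \sim \mathcal{N}(0, \mathbf{I})\), whose conditional target velocity is \(u_t^{\text{target}}(x_t \mid z) = z - x_0\). A training-step sketch under that assumption (`model` maps \((x_t, t)\) to a velocity; names are ours):

```python
import torch

def cfm_train_step(model, z, opt):
    # One L_CFM step for the linear path x_t = (1 - t) x0 + t z,
    # whose conditional target velocity is z - x0 (one common path choice).
    x0 = torch.randn_like(z)                       # sample from p_init
    t = torch.rand(z.shape[0], *[1] * (z.dim() - 1))
    x_t = (1 - t) * x0 + t * z
    target = z - x0                                # u_t^target(x_t | z)
    loss = ((model(x_t, t) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```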
From there, we get Flow Matching: training reduces to a simple regression problem on the vector field. Now, let’s see how to derive score matching from the SDE.
6.3 Conditional Score Function & Marginal Score Function






6.4 Mean Flow
Mean Flows for One-step Generative Modeling
MMDiT
7 Model Architecture
7.1 U-Net
U-Net: Convolutional Networks for Biomedical Image Segmentation

7.2 ControlNet
Adding Conditional Control to Text-to-Image Diffusion Models

7.3 Diffusion Transformer (DiT)

8 Applications
8.1 Text-to-Image Generation
8.1.1 Imagen
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
8.1.2 DALL·E
8.1.3 Stable Diffusion

8.2 Text-to-Video Generation
8.2.1 Meta Movie Gen Video
Movie Gen: A Cast of Media Foundation Models
8.2.2 Veo
8.3 Language Modeling
8.4 Diffusion Policy
- Rectified Flow: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (https://arxiv.org/pdf/2209.03003)
- Mean Flow: Mean Flows for One-step Generative Modeling (https://arxiv.org/pdf/2505.13447)
9 Learning Resources
There are many good learning resources available online; thanks to everyone who has made this content openly available:
Lectures:
- MIT 6.S183: A Practical Introduction to Diffusion Models
- MIT 6.S184: Generative AI with Stochastic Differential Equations: focuses on diffusion models through the lens of SDEs; the ODE/SDE part of this article is based on this lecture
- KAIST CS492(D): Diffusion Models and Their Applications: a more comprehensive introduction to diffusion models
- Stanford CS236: Deep Generative Models: introduces different generative models, from VAEs to GANs and DDPM
Blogs: