Generative Modeling by Estimating Gradients of the Data Distribution
1. Setup
We have samples from an unknown data distribution \(p_{data}(x)\), and our objective is to generate new samples from that distribution.
We train a score network \(s_\theta\) that approximates the score \(\nabla_x \log p(x)\), so that we can later use that score function to sample from the original distribution.
Such a score network is trained by score matching, and sampling from it is done by Langevin dynamics.
2. Score matching
Score matching learns statistical models from i.i.d. samples.
The objective is to find the minimizer of
\begin{align*} \frac 1 2 \mathbb E_{p(x)} \left[ || s_\theta(x) - \nabla_x \log p(x) ||^2_2 \right] \end{align*}It turns out that this is the same as finding the minimizer of the following, which avoids the unknown \(\nabla_x \log p(x)\):
\begin{align*} \mathbb E_{p(x)} \left[ tr(\nabla_x s_\theta(x)) + \frac 1 2 || s_\theta(x) ||^2_2 \right] \end{align*}This is still not scalable because of the trace term, so there are other equivalent formulations derived from the above, such as:
- Denoising score matching
- Sliced score matching
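To make the trace-based objective concrete, here is a toy sketch (my own illustration, not from the paper): for 1-D Gaussian data and a linear score model \(s_a(x) = -ax\), the trace term is just the constant \(-a\), and minimizing the objective recovers \(a \approx 1/\mathrm{Var}(x)\), i.e. the true score \(-x/\mathrm{Var}(x)\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)  # data ~ N(0, 4), true score is -x/4

def objective(a, x):
    # E[ tr(grad s_a) + 0.5 * ||s_a(x)||^2 ]  with  s_a(x) = -a x
    # For this linear model the Jacobian trace is the constant -a.
    return -a + 0.5 * a**2 * np.mean(x**2)

grid = np.linspace(0.0, 1.0, 401)
a_hat = grid[np.argmin([objective(a, x) for a in grid])]
print(a_hat)  # close to 1/Var(x) = 0.25
```

For higher-dimensional models the Jacobian trace has no such closed-form shortcut, which is exactly the scalability problem the two variants above address.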
3. Denoising score matching
We perturb the original data \(x\) to \(\tilde x\) with a perturbation distribution \(q(\tilde x|x) = \mathcal N(\tilde x \,|\, x, \sigma^2 I)\). Then the previous objective is equivalent to:
\begin{align*} \frac 1 2 \mathbb E_{p(x)} \mathbb E_{q(\tilde x | x)} \left[ ||s_\theta(\tilde x) - \nabla_{\tilde x} \log q(\tilde x | x) ||_2^2 \right] \end{align*}This objective uses only the score of the perturbation distribution, for which we have an analytical expression:
\begin{align*} \nabla_{\tilde x} \log q(\tilde x | x) = - \frac {\tilde x - x} {\sigma^2} \end{align*}
4. Langevin dynamics
Langevin dynamics produces samples from a probability density by only using the score function \(\nabla_x \log p(x)\).
We start at a point and take noisy steps in the direction pointed to by the score function. The distribution of samples obtained after \(T\) steps matches the original distribution (as \(\epsilon \to 0\) and \(T \to \infty\)).
\begin{align*} x_t = x_{t-1} + \frac \epsilon 2 \nabla_x \log p(x_{t-1}) + \sqrt \epsilon z_t \end{align*}where the noise \(z_t \sim \mathcal N(0,I)\).
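A minimal sketch of this sampler (a toy example of mine, with the score of \(\mathcal N(2, 1)\) written analytically in place of a learned network):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # analytic score of N(2, 1): grad_x log p(x) = -(x - 2)
    return -(x - 2.0)

eps, T = 0.05, 2_000
x = rng.uniform(-10.0, 10.0, size=5_000)  # arbitrary initialization
for _ in range(T):
    z = rng.standard_normal(x.shape)
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * z

print(x.mean(), x.std())  # approximately 2 and 1
```

With a learned \(s_\theta\) substituted for the analytic score, the same loop becomes the generative sampler.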
5. Challenges
Low data density regions
In those regions there are not enough samples to estimate the score accurately. This causes two issues:
- Inaccurate score estimates
- Slow mixing of Langevin dynamics
If the underlying data lies on a low-dimensional manifold embedded in the ambient space (of \(x\)), then the score is undefined at some places in \(x\). In other words, when the support of \(x\) is not the whole space, the score is undefined at some places.
The score matching objective then cannot give a consistent score estimator at those places.
6. Solution - NCSN (Noise Conditioned Score Network)
Perturbing the data distribution makes score-based generative modeling more robust.
We add Gaussian noise of different levels \(\sigma\) to the original data and train a score estimator \(s_\theta(x, \sigma)\) conditioned on the noise level. For large \(\sigma\), this provides some benefits:
- Full support (i.e. \(x\) covers the whole space now)
- Low data density regions are filled due to large \(\sigma\)
Take \(L\) noise levels \(\{\sigma_1, \sigma_2, \ldots, \sigma_L\}\) in geometric progression, such that \(\sigma_1\) is large enough to mitigate the challenges of low data density and \(\sigma_L\) is small enough to have little effect on the original data.
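A geometric progression keeps the ratio \(\sigma_{i+1}/\sigma_i\) constant; with numpy this is `np.geomspace` (the endpoints below are illustrative, not the paper's values):

```python
import numpy as np

L = 10
sigma_1, sigma_L = 10.0, 0.1                  # illustrative endpoints
sigmas = np.geomspace(sigma_1, sigma_L, num=L)  # geometric progression

ratios = sigmas[1:] / sigmas[:-1]
print(sigmas.round(3))
print(np.allclose(ratios, ratios[0]))  # constant ratio between levels
```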
7. Learning NCSNs
Learning can be done by both denoising score matching and sliced score matching. Let's look at denoising score matching.
The noise distribution is \(q_\sigma(\tilde x | x) = \mathcal N(\tilde x | x, \sigma^2 I)\) so the score is
\begin{align*} \nabla_{\tilde x} \log q_\sigma(\tilde x | x) = - \frac {\tilde x - x} {\sigma^2} \end{align*}And thus the objective function for a given \(\sigma\) level is:
\begin{align*} l(\theta, \sigma) = \frac 1 2 \mathbb E_{p(x)} \mathbb E_{q_\sigma(\tilde x | x)} \left [ || s_\theta(\tilde x, \sigma) - \nabla_{\tilde x} \log q_\sigma(\tilde x | x) ||_2^2 \right ] \end{align*}In effect, this objective trains the score network to estimate the negative of the noise added to the original data, scaled by \(1/\sigma^2\).
And the final objective combines all the per-level losses:
\begin{align*} \mathcal L (\theta) = \frac 1 L \sum_{i=1}^L \lambda_i l(\theta, \sigma_i) \end{align*}we take \(\lambda_i = \sigma_i^2\), so that \(\lambda_i\, l(\theta, \sigma_i) = \frac 1 2 \mathbb E \left[ || \sigma_i s_\theta(\tilde x, \sigma_i) + \frac{\tilde x - x}{\sigma_i} ||_2^2 \right]\) has roughly the same magnitude at every noise level.
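As a sanity check on the objective (a toy 1-D example of mine, not from the paper): for data \(x \sim \mathcal N(0, s^2)\), the perturbed marginal is \(\mathcal N(0, s^2 + \sigma^2)\), so the optimal score is \(-\tilde x/(s^2+\sigma^2)\). Minimizing the denoising loss over linear models \(s_a(\tilde x) = a\tilde x\) recovers that coefficient, even though the loss only ever sees the conditional target \(-(\tilde x - x)/\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma, n = 1.0, 1.0, 200_000
x = rng.normal(0.0, s, size=n)                # clean data ~ N(0, s^2)
x_tilde = x + sigma * rng.standard_normal(n)  # perturbed data

target = -(x_tilde - x) / sigma**2            # grad log q_sigma(x_tilde | x)

def dsm_loss(a):
    # l(a, sigma) for the linear score model s_a(x_tilde) = a * x_tilde
    return 0.5 * np.mean((a * x_tilde - target) ** 2)

grid = np.linspace(-1.0, 0.0, 401)
a_hat = grid[np.argmin([dsm_loss(a) for a in grid])]
print(a_hat)  # close to -1/(s^2 + sigma^2) = -0.5
```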
8. Inference by Annealed Langevin dynamics
Start with \(\sigma_1\), the largest value of \(\sigma\), and sample using \(T\) timesteps with step size \(\alpha_1\); then in the next iteration start from the previously obtained sample and sample with the decreased \(\sigma\). For the \(t\)-th timestep of the \(i\)-th iteration, the update is:
\begin{align*} \tilde x_t^{(i)} \leftarrow \tilde x_{t-1}^{(i)} + \frac {\alpha_i} 2 s_\theta(\tilde x_{t-1}^{(i)}, \sigma_i) + \sqrt {\alpha_i} z_t \end{align*}The step size \(\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2\) is chosen so that the ratio of the signal \(\frac {\alpha_i} 2 s_\theta(\tilde x_{t-1}^{(i)}, \sigma_i)\) to the noise \(\sqrt {\alpha_i} z_t\) is independent of \(\sigma_i\). (The maths is in pg 7.)
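The full procedure can be sketched in a few lines (my own toy setup: the data is \(\mathcal N(0,1)\), so the score of each \(\sigma_i\)-perturbed marginal, \(-\tilde x/(1+\sigma_i^2)\), is known in closed form and stands in for \(s_\theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x, sigma):
    # closed-form score of the sigma-perturbed marginal of N(0, 1):
    # q_sigma = N(0, 1 + sigma^2), so grad log q_sigma(x) = -x / (1 + sigma^2)
    return -x / (1.0 + sigma**2)

sigmas = np.geomspace(10.0, 0.1, num=10)    # sigma_1 > ... > sigma_L
eps, T = 1e-3, 200
x = rng.uniform(-10.0, 10.0, size=2_000)    # arbitrary initialization

for sigma in sigmas:                         # anneal from large to small noise
    alpha = eps * sigma**2 / sigmas[-1]**2   # step size proportional to sigma_i^2
    for _ in range(T):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z

print(x.mean(), x.std())  # approximately 0 and 1
```

The samples end up distributed approximately as the \(\sigma_L\)-perturbed data, which is close to the original distribution when \(\sigma_L\) is small.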