Denoising Diffusion Probabilistic Models

Description: DDPM: high-quality image synthesis results using diffusion probabilistic models

Related:

Main Contribution:

  1. Introduction to the Diffusion model
  2. We present high quality image synthesis results using diffusion probabilistic models
  3. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics
  4. We show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling

DDPM


A diffusion probabilistic model is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time. Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until the signal is destroyed. When the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization.

Formula Derivation

Diffusion models are latent variable models of the form $p_{\theta}(x_0) \doteq \int p_{\theta}(x_{0:T})dx_{1:T}$, where $x_1,…,x_T$ are latents of the same dimensionality as the data $x_0 \sim q(x_0)$. The joint distribution $p_{\theta}(x_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(x_T)=\mathcal{N}(x_T;0, I)$:

\[p_{\theta}(x_{0:T}) \doteq p(x_T) \prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t) \\ p_{\theta}(x_{t-1}|x_t) \doteq \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t,t), \Sigma_{\theta}(x_t,t))\]

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(x_{1:T}|x_0)$ called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, …,\beta_T$:

\[q(x_{1:T}|x_0) \doteq \prod_{t=1}^{T}q(x_t|x_{t-1}) \\ q(x_t|x_{t-1}) \doteq \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)\]

Training is performed by optimizing the usual variational bound on negative log likelihood:

\[E[-\log p_{\theta}(x_0)] \leq E_q[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}] \\ = E_q[-\log p(x_T) - \sum_{t\geq 1} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}] \doteq L\]

To see where this bound comes from, recall the corresponding identity for the VAE:

\[\begin{align*} &D_{KL}(q_{\phi}(z|x^{(i)}) \Vert p_{\theta}(z|x^{(i)})) + \mathcal{L}(\theta,\phi;x^{(i)}) \\ &= \int q_{\phi}(z|x^{(i)})\log \frac{q_{\phi}(z|x^{(i)})}{p_{\theta}(z|x^{(i)})}dz + E_{q_{\phi}}[-\log q_{\phi}(z|x^{(i)})+ \log p_{\theta}(z,x^{(i)})] \\ &=E_{q_{\phi}}[\log q_{\phi}(z|x^{(i)})-\log p_{\theta}(z|x^{(i)})-\log q_{\phi}(z|x^{(i)})+\log p_{\theta}(z,x^{(i)})] \\ &=E_{q_{\phi}}[\log p_{\theta}(x^{(i)})] = \log p_{\theta}(x^{(i)}) \end{align*}\]

Since the KL divergence is non-negative, $\log p_{\theta}(x^{(i)}) \geq \mathcal{L}(\theta, \phi; x^{(i)})$, where $\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}(q_{\phi}(z|x^{(i)})\Vert p(z)) + E_{q_{\phi}}[\log p_{\theta}(x^{(i)}|z)]$

Then

\[\begin{align*} -\log p_{\theta}(x_0) &\leq -\log p_{\theta}(x_0) + D_{KL}(q(x_{1:T}|x_0) \Vert p_{\theta}(x_{1:T}|x_0)) \\ &= -\log p_{\theta}(x_0) + E_{q(x_{1:T}|x_0)}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{1:T}|x_0)}] \\ &= -\log p_{\theta}(x_0) + E_{q(x_{1:T}|x_0)}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})/p_{\theta}(x_0)}] \\ &= -\log p_{\theta}(x_0) + E_{q(x_{1:T}|x_0)}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}+\log p_{\theta}(x_0)] \\ &=E_{q(x_{1:T}|x_0)}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}] \\ E_{q(x_0)}[-\log p_{\theta}(x_0)] &\leq E_{q(x_0)}\{E_{q(x_{1:T}|x_0)}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}]\}=E_{q(x_{0:T})}[\log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}] \end{align*}\]

The forward process variances $\beta_t$ can be learned by reparameterization or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in $p_{\theta}(x_{t-1}|x_t)$, because both processes have the same functional form when the $\beta_t$ are small. A notable property of the forward process is that it admits sampling $x_t$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_t \doteq 1-\beta_t$ and $\bar \alpha_t \doteq \prod_{s=1}^t \alpha_s$, we have

\[q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar \alpha_t}x_0, (1-\bar \alpha_t)I)\]
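As a quick sanity check of this closed form, the sketch below composes the one-step transitions $q(x_s|x_{s-1})$ for a scalar $x_0$ and compares the resulting mean coefficient and variance with $\sqrt{\bar \alpha_t}$ and $1-\bar \alpha_t$. The function names and the four-step schedule are illustrative assumptions, not part of the original post.

```python
import math

def closed_form_coeffs(betas, t):
    """Return (sqrt(abar_t), 1 - abar_t) for q(x_t | x_0), with 1-indexed t."""
    abar = 1.0
    for beta in betas[:t]:
        abar *= 1.0 - beta
    return math.sqrt(abar), 1.0 - abar

def iterated_coeffs(betas, t):
    """Compose t one-step transitions q(x_s | x_{s-1}) starting from a fixed x_0."""
    mean_coeff, var = 1.0, 0.0  # x_0 is known exactly: coefficient 1, variance 0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        mean_coeff *= math.sqrt(alpha)  # the mean is scaled by sqrt(alpha_s)
        var = alpha * var + beta        # old variance scaled by alpha_s, plus beta_s
    return mean_coeff, var

# Illustrative schedule (an assumption for this example only).
betas = [0.01, 0.02, 0.05, 0.1]
closed = closed_form_coeffs(betas, 4)
iterated = iterated_coeffs(betas, 4)
print(closed, iterated)
```

Both pairs agree up to floating-point error, which is why training can jump directly to any timestep $t$ without simulating the chain.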

Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ as:

\[E_q[\underbrace{D_{KL}(q(x_T|x_0)\Vert p(x_T))}_{L_{T}}+\sum_{t>1}\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\Vert p_{\theta}(x_{t-1}|x_t))}_{L_{t-1}} - \underbrace{\log p_{\theta}(x_0|x_1)}_{L_0}]\]

Since

\[\begin{align*} L &= E_q[-\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}] = E_q[-\log p(x_T) - \sum_{t\geq 1}\log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}] \\ &= E_q[-\log p(x_T)-\sum_{t>1}\log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})}- \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}] \\ &\left(\frac{1}{q(x_{t-1}|x_t,x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_{t}|x_0)} = \frac{q(x_{t},x_0)}{q(x_{t},x_{t-1},x_0)} \cdot \frac{q(x_{t-1},x_0)}{q(x_0)} \cdot \frac{q(x_0)}{q(x_{t},x_0)} = \frac{q(x_{t-1},x_0)}{q(x_{t},x_{t-1},x_0)} =\frac{1}{q(x_{t}|x_{t-1},x_0)} = \frac{1}{q(x_{t}|x_{t-1})}\right) \\ &=E_q[-\log p(x_T)-\sum_{t>1}\log \left(\frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_{t}|x_0)}\right)- \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}] \\ &= E_q[-\log \frac{p(x_T)}{q(x_T|x_0)}-\sum_{t>1}\log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} - \log p_{\theta}(x_0|x_1)] \\ &=E_q[\underbrace{D_{KL}(q(x_T|x_0)\Vert p(x_T))}_{L_{T}}+\sum_{t>1}\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\Vert p_{\theta}(x_{t-1}|x_t))}_{L_{t-1}} - \underbrace{\log p_{\theta}(x_0|x_1)}_{L_0}] \end{align*}\]
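The step that eliminates the ratio $\frac{q(x_{t-1}|x_0)}{q(x_{t}|x_0)}$ is a telescoping sum. Written out as a worked step, using only the quantities already defined above:

\[\sum_{t=2}^{T}\log \frac{q(x_{t-1}|x_0)}{q(x_{t}|x_0)} = \log q(x_1|x_0) - \log q(x_T|x_0)\]

With the minus sign in front of the sum, the $-\log q(x_1|x_0)$ cancels the $+\log q(x_1|x_0)$ coming from $-\log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)}$, and the remaining $+\log q(x_T|x_0)$ combines with $-\log p(x_T)$ into $-\log \frac{p(x_T)}{q(x_T|x_0)}$.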

This decomposition uses KL divergence to directly compare $p_{\theta}(x_{t-1}|x_t)$ against forward process posteriors, which are tractable when conditioned on $x_0$:

\[q(x_{t-1}|x_t,x_0) = \mathcal{N}(x_{t-1}; \tilde\mu_t(x_t,x_0), \tilde \beta_t I)\]

where $\tilde\mu_t(x_t,x_0) \doteq \frac{\sqrt{\bar \alpha_{t-1}} \beta_t}{1-\bar \alpha_{t}}x_0 + \frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})}{1-\bar \alpha_{t}}x_t$ and $\tilde \beta_t \doteq \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_{t}}\beta_t$. This follows by completing the square in $x_{t-1}$ (written per coordinate):

\[\begin{align*} q(x_{t-1}|x_t,x_0)&= \frac{q(x_t,x_{t-1},x_0)}{q(x_t,x_0)} = \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)}\\ &=\frac{\frac{1}{\sqrt{2\pi \beta_t}}\exp\{-\frac{(x_t-\sqrt{1-\beta_t} x_{t-1})^2}{2\beta_t}\} \frac{1}{\sqrt{2\pi (1-\bar \alpha_{t-1})}}\exp\{-\frac{(x_{t-1}-\sqrt{\bar \alpha_{t-1}} x_{0})^2}{2(1-\bar \alpha_{t-1})}\}}{\frac{1}{\sqrt{2\pi (1-\bar \alpha_{t})}}\exp\{-\frac{(x_t-\sqrt{\bar \alpha_{t}} x_{0})^2}{2(1-\bar \alpha_{t})}\}} \\ &=\frac{1}{\sqrt{2\pi \frac{\beta_t(1-\bar \alpha_{t-1})}{(1-\bar \alpha_{t})}}} \cdot \exp \{-\frac{(x_t-\sqrt{1-\beta_t} x_{t-1})^2}{2\beta_t}-\frac{(x_{t-1}-\sqrt{\bar \alpha_{t-1}} x_{0})^2}{2(1-\bar \alpha_{t-1})}+ \frac{(x_t-\sqrt{\bar \alpha_{t}} x_{0})^2}{2(1-\bar \alpha_{t})}\} \\ &=\frac{1}{\sqrt{2\pi \frac{\beta_t(1-\bar \alpha_{t-1})}{(1-\bar \alpha_{t})}}}\cdot \exp\{-\frac{1}{2}[(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar \alpha_{t-1}})x_{t-1}^{2}-(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\bar\alpha_{t-1}}}{1-\bar \alpha_{t-1}}x_0)x_{t-1}+C(x_t,x_0)] \} \end{align*}\]

Matching this against a Gaussian with mean $\mu$ and variance $\sigma^2$ gives $\frac{1}{\tilde \beta_t} = \frac{1}{\sigma^2}= \frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar \alpha_{t-1}}$ and $\frac{\mu}{\sigma^2}=\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar \alpha_{t-1}}x_0$, so that:

\[\tilde \mu_t(x_t,x_0)\doteq \frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1-\bar \alpha_t}x_0+\frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}x_t, \ \tilde \beta_t \doteq \frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_t}\beta_t\]
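The posterior parameters can be checked numerically. The sketch below computes $\tilde \mu_t$ and $\tilde \beta_t$ for scalar inputs in two ways: from the closed-form expressions, and from the precision and linear coefficient obtained by completing the square. The function names, schedule, and input values are assumptions made for illustration.

```python
import math

def posterior_params(betas, t, x_t, x_0):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0); 1-indexed t >= 2."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_prev = math.prod(1.0 - b for b in betas[: t - 1])  # abar_{t-1}
    abar_t = abar_prev * alpha_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0 * x_0 + coef_xt * x_t, var

def posterior_by_completing_square(betas, t, x_t, x_0):
    """Same quantities from 1/sigma^2 and mu/sigma^2 of the quadratic in x_{t-1}."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_prev = math.prod(1.0 - b for b in betas[: t - 1])
    precision = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
    linear = (math.sqrt(alpha_t) / beta_t * x_t
              + math.sqrt(abar_prev) / (1.0 - abar_prev) * x_0)
    return linear / precision, 1.0 / precision

# Illustrative values (assumptions for this example only).
betas = [0.01, 0.02, 0.05, 0.1]
direct = posterior_params(betas, 3, x_t=0.3, x_0=-0.7)
squared = posterior_by_completing_square(betas, 3, x_t=0.3, x_0=-0.7)
print(direct, squared)
```

The restriction to $t \geq 2$ is only there because $1-\bar \alpha_0 = 0$ makes the precision form undefined at $t=1$.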

Consequently, all KL divergences in this decomposition are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed-form expressions instead of high-variance Monte Carlo estimates.

Diffusion models and Denoising autoencoders

Forward process and $L_T$

We ignore the fact that the forward process variances $\beta_t$ are learnable by reparameterization and instead fix them to constants. Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_T$ is constant during training and can be ignored.

Reverse process and $L_{1:T-1}$

Now we discuss our choices in $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1}; \mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)), 1<t \leq T$.

First, we set $\Sigma_{\theta}(x_t,t) = \sigma_t^2I$ to untrained time-dependent constants. Experimentally, both $\sigma_t^2=\beta_t$ and $\sigma_t^2=\tilde \beta_t=\frac{1-\bar \alpha_{t-1}}{1-\bar \alpha_t}\beta_t$ had similar results. The first choice is optimal for $x_0\sim\mathcal{N}(0,I)$, and the second is optimal for $x_0$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.

Second, to represent the mean $\mu_{\theta}(x_t,t)$, we propose a specific parameterization motivated by the following analysis of $L_t$. With $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t,t), \sigma_t^2I)$, we can write:

\[L_{t-1}=E_q[\frac{1}{2\sigma_t^2}\Vert\tilde \mu_t(x_t,x_0)-\mu_{\theta}(x_t,t) \Vert^2]+C\]

where $C$ is a constant that does not depend on $\theta$.

This follows from the KL divergence between two $n$-dimensional Gaussians $p(x)=\mathcal{N}(\mu_p,\Sigma_p)$ and $q(x)=\mathcal{N}(\mu_q,\Sigma_q)$:

\[\begin{align*} KL(p(x)\Vert q(x))&=E_{x\sim p(x)}[\log \frac{p(x)}{q(x)}] = E_{x\sim p(x)}[\log p(x)]+E_{x\sim p(x)}[-\log q(x)] \\ E_{x\sim p(x)}[-\log q(x)]&=E_{x\sim p(x)}[\frac{n}{2}\log 2\pi + \frac{1}{2}\log\det(\Sigma_q)+\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)] \\ &=\frac{n}{2}\log 2\pi + \frac{1}{2}\log\det(\Sigma_q)+E_{x\sim p(x)}[\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)] \\ E_{x\sim p(x)}[(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)] &=E_{x\sim p(x)}\{tr[(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)]\}=E_{x\sim p(x)}\{tr[\Sigma_q^{-1}(x-\mu_q)(x-\mu_q)^T]\} \\ &=tr\{\Sigma_q^{-1}E_{x\sim p(x)}[(x-\mu_q)(x-\mu_q)^T]\} = tr\{\Sigma_q^{-1}E_{x\sim p(x)}[xx^T-\mu_qx^T-x\mu_q^T+\mu_q\mu_q^T]\} \\ &=tr\{\Sigma_q^{-1}(\Sigma_p+\mu_p\mu_p^T-\mu_q\mu_p^T-\mu_p\mu_q^T+\mu_q\mu_q^T)\} \\ &=tr\{\Sigma_q^{-1}\Sigma_p+\Sigma_q^{-1}(\mu_p-\mu_q)(\mu_p-\mu_q)^T\} \\ &=tr(\Sigma_q^{-1}\Sigma_p)+(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q) \end{align*}\]

When $\mu_q=\mu_p, \Sigma_q=\Sigma_p$, we have $E_{x\sim p(x)}[(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)]=n$, which gives the entropy term $E_{x\sim p(x)}[-\log p(x)]=\frac{1}{2}[n\log 2\pi + \log\det(\Sigma_p)+n]$. Therefore:

\[\begin{align*} KL(p(x)\Vert q(x)) &= \frac{1}{2}[n\log 2\pi + \log \det(\Sigma_q)+tr(\Sigma_q^{-1}\Sigma_p)+(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)]-\frac{1}{2}[n\log 2\pi + \log\det(\Sigma_p)+n] \\ &=\frac{1}{2}[(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)-\log \det(\Sigma_q^{-1}\Sigma_p)+tr(\Sigma_q^{-1}\Sigma_p)-n] \end{align*}\]

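Taking the first Gaussian to be $q(x_{t-1}|x_t,x_0)$ (covariance $\tilde \beta_t I$) and the second to be $p_{\theta}(x_{t-1}|x_t)$ (covariance $\sigma_t^2 I$), only the mean term depends on $\theta$, so we get:

\[L_{t-1}=D_{KL}(q(x_{t-1}|x_t,x_0)\Vert p_{\theta}(x_{t-1}|x_t)) = E_q[\frac{1}{2\sigma_t^2}\Vert \tilde \mu_t(x_t,x_0) - \mu_{\theta}(x_t,t) \Vert^2]+C\]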
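To make the role of the constant $C$ concrete, here is a minimal sketch of the Gaussian KL formula for diagonal covariances. It checks that, with both variances held fixed, changing the second mean changes the KL by exactly the quadratic term $\frac{1}{2\sigma_t^2}\Vert \tilde \mu_t - \mu_{\theta} \Vert^2$. The function name and all numeric values are illustrative assumptions.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances, given as lists."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per coordinate: mean term - log det term + trace term - 1
        total += 0.5 * ((mp - mq) ** 2 / vq - math.log(vp / vq) + vp / vq - 1.0)
    return total

# Illustrative values (assumptions): posterior variance 0.03, model variance 0.05.
mu_tilde, var_tilde = [0.2, -0.1], [0.03, 0.03]
mu_theta, sigma2 = [0.0, 0.0], [0.05, 0.05]

kl_with_gap = kl_diag_gaussians(mu_tilde, var_tilde, mu_theta, sigma2)
kl_constant = kl_diag_gaussians(mu_tilde, var_tilde, mu_tilde, sigma2)  # plays the role of C
quadratic = sum((a - b) ** 2 for a, b in zip(mu_tilde, mu_theta)) / (2.0 * 0.05)
print(kl_with_gap - kl_constant, quadratic)
```

The difference between the two KL values equals the quadratic term, so minimizing $L_{t-1}$ over $\theta$ only involves the distance between the two means.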

So we see that the most straightforward parameterization of $\mu_{\theta}$ is a model that predicts $\tilde \mu_t$, the forward process posterior mean. However, we can expand this expression further by reparameterizing $q(x_t|x_0)$ as $x_t(x_0,\epsilon) = \sqrt{\bar \alpha_t}x_0+\sqrt{1-\bar \alpha_t}\epsilon,\ \epsilon \sim \mathcal{N}(0,I)$ and applying the forward process posterior formula:

\[\begin{align*} x_t(x_0,\epsilon) &= \sqrt{\bar \alpha_t}x_0+\sqrt{1-\bar \alpha_t}\epsilon,\ \epsilon \sim \mathcal{N}(0,I) \Leftrightarrow x_0=\frac{1}{\sqrt{\bar \alpha_t}}(x_t(x_0,\epsilon)-\sqrt{1-\bar\alpha_t}\epsilon)\\ L_{t-1}-C &= E_{x_0,\epsilon}[\frac{1}{2\sigma_t^2}\Vert \tilde \mu_t(x_t,\frac{1}{\sqrt{\bar \alpha_t}}(x_t(x_0,\epsilon)-\sqrt{1-\bar\alpha_t}\epsilon)) - \mu_{\theta}(x_t(x_0,\epsilon),t) \Vert^2] \\ \tilde\mu_t(x_t,x_0) &= \frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1-\bar \alpha_t}x_0+\frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}x_t = \frac{\sqrt{\alpha_{t}}(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}x_t+\frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1-\bar \alpha_t}\frac{1}{\sqrt{\bar \alpha_t}}(x_t(x_0,\epsilon)-\sqrt{1-\bar\alpha_t}\epsilon) \\ &=\frac{1}{1-\bar \alpha_t}[\sqrt{\alpha_t}(1-\bar \alpha_{t-1})+\frac{1}{\sqrt{\alpha_t}}(1-\alpha_{t})]x_t - \frac{1}{\sqrt{\alpha_{t}}}\frac{\beta_t}{\sqrt{1-\bar\alpha_{t}}}\epsilon \\ &=\frac{1}{\sqrt{\alpha_t}}\{[\frac{\alpha_t(1-\bar\alpha_{t-1})}{1-\bar\alpha_{t}}+\frac{1-\alpha_{t}}{1-\bar\alpha_{t}}]x_t - \frac{\beta_t}{\sqrt{1-\bar \alpha_t}}\epsilon\} \\ &=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon) \\ L_{t-1}-C &= E_{x_0,\epsilon}[\frac{1}{2\sigma_t^2}\Vert \tilde \mu_t(x_t,\frac{1}{\sqrt{\bar \alpha_t}}(x_t(x_0,\epsilon)-\sqrt{1-\bar\alpha_t}\epsilon)) - \mu_{\theta}(x_t(x_0,\epsilon),t) \Vert^2] \\ &=E_{x_0,\epsilon}[\frac{1}{2\sigma_t^2}\Vert \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon) - \mu_{\theta}(x_t(x_0,\epsilon),t) \Vert^2] \end{align*}\]

This expression reveals that $\mu_{\theta}$ must predict $\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar \alpha_t}}\epsilon)$ given $x_t$. Since $x_t$ is available as input to the model, we may choose the parameterization

\[\mu_{\theta}(x_t,t)=\tilde \mu_t(x_t,\frac{1}{\sqrt{\bar \alpha_t}}(x_t-\sqrt{1-\bar\alpha_t}\epsilon_{\theta}(x_t,t))) = \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar \alpha_t}}\epsilon_{\theta}(x_t,t))\]

where $\epsilon_{\theta}$ is a function approximator intended to predict $\epsilon$ from $x_t$. To sample $x_{t-1} \sim p_{\theta}(x_{t-1}|x_t)$ is to compute $x_{t-1}=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar \alpha_t}}\epsilon_{\theta}(x_t,t))+\sigma_t z,\ z \sim \mathcal{N}(0,I)$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with $\epsilon_{\theta}$ as a learned gradient of the data density.
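The equivalence between the $x_0$ form and the $\epsilon$ form of the posterior mean can be confirmed numerically. In this sketch, `mu_from_x0` and `mu_from_eps` are hypothetical helper names and the scalar values are arbitrary; the only ingredient taken from the text is the pair of formulas being compared.

```python
import math

def mu_from_x0(x_t, x_0, alpha_t, abar_prev):
    """Posterior mean written in terms of x_0 (abar_prev is abar_{t-1})."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    return (math.sqrt(abar_prev) * beta_t * x_0
            + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)

def mu_from_eps(x_t, eps, alpha_t, abar_prev):
    """Posterior mean written in terms of the noise eps that produced x_t."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

# Arbitrary illustrative values (assumptions for this example only).
alpha_t, abar_prev = 0.95, 0.6
x_0, eps = 0.4, -1.3
abar_t = alpha_t * abar_prev
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps  # forward reparameterization

mean_a = mu_from_x0(x_t, x_0, alpha_t, abar_prev)
mean_b = mu_from_eps(x_t, eps, alpha_t, abar_prev)
print(mean_a, mean_b)
```

Because the two forms agree whenever $x_t$ is generated from $(x_0, \epsilon)$, a network that predicts $\epsilon$ determines the same mean as one that predicts $x_0$ or $\tilde \mu_t$ directly.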

Algorithm
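The sampling update described above can be sketched as a short loop. This is a minimal illustration in plain Python on lists, not the authors' implementation: `eps_model` is a hypothetical callable standing in for the trained network, $\sigma_t^2=\beta_t$ is used (one of the two choices discussed earlier), and skipping the noise on the final step ($t=1$) is an assumption of this sketch.

```python
import math
import random

def sample(eps_model, betas, dim, rng):
    """Ancestral sampling sketch: x_T ~ N(0, I), then x_{t-1} from x_t for t = T..1.

    eps_model(x, t) is a hypothetical callable returning a list as long as x.
    """
    alphas = [1.0 - b for b in betas]
    abars, running = [], 1.0
    for a in alphas:
        running *= a
        abars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T from the standard normal prior
    for t in range(len(betas), 0, -1):
        beta_t, alpha_t, abar_t = betas[t - 1], alphas[t - 1], abars[t - 1]
        eps_hat = eps_model(x, t)
        sigma_t = math.sqrt(beta_t)  # sigma_t^2 = beta_t
        x = [
            (xi - beta_t / math.sqrt(1.0 - abar_t) * ei) / math.sqrt(alpha_t)
            + (sigma_t * rng.gauss(0.0, 1.0) if t > 1 else 0.0)  # assumption: no noise at t = 1
            for xi, ei in zip(x, eps_hat)
        ]
    return x

def zero_eps_model(x, t):
    """Placeholder model that always predicts zero noise (for illustration only)."""
    return [0.0] * len(x)

x0 = sample(zero_eps_model, [0.01, 0.02, 0.05], dim=4, rng=random.Random(0))
print(x0)
```

With a real model, each iteration costs one network evaluation, so sampling needs $T$ evaluations in total.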

Furthermore, with this parameterization, the expression for $L_{t-1}-C$ simplifies to:

\[E_{x_0,\epsilon}[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar \alpha_t)}\Vert\epsilon - \epsilon_{\theta}(\sqrt{\bar \alpha_t}x_0+\sqrt{1-\bar \alpha_t}\epsilon,t) \Vert^2]\]

which resembles denoising score matching over multiple noise scales indexed by $t$. Since this objective is equal to one term of the variational bound for the Langevin-like reverse process, we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.

To summarize, we can train the reverse process mean function approximator $\mu_{\theta}$ to predict $\tilde \mu_t$, or, by modifying its parameterization, we can train it to predict $\epsilon$.

We have shown that the $\epsilon$-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model's variational bound to an objective that resembles denoising score matching.
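To get a feel for the weight $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar \alpha_t)}$ in front of the $\epsilon$-prediction error, the sketch below evaluates it for every $t$ under the choice $\sigma_t^2=\beta_t$ and the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ used in the experiments. The helper names are assumptions for illustration.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Variances increasing linearly from beta_start to beta_end over T steps."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def eps_loss_weights(betas):
    """Weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)) with sigma_t^2 = beta_t."""
    weights, abar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        abar *= alpha
        weights.append(beta ** 2 / (2.0 * beta * alpha * (1.0 - abar)))
    return weights

weights = eps_loss_weights(linear_betas())
print(weights[0], weights[len(weights) // 2], weights[-1])
```

Under these assumptions the weight is largest at $t=1$, where it equals $\frac{1}{2\alpha_1}$, and much smaller at large $t$, so the variational bound puts most of its emphasis on the lowest noise levels.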

Data scaling, reverse process decoder, and $L_0$

We assume that image data consists of integers in $\{0,1,…,255\}$ scaled linearly to $[-1,1]$. This ensures that the neural network reverse process operates on consistently scaled inputs starting from the standard normal prior $p(x_T)$. To obtain discrete log likelihoods, we set the last term of the reverse process to an independent discrete decoder derived from the Gaussian $\mathcal{N}(x_0; \mu_{\theta}(x_1,1), \sigma_1^2I)$:

\[\begin{align*} p_{\theta}(x_0|x_1) &= \prod_{i=1}^D\int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)}\mathcal{N}(x;\mu_{\theta}^i(x_1,1),\sigma_1^2)dx \\ \delta_{+}(x)&=\begin{cases} \infty \space &\text{if }x=1 \\ x+\frac{1}{255}\space &\text{if }x<1 \end{cases} \ \ \ \ \ \delta_{-}(x)=\begin{cases} -\infty \space &\text{if }x=-1 \\ x-\frac{1}{255}\space &\text{if }x>-1 \end{cases} \end{align*}\]

where $D$ is the data dimensionality and the $i$ superscript indicates extraction of one coordinate.
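For a single coordinate, the decoder integral is a difference of Gaussian CDF values, with the two edge bins extended to infinity. The sketch below evaluates it with `math.erf`; the function names and the values of the mean and standard deviation are illustrative assumptions.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a univariate Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(k, mu, sigma):
    """Probability of pixel value k in {0, ..., 255} for one coordinate."""
    x = -1.0 + 2.0 * k / 255.0  # pixel value scaled linearly to [-1, 1]
    # delta_+ is +infinity at x = 1 and delta_- is -infinity at x = -1.
    upper = 1.0 if k == 255 else normal_cdf(x + 1.0 / 255.0, mu, sigma)
    lower = 0.0 if k == 0 else normal_cdf(x - 1.0 / 255.0, mu, sigma)
    return upper - lower

# Illustrative decoder mean and standard deviation (assumptions).
probs = [bin_probability(k, mu=0.1, sigma=0.05) for k in range(256)]
best_k = max(range(256), key=probs.__getitem__)
print(sum(probs), best_k)
```

Because adjacent bins share their boundaries, the 256 probabilities form a proper distribution over pixel values, which is what makes the resulting log likelihood a valid discrete one.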

Simplified training objective

With the reverse process and decoder defined above, the variational bound, consisting of the $\epsilon$-prediction terms and the decoder term, is clearly differentiable with respect to $\theta$ and is ready to be employed for training. However, we found it beneficial to sample quality (and simpler to implement) to train on the following variant of the variational bound:

\[L_{simple}(\theta) \doteq E_{t,x_0,\epsilon}[\Vert \epsilon-\epsilon_\theta(\sqrt{\bar \alpha_t}x_0+\sqrt{1-\bar \alpha_t}\epsilon,t) \Vert^2]\]

where $t$ is uniform between $1$ and $T$. The $t=1$ case corresponds to $L_0$ with the integral in the discrete decoder definition approximated by the Gaussian probability density function times the bin width, ignoring $\sigma_1^2$ and edge effects. The $t>1$ cases correspond to an unweighted version of the weighted $\epsilon$-prediction objective above, analogous to the loss weighting used by the NCSN denoising score matching model.

Since our simplified objective discards the weighting in that objective, it is a weighted variational bound that emphasizes different aspects of reconstruction compared to the standard variational bound.
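A single Monte Carlo sample of $L_{simple}$ can be sketched as follows. This is an illustration on plain Python lists rather than a training loop: `eps_model` is a hypothetical stand-in for the network, and the oracle model used in the example (which is given $x_0$ and can therefore recover $\epsilon$ exactly) exists only to show that the loss vanishes for a perfect noise predictor.

```python
import math
import random

def l_simple_sample(eps_model, x_0, abars, rng):
    """One Monte Carlo sample of L_simple for a single data point x_0 (a list)."""
    t = rng.randint(1, len(abars))             # t uniform on {1, ..., T}
    abar_t = abars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]   # eps ~ N(0, I)
    x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
           for x, e in zip(x_0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))  # squared L2 norm

# Illustrative schedule and data point (assumptions for this example only).
betas = [0.01, 0.02, 0.05, 0.1]
abars, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    abars.append(running)
x_0 = [0.5, -0.25, 0.0]

def oracle_eps_model(x_t, t):
    """Hypothetical oracle that knows x_0 and inverts the forward reparameterization."""
    a = abars[t - 1]
    return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x_0)]

def zero_eps_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return [0.0] * len(x_t)

oracle_loss = l_simple_sample(oracle_eps_model, x_0, abars, random.Random(0))
zero_loss = l_simple_sample(zero_eps_model, x_0, abars, random.Random(0))
print(oracle_loss, zero_loss)
```

In practice the expectation would be estimated over minibatches of data and the gradient taken with respect to the network parameters; neither is shown here.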

Result:

We set $T = 1000$ for all experiments so that the number of neural network evaluations needed during sampling matches previous work. We set the forward process variances to constants increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T=0.02$. These constants were chosen to be small relative to the data scaled to $[-1,1]$.
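The schedule above is easy to reproduce, and computing $\bar \alpha_T$ shows how little of $x_0$ survives at the end of the forward process. The helper name `linear_betas` is an assumption for this sketch; the endpoints and $T$ are the values stated above.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Forward process variances increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

betas = linear_betas()

# abar_T is the product of alpha_t = 1 - beta_t over all T steps.
abar_T = 1.0
for beta in betas:
    abar_T *= 1.0 - beta

print(betas[0], betas[-1], abar_T)
```

Since $q(x_T|x_0)=\mathcal{N}(x_T; \sqrt{\bar \alpha_T}x_0, (1-\bar \alpha_T)I)$, a very small $\bar \alpha_T$ means $x_T$ is close to the standard normal prior $p(x_T)$ regardless of $x_0$.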

Sampling quality:

Results