Evidence Lower Bound
Epistemic Status: This note is me thinking out loud. Some formulas deviate from what I have seen in the literature. Some things here might be wrong.
In Bayesian statistics, the "evidence" is the marginal probability of the observed data.
Say our objective is to model the relationship between the observed data \(x\) and hidden latent variables \(z\).
The true posterior over the latent variables is \(p(z|x)\), which is not accessible to us. But we can model it with an approximation \(q(z|x)\) and try to minimize the KL divergence between our model and the true posterior.
\begin{align*} D_{KL}\left(q(z|x) || p(z|x)\right) &= E_{z \sim q} \left [ \log q(z|x) - \log p(z|x) \right ] \\ &= \log p(x) - E_{z \sim q(z|x)} \left[ \log p(x,z) - \log q(z|x) \right ] \end{align*}Since the true evidence \(p(x)\) is a constant, minimizing the KL divergence is the same as maximizing the second term, which is called the ELBO (Evidence Lower Bound).
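The identity above can be checked numerically on a toy discrete distribution. The sketch below (a hypothetical 2×3 joint table, not anything from the literature) verifies that \(D_{KL}(q(z|x) \,||\, p(z|x))\) equals \(\log p(x) - E_q[\log p(x,z) - \log q(z|x)]\):

```python
import numpy as np

# Hypothetical toy joint p(x, z) over 2 x-values and 3 z-values
rng = np.random.default_rng(0)
joint = rng.random((2, 3))
joint /= joint.sum()           # normalize so the table is a valid p(x, z)

x = 0                          # condition on one observed x
p_x = joint[x].sum()           # evidence p(x)
post = joint[x] / p_x          # true posterior p(z|x)

q = np.array([0.5, 0.3, 0.2])  # an arbitrary approximation q(z|x)

# Left-hand side: KL(q(z|x) || p(z|x))
kl = np.sum(q * (np.log(q) - np.log(post)))

# Right-hand side: log p(x) - E_q[log p(x,z) - log q(z|x)]
rhs = np.log(p_x) - np.sum(q * (np.log(joint[x]) - np.log(q)))

print(np.isclose(kl, rhs))  # True
```

The two sides agree for any choice of \(q\), since the identity is pure algebra: \(\log p(z|x) = \log p(x,z) - \log p(x)\).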
Now the ELBO can be expressed as:
\begin{align*} ELBO &= E_{z \sim q(z|x)} \left[ \log p(x,z) - \log q(z|x) \right ] \\ &= E_{z \sim q(z|x)} \left[ \log p(x|z) + \log p(z) - \log q(z|x) \right ] \\ &= E_{z \sim q(z|x)} \left[ \log p(x|z) \right ] - E_{z \sim q(z|x)} \left[ \log q(z|x) - \log p(z) \right ] \\ &= E_{z \sim q(z|x)} \left[ \log p(x|z) \right ] - D_{KL} \left( q(z|x) || p(z) \right ) \end{align*}The objective is to maximize the ELBO. The first term says we have to find \(q(z|x)\) such that sampling from it increases the likelihood \(\log p(x|z)\) of recovering \(x\). And the second term says we should minimize the divergence of \(q(z|x)\) from the prior \(p(z)\) we impose on \(z\).
But we don't have access to the true distribution \(p(x|z)\) either, so we again model it with an approximation \(q(x|z)\). The ELBO then becomes
\begin{align*} ELBO \approx E_{z \sim q(z|x)}\left[\log q(x|z) \right] - D_{KL} \left( q(z|x) || p(z) \right ) \end{align*}This now leaves us with the choice of how to model the conditional distributions \(q(x|z)\) and \(q(z|x)\). In the context of Variational Autoencoders, the first one is called the decoder and the second one the encoder. Parameterized respectively by sets of parameters \(\theta\) and \(\phi\), they are written in the literature as \(p_\theta (x|z)\) and \(q_\phi(z|x)\).
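The encoder/decoder split can be sketched in a few lines. The code below is a minimal illustration, not a trainable model: the encoder and decoder are hypothetical linear maps (real VAEs use neural networks), and the latent sample uses the reparameterization \(z = \mu + \sigma \epsilon\) with \(\epsilon \sim N(0, I)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 4, 2  # hypothetical data and latent dimensions

# Hypothetical linear encoder q(z|x): x -> (mu(x), log sigma(x))
W_mu = rng.normal(size=(z_dim, x_dim))
W_logsig = rng.normal(size=(z_dim, x_dim))
# Hypothetical linear decoder q(x|z): z -> reconstruction D(z)
W_dec = rng.normal(size=(x_dim, z_dim))

def encode(x):
    return W_mu @ x, W_logsig @ x   # mu(x), log sigma(x)

def decode(z):
    return W_dec @ z                # D(z), the mean of q(x|z)

def sample_z(mu, log_sig):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
    return mu + np.exp(log_sig) * rng.normal(size=mu.shape)

x = rng.normal(size=x_dim)
mu, log_sig = encode(x)
x_hat = decode(sample_z(mu, log_sig))
print(x_hat.shape)  # (4,)
```

The reparameterization step is what makes the sampling differentiable with respect to the encoder parameters, which is why it appears in the standard VAE training procedure.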
1. VAE Formulation
1.1. Decoder
If we model \(q(x|z)\) as a multivariate Gaussian distribution with
- mean \(D(z) = E [q(x|z)]\) (where we say \(D\) is a decoder and \(\hat x = D(z)\) is reconstruction of \(x\))
- and covariance matrix \(\sigma^2 I\) (i.e. all the output dimensions are independent with constant variance \(\sigma^2\))
Then the first term reduces to a constant minus the scaled mean squared error between the original data \(x\) and the reconstruction \(D(z)\):
\begin{align*} ELBO \approx C - \frac 1 {2\sigma^2} E_{z \sim q(z|x)} \left[ || x - D(z) ||_2^2 \right ] - D_{KL} \left( q(z|x) || p(z) \right ) \end{align*}This is the formulation used in Variational Autoencoders. The smaller we choose \(\sigma\), the larger the weight on the mean squared error.
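As a sanity check on this reduction, we can compare the Gaussian log-likelihood computed by a library against the constant-minus-scaled-MSE form, with \(C = -\frac d 2 \log(2\pi\sigma^2)\). The \(x\) and \(D(z)\) vectors below are arbitrary placeholders:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, sigma = 5, 0.5           # hypothetical data dimension and fixed std
x = rng.normal(size=d)      # original data
x_hat = rng.normal(size=d)  # stand-in for the reconstruction D(z)

# Direct log-likelihood log q(x|z) for mean D(z), covariance sigma^2 I
ll = norm.logpdf(x, loc=x_hat, scale=sigma).sum()

# Constant-minus-scaled-MSE form from the text
C = -0.5 * d * np.log(2 * np.pi * sigma**2)
ll_mse = C - np.sum((x - x_hat)**2) / (2 * sigma**2)

print(np.isclose(ll, ll_mse))  # True
```

Since \(C\) does not depend on the parameters, maximizing this term is equivalent to minimizing the (scaled) mean squared error.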
1.2. Encoder
Now we also have a choice in how to model \(q(z|x)\). If we model it as a multivariate Gaussian with
- mean \(\mu(x)\)
- and covariance matrix \(\sigma(x)^2 I\)
and similarly model the prior \(p(z)\) as a Gaussian with mean 0 and covariance matrix \(I\), then we can express the KL divergence in closed form:
\begin{align*} D_{KL} \left( q(z|x) || p(z) \right) = \frac 1 2 \left(N \sigma(x)^2 + ||\mu(x)||_2^2 - N - 2 N \log \sigma (x) \right) \end{align*}where \(N\) is the dimension of the latent vector \(z\).
If we instead choose a diagonal covariance matrix \(\mathrm{diag}(\sigma_1(x)^2, \dots, \sigma_N(x)^2)\), then the KL divergence can be computed as follows:
\begin{align*} D_{KL} \left( q(z|x) || p(z) \right) = \frac 1 2 \sum_{i=1}^{N} \left(\sigma_i(x)^2 + \mu_i(x)^2 - 1 - 2 \log \sigma_i (x) \right) \end{align*}This is the standard formulation used in Variational Autoencoders.
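The closed form can be cross-checked against a Monte Carlo estimate of \(E_{z \sim q}[\log q(z|x) - \log p(z)]\). The \(\mu_i(x)\) and \(\sigma_i(x)\) values below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3
mu = rng.normal(size=N)                 # stand-in for mu(x)
sig = np.exp(0.3 * rng.normal(size=N))  # stand-in for sigma_i(x) > 0

# Closed form from the text
kl_closed = 0.5 * np.sum(sig**2 + mu**2 - 1 - 2 * np.log(sig))

# Monte Carlo estimate: E_{z~q}[log q(z|x) - log p(z)]
z = mu + sig * rng.normal(size=(200_000, N))
log_q = np.sum(-0.5 * np.log(2 * np.pi * sig**2)
               - (z - mu)**2 / (2 * sig**2), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates should agree closely
```

Having the KL term in closed form means no sampling is needed for that half of the loss; only the reconstruction term requires the latent sample.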
2. Discussion
To approximate the true posterior \(p(z|x)\), we formulated the problem as minimizing the KL divergence between our model \(q(z|x)\) and the true posterior, which in turn got expressed as the ELBO.
Why didn't we do the same for \(q(x|z)\) ?
One reason is that doing so would recursively formulate the problem in terms of \(p(z|x)\) again. Another reason is that we don't need to go down the route of KL divergence and ELBO at all: since we know the true \(x\), we can express the objective directly as a reconstruction loss. But for \(q(z|x)\) we don't know the true \(z\), so there is nothing to compare against. There is no reconstruction loss for latent variables.
Why do I write \(ELBO \approx\) while the standard literature writes \(ELBO =\)?