Variational Inference - ELBO

Categories: probability, optimization

Author: Gabriel Stechschulte

Published: June 23, 2022

We don’t know the real posterior, so we choose a distribution \(q(z)\) from a family of distributions \(Q\) that are easy to work with and parameterized by \(\theta\). The approximate distribution should be as close as possible to the true posterior, and this closeness is measured using the KL-divergence. If we have the joint \(p(x, z)\), where \(x\) is some observed data, the goal is to perform inference: given what we have observed, what can we infer about the latent states? That is, we want the posterior.

Recall Bayes' theorem:

\[p(z | x) = \frac{p(x|z)p(z)}{p(x)}\]

The problem is the marginal \(p(x = D)\), as computing it could require a hundred-, thousand-, or higher-dimensional integral:

\[p(x) = \int_{z_0} \cdots \int_{z_{D-1}} p(x, z) \, dz_0 \cdots dz_{D-1}\]
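To get a feel for why this blows up, here is a minimal sketch in Python (assuming a toy conjugate model, \(z \sim N(0, 1)\) and \(x \mid z \sim N(z, 1)\), which is not any model from this post): with a single latent dimension we can integrate \(z\) out on a grid, but the number of joint evaluations grows as (points per dimension)\(^D\).

```python
import numpy as np
from scipy import stats

# Toy joint: z ~ N(0, 1), x | z ~ N(z, 1), with a single observed x.
# With one latent dimension we can integrate z out on a grid.
x_obs = 1.5
z_grid = np.linspace(-10.0, 10.0, 2001)
dz = z_grid[1] - z_grid[0]
joint = stats.norm(0, 1).pdf(z_grid) * stats.norm(z_grid, 1).pdf(x_obs)

p_x_grid = np.sum(joint) * dz                     # Riemann-sum approximation
p_x_exact = stats.norm(0, np.sqrt(2)).pdf(x_obs)  # closed form for this toy model
print(f"grid: {p_x_grid:.4f}, exact: {p_x_exact:.4f}")

# The same grid strategy with D latent dimensions needs 2001**D evaluations:
for D in [1, 2, 5, 10]:
    print(f"D = {D:>2}: {2001.0 ** D:.2e} evaluations of p(x, z)")
```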

If we want the full posterior and can’t compute the marginal, then what’s the solution? Surrogate posterior. We want to approximate the true posterior using some known distribution:

\[q(z) \approx p(z|X=D)\]

where \(\approx\) means we want the approximate posterior to be “as good as possible”. Using variational inference, the objective is to minimize the distance between the surrogate \(q(z)\) and the true posterior \(p(z|x=D)\) using the KL-divergence:

\[q^*(z) = \underset{q(z) \in Q}{\operatorname{argmin}} \; KL(q(z) \,||\, p(z|x=D))\]

where \(Q\) is a family of simpler distributions. We can restate the KL-divergence as an expectation:

\[KL(q(z) \,||\, p(z|D)) = \mathbb{E}_{z \sim q(z)}\left[\log \frac{q(z)}{p(z|D)}\right]\]

which, writing out the expectation over \(z \sim q(z)\), is equivalent to the integral:

\[\int_{z_0} \cdots \int_{z_{D-1}} q(z) \log\frac{q(z)}{p(z|D)} \, dz_0 \cdots dz_{D-1}\]
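As a quick sanity check of this expectation form, here is a minimal sketch where two arbitrary 1-D Gaussians stand in for \(q(z)\) and \(p(z|D)\) (both chosen purely for illustration so that a closed-form KL exists): sampling from \(q\) and averaging the log-ratio approximates the integral above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Arbitrary stand-ins: q(z) = N(0, 1) and a "posterior" p(z|D) = N(1, 1.5),
# chosen only so that the closed-form KL is available for comparison.
q = stats.norm(loc=0.0, scale=1.0)
p = stats.norm(loc=1.0, scale=1.5)

# Monte Carlo estimate of E_{z ~ q}[log q(z) - log p(z|D)]
z = q.rvs(size=200_000, random_state=rng)
kl_mc = np.mean(q.logpdf(z) - p.logpdf(z))

# Closed-form KL between two univariate Gaussians
mu_q, s_q, mu_p, s_p = 0.0, 1.0, 1.0, 1.5
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

print(f"Monte Carlo KL: {kl_mc:.4f}, closed form: {kl_exact:.4f}")
```

Of course, this only works because the stand-in \(p(z|D)\) is known in closed form, which is exactly what we lack in practice.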

But sadly, we don’t have \(p(z \vert D)\), as this is the posterior! We only have the joint. Solution? Recall our KL-divergence:

\[KL(q(z) || p(z|D))\]

We can rearrange the term inside the \(\log\) so that we can actually compute something. We only have the joint: not the posterior, nor the marginal. But we know from Bayes' rule that the posterior can be expressed as the joint \(p(z, D)\) divided by the marginal \(p(D)\):

\[p(z|D) = \frac{p(z, D)}{p(D)}\]

We plug this inside of the \(\log\):

\[\int_{z} q(z) \log\left(\frac{q(z) \, p(D)}{p(z, D)}\right) dz\]

However, we have now reformulated the problem in terms of another quantity we don't have, i.e., the marginal \(p(D)\). But we can pull this unknown quantity out of the \(\log\) to form two separate integrals:

\[\int_z q(z) \log\left(\frac{q(z)}{p(z, D)}\right) dz + \int_z q(z) \log(p(D)) \, dz\]

This is a valid rearrangement because of the properties of logarithms: the term inside the \(\log\) contains a product, and the \(\log\) of a product is a sum, which splits the expression into two integrals. What do we see in these two terms? An expectation over \(\log\frac{q(z)}{p(z, D)}\) and another expectation over \(\log(p(D))\). Rewriting in terms of expectations:

\[\mathbb{E}_{z \sim q(z)}\left[\log\left(\frac{q(z)}{p(z, D)}\right)\right] + \mathbb{E}_{z \sim q(z)}[\log(p(D))]\]

The left term contains information we know: the functional form of the surrogate \(q(z)\) and the joint \(p(z, D)\) (in the form of a directed graphical model). We still don't have access to \(p(D)\) in the right term, but it is a constant. The expectation of a quantity that does not depend on \(z\) is just that quantity itself, so \(\mathbb{E}_{z \sim q(z)}[\log(p(D))] = \log(p(D))\). Because of this, we can rearrange once more:

\[-\mathbb{E}_{z \sim q(z)}\left[\log \frac{p(z, D)}{q(z)}\right] + \log(p(D))\]

The minus sign results from swapping the numerator and denominator inside the \(\log\) and is required to keep the expression equivalent. Looking at this, the expectation term is a function that depends only on \(q\); in shorthand, we call it \(\mathcal{L}(q)\). Our KL-divergence is:

\[KL = -\mathcal{L}(q) + \underbrace{\log(p(D))}_\textrm{evidence}\]

where \(p(D)\) is a probability between \([0, 1]\), and its logarithm, \(\log(p(D))\), is called the evidence: the log probability of the data. Applying the \(\log\) to something between \([0, 1]\) gives a negative value. This value is also constant, since we have observed the dataset and it does not change.

\(KL\) is a distance (between the posterior and the surrogate), so it must be non-negative. Since \(KL = -\mathcal{L}(q) + \log(p(D)) \geq 0\), it follows that \(\mathcal{L}(q) \leq \log(p(D))\); and because the evidence is negative, \(\mathcal{L}(q)\) must also be negative. Since \(\mathcal{L}(q)\) is always smaller than (or equal to) the evidence, it is called the lower bound of the evidence \(\rightarrow\) Evidence Lower Bound (ELBO).
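To make this concrete, here is a minimal sketch using an assumed conjugate toy model (prior \(z \sim N(0, 1)\), likelihood \(x \mid z \sim N(z, 1)\), and a deliberately imperfect Gaussian surrogate, none of which come from this post). Because the model is conjugate, the exact posterior and evidence are available, so we can check numerically that \(KL = -\mathcal{L}(q) + \log(p(D))\) and that the ELBO sits below the evidence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), one observation.
x_obs = 1.5

# Exact quantities, available only because this toy model is conjugate.
posterior = stats.norm(loc=x_obs / 2, scale=np.sqrt(0.5))          # p(z|D)
log_evidence = stats.norm(loc=0, scale=np.sqrt(2)).logpdf(x_obs)   # log p(D)

# A deliberately imperfect surrogate q(z).
q = stats.norm(loc=0.5, scale=1.0)
z = q.rvs(size=500_000, random_state=rng)

# ELBO: L(q) = E_{z ~ q}[log p(z, D) - log q(z)], needs only the joint.
log_joint = stats.norm(0, 1).logpdf(z) + stats.norm(z, 1).logpdf(x_obs)
elbo = np.mean(log_joint - q.logpdf(z))

# KL(q || posterior), estimated with the same samples.
kl = np.mean(q.logpdf(z) - posterior.logpdf(z))

print(f"KL                : {kl:.4f}")
print(f"-ELBO + evidence  : {-elbo + log_evidence:.4f}")  # matches the KL
print(f"ELBO <= evidence? : {elbo <= log_evidence}")      # lower bound holds
```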

Again, the ELBO is defined as \(\mathcal{L}(q) = \mathbb{E}_{z \sim q(z)}\left[\log\left(\frac{p(z, D)}{q(z)}\right)\right]\), and it is important to note that the ELBO is equal to the evidence if and only if the KL-divergence between the surrogate and the true posterior is \(0\):

\[\mathcal{L}(q) = \log(p(D)) \quad \textrm{iff} \quad KL(q(z) \,||\, p(z|D)) = 0\]
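Finally, since \(\mathcal{L}(q)\) only requires the joint, we can maximize it directly. Below is a minimal sketch that reuses the same assumed toy model and a Gaussian surrogate family, and simply minimizes a Monte Carlo estimate of the negative ELBO with `scipy.optimize.minimize` (one convenient choice for a two-parameter problem, not a prescribed method). The optimized \(q\) recovers the exact posterior and the ELBO approaches the evidence, as the identity above says it should.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
x_obs = 1.5

# Fixed noise (common random numbers) so the Monte Carlo objective is smooth.
eps = rng.standard_normal(100_000)

def negative_elbo(params):
    """-L(q) for q = N(mu, sigma^2), estimated using only the joint p(z, D)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                              # samples from q
    log_q = stats.norm(mu, sigma).logpdf(z)
    log_joint = stats.norm(0, 1).logpdf(z) + stats.norm(z, 1).logpdf(x_obs)
    return -np.mean(log_joint - log_q)

result = optimize.minimize(negative_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"optimized q     : N({mu_hat:.3f}, {sigma_hat:.3f}^2)")
print(f"exact posterior : N({x_obs / 2:.3f}, {np.sqrt(0.5):.3f}^2)")
print(f"ELBO at optimum : {-result.fun:.4f}")
print(f"log evidence    : {stats.norm(0, np.sqrt(2)).logpdf(x_obs):.4f}")
```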