Variational Inference

July 6, 2022

[This is a note, which is the seed of an idea and something I’ve written quickly, as opposed to articles that I’ve optimized for readability and transmission of ideas.]

This is a primer on variational inference in machine learning, based on sections of Jordan et al. (An Introduction to Variational Methods for Graphical Models; 1999). I go over the mathematical forms of variational inference, and I include a discussion on what it means for something to be “variational.” I hope this conveys a bit of the generating ideas that give rise to the various forms of variational inference.

\newcommand{\kl}[2]{D_{\text{KL}}\left(#1\ \| \ #2\right)}


My experience with other sources that explain variational inference is that they leave me confused about what the word “variational” is pointing at. E.g. Murphy 2012 (Machine Learning: a Probabilistic Perspective), Bishop 2006 (Pattern Recognition and Machine Learning), Blei 2016 (Variational Inference: A Review for Statisticians), and even Wikipedia: Variational Bayesian methods.

Aside from being irksome, I find the lack of clarity around this word inhibits my ability to grok what authors are thinking when they write about variational inference. Specifically, I want to understand the underlying generating idea for this class of methodologies. As an ML practitioner and researcher, aim to build off of existing ideas. To do that I need to know how previous authors thought about their ideas. Having the generator of an idea, I can play with it and potentially generate something not previously considered.

To that end, this section is a detour to explore the meaning of the word “variational” in the context of variational inference. If this does not interest you, feel free to skip to the next section, #Variational Inference.

What is Variational ?

In the context of variational inference, the word “variational” refers to the calculus of variations (CoV). Some variational inference texts mention the CoV connection explicitly, e.g. Bishop 2006, section 10.1.

According to Wikipedia,

The calculus of variations is a field of mathematical analysis that uses variations, which are small changes in functions and functionals, to find maxima and minima of functionals: mappings from a set of functions to the real numbers. gives a slightly different description for the CoV,

A branch of mathematics that is a sort of generalization of calculus. Calculus of variations seeks to find the path, curve, surface, etc., for which a given function has a stationary value (which, in physical problems, is usually a minimum or maximum).

The “stationary value” here refers to the min or max of a functional, and a functional is a mapping from functions to real numbers. Both accounts of the CoV make reference optimizing a functional.

If we want to use “variational” as an adjective, e.g. in “variational method” (examples), how could we reasonably extrapolate from the above description of the CoV?

Here’s my proposal: something is “variational” if it involves an optimization problem with uncountably many degrees of freedom. Equivalently, we are searching over a function space to find an optimal function (according to our optimization criteria). Then a “variation” is some “perturbation” to a candidate function that moves us infinitesimally in some direction in function space.

So with this notion, regular calculus is not variational, despite involving searching over function spaces, because a given function is either the solution or not, and is not scored. I suppose that if we recast integration and differentiation as functionals whose optima are the derivatives and integrals of the desired function, we would be reposing calculus as a variational problem (similar to the kind of problem-reposing we will see in the next section).

Variational methods in ML

Jordan et al., section 4, Basics of variational methodology, gives us a primer on variational methods.

Let us begin by considering a simple example. In particular, let us express the logarithm function variationally:
$$\ln(x) = \min_\l\set{\l x-\ln\l-1}\,.$$
In this expression $\l$ is the variational parameter, and we are required to perform the minimization for each value of $x$.

Jordan et al. is implicitly defining what “variational” means here. We are given an optimization criteria, namely $\fa x,\ \min_{\l_x}\set{\l_x x-\ln\l_x-1}$, which has infinitely many optimization parameters, $\l_x$ for every $x$. We could view this optimization as being over the space of (continuous) functions, $\R\to\R$, and searching for some $\l:\R\to\R$ s.t. $\l(x) x-\ln\l(x)-1$ is minimal for every $x\in(0,\infty)$. In this case, $\l(x)=1/x$ is the solution.

Note that in this problem we don’t have a functional. Nevertheless Jordan et al. is taking this to be a variational method. If we are to broaden our notion of variational from the previous section, #What is Variational ?, to accommodate this usage, we could say that a variational problem is an optimization problem over an infinite dimensional function space, or equivalently with infinitely many optimization parameters. In most cases the optimization problem can be specified with a functional, but in this case it is not.

Question: Is $\min_{\l:\R\to\R}\set{\int_\X\l(x) x-\ln\l(x)-1\ \d x}$ an equivalent problem? Then $\mc{F}[\l]=\int_\X\l(x) x-\ln\l(x)-1\ \d x$ is our functional.

To generalize this variational method, suppose we have a function $f(x)$ which is intractable to calculate directly. Luckily, we know of some other tractable function, $g(x,\l)$, s.t. $g(x,\l) \geq f(x)$ for all $\l\in\R$ and $x\in\X$. Furthermore, $g(x,\l) = f(x)$ for some $\l\in\R$, for all $x\in\X$ - i.e. $g$ is a tight upper bound of $f$. Let $\l(x) = \argmin_{\tilde{\l}}g(x,\tilde{\l})$. Then it is the case that $f(x) = g(x,\l(x))$ for all $x\in\X$. So for any $f$, a suitable choice of $g$ gives us a variational form of $f(x)$.

If we are not able to perform the exact minimization $\argmin_{\l}g(x,\l)$ symbolically, but we instead find an approximate minimum $\hat{\l}_x$ numerically (e.g. with gradient descent). Then $g(x,\hat{\l}_x)$ is an approximation of $f(x)$, and $g(x,\hat{\l}_x) \geq f(x)$ is guaranteed. This is now useful as a way to numerically approximate $f(x)$ by converting it into an optimization problem which we know how to numerically approximate.


Let $x = (x_1,x_2,x_3,\dots) \in \X = \X_1\pr\X_2\pr\X_3\pr\dots$ and $z = (z_1,z_2,z_3,\dots) \in \Z = \Z_1\pr\Z_2\pr\Z_3\pr\dots$ be finite or infinite length tuples, where $x$ is the data variable and $\X$ is the data space, and $z$ is the hidden or latent variable and $\Z$ is the latent space.

When we visualize $(x,z)$ as a graph, we call each $x_i$ or $z_j$ a node (in the context of the graphical model literature). However, in this post I will call them dimensions. Each $\X_i$ and $\Z_j$ can be a finite or countable set, or an uncountable set with a metric space (typically the Euclidean metric on $\R$).

Suppose we have a family of joint distributions $\set{p_\t\mid\t\in\T}$, i.e. $p_\t(x,z)$ is the joint probability of $x$ and $z$ for some choice of $\t\in\T$. We say that $p_\t$ is parametrized by $\t$, where $\T$ is a finite dimensional metric space, e.g. $\T\subseteq\R^r$ with $r\in\N$.

We assume that $p_\t(x,z)$ and $p_\t(z)$ are tractable (i.e. fast) quantities to calculate with a computer (to sufficient precision) for all $(x,z)\in\X\pr\Z$, but that $p_\t(x)$ and $p_\t(z\mid x)$ are intractable (too slow to be useful) while also being quantities of interest.

Variational inference is a method for tractably approximating $p_\t(x)$ and $p_\t(z\mid x)$ for all $(x,z)\in\X\pr\Z$.

Note about probability notation

When I write $p_\t(x,z)$, I am treating $p_\t$ as a function of $x$ and $z$ that outputs a probability mass or density.

However, I will overload $p_\t$, by argument name, to represent a number of related functions. For instance, the marginal probabilities $p_\t(x) = \int_\Z p_\t(x,z)\ \d z$ and $p_\t(z) = \int_\X p_\t(x,z)\ \d x$ (replacing integrals with sums when variables are discrete) are different quantities depending on whether the argument to $p_\t$ is $x$ or $z$. Additionally I might consider marginal probabilities of specific dimensions, e.g. $p_\t(x_{i_1},x_{i_2},\dots,z_{j_1},z_{j_2},\dots)$ is the integral of $p_\t(x,z)$ over all dimensions not included as arguments.

Furthermore, we have conditional probabilities, e.g. $p_\t(x\mid z)=p_\t(x,z)/p_\t(z)$ or $p_\t(x_{i_1},z_{j_1}\mid x_{i_2},z_{j_2})=p_\t(x_{i_1},x_{i_2},z_{j_1},z_{j_2})/p_\t(x_{i_2},z_{j_2})$.

This covers the various functions that $p_\t$ can represent, and you can see how the form of the arguments to $p_\t$ determines which function we are considering.

Variational Inference

See Jordan et al., section 6, The block approach.

Continuing from the #Setup above, $p_\t(z\mid x)$ is an intractable quantity (for most $z$ and $x$). So instead we introduce the family of tractable distributions $q_\p$ on $z$ parametrized by $\p\in\P$, where $q_\p(z)$ is intended to be our approximation of $p_\t(z\mid x)$ for a particular $x$, and $\p$ will be our variational parameter(s) w.r.t. $x$ (i.e. the optimization solution will be a function from $\X$ to $\P$).

Define the following quantities,

\M(\t;\ x) &\df \log\par{1/p_\t(x)}\,, \\
\Er(\t,\p;\ x) &\df \kl{q_\p(z)}{p_\t(z\mid x)} \\&= \int_\Z q_\p(z)\log\par{\frac{q_\p(z)}{p_\t(z\mid x)}}\ \d z\,, \\
\L(\t,\p;\ x) &\df \M(\t;\ x) + \Er(\t,\p;\ x)\,.

($\kl{\cdot}{\cdot}$ is the KL-divergence, which is always non-negative.)

All three quantities are non-negative for all $\t,\p,x$. Then we have

\L(\t,\p;\ x) \geq \M(\t;\ x)\,,

with equality when $\Er(\t,\p;\ x)=0$, i.e. when $q_\p(z)=p_\t(z\mid x)$ for all $z$.

Let $\t$ be constant. When $p_\t(\cdot \mid x) \in\set{q_\p(\cdot)\mid \p\in\P}$ for all $x$, then we can write $\M(\t;\ x) = \min_\p\L(\t,\p;\ x)$, with the solution being some function $\p^*:\X\to\P$. Hence we have a variational optimization problem whose solution is a function.

Otherwise if $p_\t(\cdot \mid x) \notin\set{q_\p(\cdot)\mid \p\in\P}$, then $\L(\t,\p;\ x)$ is forever an upper bound of $\M(\t;\ x)$, but its minimum may still be close enough to $\M(\t;\ x)$ for it to be a reasonable substitute. So the variational minimization problem, $\fa x,\ \min_{\p_x}\L(\t,\p_x;\ x)$, is of interest in either case.

A nice property of this particular setup is that the minimization objective, $\fa x,\ \min_{\p_x}\L(\t,\p_x;\ x)$, will also achieve $\fa x,\ \min_{\p_x}\Er(\t,\p_x;\ x)$ because $\M(\t;\ x)$ remains constant w.r.t. $\p_x$ (i.e. the gap $\Er(\t,\p_x;\ x)=\L(\t,\p;\ x) - \M(\t;\ x)$ is minimized). Minimal $\Er(\t,\p_x;\ x)$ in turn implies that $q_\p(z)$ is the best approximation of $p_\t(z\mid x)$ for all $z,x$. So from our single variational objective, we get two approximations of intractable quantities: $\L(\t,\p_x;\ x)$ for $\M(\t;\ x)$ which gives us $p_\t(x)$, and $q_{\p_x}(z)$ for $p_\t(z\mid x)$.

Note that a tractable approximation $\L(\t,\p_x;\ x)$ of $\M(\t;\ x)$ automatically gives us a tractable approximation $q_{\p_x}(z)$ of $p_\t(z\mid x)$, and vice versa, since we can easily get one from the other using $\M(\t;\ x)=\log\par{p_\t(z\mid x)/p_\t(x,z)}$. Indeed, replacing $p_\t(z\mid x)$ for $q_{\p_x}(z)$ in our formula for $\M(\t;\ x)$ does get us to $\L(\t,\p_x;\ x)$,
$$\begin{aligned}\M(\t;\ x)&= \log\par{p_\t(z\mid x)/p_\t(x,z)} \\&=\E_{z\sim q_{\p_x}(z)}\brak{\log\par{p_\t(z\mid x)/p_\t(x,z)}} \\&\leq \E_{z\sim q_{\p_x}(z)}\brak{\log\par{q_{\p_x}(z)/p_\t(x,z)}}\\&=\L(\t,\p_x;\ x)\,.\end{aligned}$$


Suppose $\M(\t;\ x)$ is intractable. Then we can subsitute it for our variational approximation, $\L(\t,\p;\ x)$. Now the question is whether we can tractably calculate $\L(\t,\p;\ x)$ for all $\t,\p,x$.

We have

\L(\t,\p;\ x)=\kl{q_\p(z)}{p_\t(z)} + \E_{z\sim q_\p(z)}\left[\log\par{1/p_\t(x \mid z)}\right]\,,

where $\kl{q_\p(z)}{p_\t(z)}=\int_\Z q_\p(z)\log\par{\frac{q_\p(z)}{p_\t(z)}}\ \d z$
and $\E_{z\sim q_\p(z)}\left[\log\par{1/p_\t(x \mid z)}\right]=\int_\Z q_\p(z)\log\par{1/p_\t(x \mid z)}\ \d z$.

See #Appendix for derivation.

This form is promising because it involves only quantities that we know are tractable, $q_\p(z)$, $p_\t(z)$ and $p_\t(x \mid z)$. However, calculating expectations w.r.t. $q_\p(z)$, and optimizing $\p$ through those expectations (e.g. with gradient descent) may still be tricky to perform tractably, and further approximations are likely needed. E.g. Kingma et al. discusses this problem and ways to get around it.

Maximum Likelihood Estimation

See Jordan et al., section 6.2, Parameter estimation via variational methods.

Often, we are interested in solving

\t^* = \argmin_{\t} \M(\t;\ x)\,.

This is maximum likelihood estimation, which is one way to fit a model (i.e. the family of distributions $p_\t$ for $\t\in\T$) to a dataset. Here, $\M(\t;\ x)$ is called a loss function. Often, we cannot solve this optimization exactly, so we use numerical optimization, such as gradient descent w.r.t. $\t$.

When the loss $\M(\t;\ x)$ is intractable to calculate, we can replace it with the upper bound $\L(\t,\p;\ x)$ and then perform the joint minimization

(\t^*,\p^*)=\argmin_{\t,\p} \L(\t,\p;\ x)\,,

which also gives us the approximation $q_{\p^*}(z)\approx p_{\t^*}(z\mid x)$.

i.i.d. datasets

Usually, we are performing maximum likelihood estimation on a dataset that we are modeling as i.i.d. Specifically, our dataset is a list, $X=(x\up{1},\dots,x\up{n})$, of $n$ instances of the visible dimensions, where each $x\up{k} \in \X$. As before we have the parametrized distribution $p_\t(x,z)$ on $\X\pr\Z$. Since we assume the dataset is i.i.d., then we have

$$p_\t(X) = p_\t(x\up{1},\dots,x\up{n}) = \prod_{k=1}^n p_\t(x\up{k})\,,$$

and when $Z \in \Z^n$,

$$p_\t(X,Z) = \prod_{k=1}^n p_\t(x\up{k},z\up{k})\,.$$

Then we have

$$\begin{aligned}\M(\t;\ X) &= \log\par{1/p_\t(X)} \\&= \sum_{k=1}^n \log\par{1/p_\t(x\up{k})} \\&= \sum_{k=1}^n\M(\t;\ x\up{k})\,.\end{aligned}$$

We want to approximate $p_\t(z\mid x)$ like before, but unlike before $x$ is not held fixed. In addition to having $n$ instances of $x$ in our dataset, we also want to efficiently calculate $p_\t(z\mid x)$ on any arbitrary $x\in\X$ outside our dataset.

Following the example of #Variational methods in ML, let’s perform a separate minimization for every $x\in\X$. That is to say, let $q_\p(z)$ be a distribution over $\Z$ like before, with $\Er(\t,\p;\ x)=\kl{q_\p(z)}{p_\t(z\mid x)}$ and $\L(\t,\p;\ x) = \M(\t;\ x) + \Er(\t,\p;\ x)$.

Then for each $x\up{k} \in X$ we have separate parameters $\p_{x\up{k}}$ so that $q_{\p_{x\up{k}}}(z)$ is our working approximation for $p_\t(z\mid x\up{k})$. Then our approximation of $p_\t(Z\mid X)$ is $q_{\p_{x\up{1}},\dots,\p_{x\up{n}}}(Z)=\prod_{k=1}q_{\p_{x\up{k}}}(z\up{k})$, and so

&\Er(\t,\p;\ X) \\
&= \int_{\Z^n} q_{\p_{x\up{1}},\dots,\p_{x\up{n}}}(Z)\log\par{\frac{q_{\p_{x\up{1}},\dots,\p_{x\up{n}}}(Z)}{p_\t(Z\mid X)}}\ \d Z \\
&= \int_{\Z^n} \brak{\prod_{k=1}^n q_{\p_{x\up{k}}}(z\up{k})}\brak{\sum_{k=1}^n\log\par{\frac{q_{\p_{x\up{k}}}(z\up{k})}{p_\t(z\up{k}\mid x\up{k})}}}\ \d z\up{1}\dots\d z\up{n} \\
&= \sum_{k=1}^n \int_{\Z} q_{\p_{x\up{k}}}(z)\log\par{\frac{q_{\p_{x\up{k}}}(z)}{p_\t(z\mid x\up{k})}}\ \d z \\
&= \sum_{i=1}^n \Er(\t,\p_{x\up{k}};\ x\up{k})\,.

Then our dataset loss is $\sum_{k=1}^n\L(\t,\p_{x\up{k}};\ x\up{k})$ with $\L(\t,\p_{x\up{k}};\ x\up{k}) = \M(\t;\ x\up{k}) + \Er(\t,\p_{x\up{k}};\ x\up{k})$.

To fit the model to $X=(x\up{1},\dots,x\up{n})$ w.r.t. $\t$, we perform the joint minimization,

(\ht,\hp_{x\up{1}},\dots,\hp_{x\up{n}}) = \argmin_{\t,\p_{x\up{1}},\dots,\p_{x\up{n}}} \sum_{k=1}^n\L(\t,\p_{x\up{k}};\ x\up{k})\,,
to obtain $\ht$ (we can discard the $\set{\hp_{x\up{k}}}$).

Then for any $x\in\X$ of interest, we perform the minimization

\hp_x = \argmin_{\p} \L(\ht,\p;\ x)\,,

giving us $q_{\hp_x}(z) \approx p_\ht(z\mid x)$.

Another Approach

An alternative way to approximate $p_\t(z\mid x)$ for arbitrary $x\in\X$ is to have our approximate distribution $q_\p$ be a function of both $z$ and $x$ (for a single choice of parameters $\p$). This is typically notated as $q_\p(z\mid x)$, though this is technically an abuse of notation since we do not define the joint $q_\p(x,z)$. We could instead write $q_{\p,x}(z)$ or $q_{\p}(z;\ x)$, both of which make it clear that the probability distribution is only over $\Z$-space. However, I will continue with $q_\p(z\mid x)$ since it visually mimics $p_\t(z\mid x)$ and I find that easier to think about. This gives us

\L(\t,\p;\ x) = \E_{z\sim q_{\p}(z\mid x)}\brak{\log\par{\frac{q_{\p}(z\mid x)}{p_\t(x,z)}}}

Then we have $q_\p(Z\mid X)=\prod_{k=1}q_\p(z\up{k}\mid x\up{k})$, and so

&\Er(\t,\p;\ X) \\
&= \int_{\Z^n} q_\p(Z\mid X)\log\par{\frac{q_\p(Z\mid X)}{p_\t(Z\mid X)}}\ \d Z \\
&= \int_{\Z^n} \brak{\prod_{k=1}^n q_\p(z\up{k}\mid x\up{k})}\brak{\sum_{k=1}^n\log\par{\frac{q_\p(z\up{k}\mid x\up{k})}{p_\t(z\up{k}\mid x\up{k})}}}\ \d z\up{1}\dots\d z\up{n} \\
&= \sum_{k=1}^n \int_{\Z} q_\p(z\mid x\up{k})\log\par{\frac{q_\p(z\mid x\up{k})}{p_\t(z\mid x\up{k})}}\ \d z \\
&= \sum_{i=1}^n \Er(\t,\p;\ x\up{k})\,.

Then our dataset loss is $\L(\t,\p;\ X) = \sum_{k=1}^n\L(\t,\p;\ x\up{k})$ with $\L(\t,\p;\ x\up{k}) = \M(\t;\ x\up{k}) + \Er(\t,\p;\ x\up{k})$.

Our optimization problem becomes

(\ht,\hp) = \argmin_{\t,\p} \sum_{k=1}^n\L(\t,\p;\ x\up{k})\,.

Then $q_\hp(z\mid x) \approx p_\ht(z\mid x)$ for any $x\in\X$ of interest.

This approach may let us approximate $p_\t(z\mid x)$ faster since we don’t rerun our optimization process for every $x$.

Question: What other reasons are there to prefer one of these two approaches to i.i.d. MLE. What are the pros and cons of each?

Variational Bayes

See Jordan et al., section 7.1.3, Bayesian methods.

Variational inference (VI) applied to Bayesian inference is sometimes called “variational Bayes” (VB) (short for “variational Bayesian inference”). Mathematically VI and VB are the same. Whether we are doing Bayesian inference or just inference is a difference about the meaning we ascribe to that math, which determines how we apply it.

Bayesian inference operates in a paradigm which I would call “Bayesian epistemology.” Informally, in this paradigm we observe pieces of data over a period of time, we have a set of hypotheses which remain unobserved, and each hypothesis gives us a prediction about what we will observe in the future given the present. To each hypothesis we assign a weight representing our belief about the plausibility of that hypothesis. Then we take the weighted average of all the hypotheses' predictions to get our final prediction which we use, e.g. for decision making. See Deconstructing Bayesian Inference for a more in-depth account of Bayesian epistemology.

Formally, suppose we have a joint distribution $P(x,z)$ as before, except that $P$ is not parametrized. We take $\Z$ to be our hypothesis space and $\X$ to be the data space. We also suppose that we have partial data, e.g. we observe $x_{\leq n} = (x_1,\dots,x_n) \in \X_1\pr\dots\pr\X_n$, which is a subset of all the dimensions of the data space $\X$. I didn’t specify that we observe each dimension in the order of its indexing, but for convenience let’s re-index so that that is the case.

Let $P(z)$ be the belief weight we assign for each $z\in\Z$. Call $P(z)$ our prior about $z$. Upon observing $x_{\leq n}$, our beliefs change, where $P(z\mid x_{\leq n})$ is now the weight we assign to $z$, called the posterior. We are interested in our average prediction distribution over the unobserved data, $x_{>n} = (x_{n+1},x_{n+2},\dots) \in \X_{n+1}\pr\X_{n+2}\pr\dots$, given the observed data $x_{\leq n}$,

P(x_{>n}\mid x_{\leq n}) &= \int_{\Z}P(x_{>n}\mid x_{\leq n},z)P(z\mid x_{\leq n})\ \d z \\

for all $x_{>n}\in\X_{>n}$. We can see that the posterior $P(z\mid x_{\leq n})$ is involved in this calculation.

When the posterior is intractable, we can perform VI (now restyled as VB), by defining an approximate posterior $q_\p(z)$ parametrized by $\p$. Like before, let

$$\M(x_{\leq n}) = \log\par{1/P(x_{\leq n})}$$

$$\Er(\p;\ x_{\leq n}) = \kl{q_\p(z)}{P(z\mid x_{\leq n})}$$


$$\L(\p;\ x_{\leq n}) = \M(x_{\leq n}) + \Er(\p;\ x_{\leq n})$$

Then we want to find the infinite vector $(\p^*_{x_{\leq n}})_{x_{\leq n} \in \X_{\leq n}}$ (equivalently represented as the function $\p^*:\X_{\leq n} \to \P$) which solves the minimization problem, $\fa x_{\leq n},\ \min_{\p_{x_{\leq n}}}\L(\p_{x_{\leq n}};\ x_{\leq n})$.

Question: Supposing our prediction distribution $P(x_{>n}\mid x_{\leq n})$ is intractable, can we obtain a variational approximation of this probability?

In machine learning, model parameters are hypotheses. Bayesian machine learning often seeks to convert a parametrized model like $p_\t(x,z)$ into a Bayesian model by putting a prior on $\T$.

Let $P(x,z,\t) = p_\t(x,z)P(\t)$ where $P(\t)$ is the prior probability of $\t$. Instead of finding a single $\t^*$ which maximizes $p_\t(x)$, we are interested in the posterior $P(\t \mid x)$ and our average prediction of the latent variable, $P(z\mid x)$.

This is the same as what I outlined above, replacing $x_{\leq n} \mapsto x$, $z\mapsto\t$ and $x_{>n}\mapsto z$.


& \L(\t,\p;\ x) \\
=\ & \Er(\t,\p;\ x) + \M(\t;\ x) \\
=\ & \kl{q_\p(z)}{p_\t(z\mid x)} + \log \par{1/p_\t(x)} \\
=\ & \int_\Z q_\p(z) \log\par{\frac{q_\p(z)}{p_\t(z\mid x)}}\ \d z + \int_\Z q_\p(z)\log \par{\frac{1}{p_\t(x)}}\ \d z \\
=\ & \int_\Z q_\p(z) \log\par{\frac{q_\p(z)}{p_\t(x\mid z)p_\t(z)/p_\t(x)}\cdot\frac{1}{p_\t(x)}}\ \d z \\
=\ & \int_\Z q_\p(z) \log\par{\frac{q_\p(z)}{p_\t(z)}\cdot\frac{1}{p_\t(x\mid z)}}\ \d z \\
=\ & \int_\Z q_\p(z) \log\par{\frac{q_\p(z)}{p_\t(z)}}\ \d z + \int_\Z q_\p(z) \log\par{\frac{1}{p_\t(x\mid z)}}\ \d z \\
=\ & \kl{q_\p(z)}{p_\t(z)} + \E_{z\sim q_\p(z)}\left[\log\par{1/p_\t(x \mid z)}\right] \,.


Shannon vs Universal Compression

Ideal Gas Entropy Derivation