Gaussian Processes

March 30, 2023

[This is a note, which is the seed of an idea and something I’ve written quickly, as opposed to articles that I’ve optimized for readability and transmission of ideas.]

$$
\newcommand{\0}{\mathrm{false}}
\newcommand{\1}{\mathrm{true}}
\newcommand{\mb}{\mathbb}
\newcommand{\mc}{\mathcal}
\newcommand{\mf}{\mathfrak}
\newcommand{\and}{\wedge}
\newcommand{\or}{\vee}
\newcommand{\es}{\emptyset}
\newcommand{\a}{\alpha}
\newcommand{\t}{\theta}
\newcommand{\T}{\Theta}
\newcommand{\o}{\omega}
\newcommand{\O}{\Omega}
\newcommand{\x}{\xi}
\newcommand{\z}{\zeta}
\newcommand{\k}{\kappa}
\newcommand{\fa}{\forall}
\newcommand{\ex}{\exists}
\newcommand{\X}{\mc{X}}
\newcommand{\Y}{\mc{Y}}
\newcommand{\Z}{\mc{Z}}
\newcommand{\P}{\Phi}
\newcommand{\y}{\psi}
\newcommand{\p}{\phi}
\newcommand{\l}{\lambda}
\newcommand{\s}{\sigma}
\newcommand{\pr}{\times}
\newcommand{\B}{\mb{B}}
\newcommand{\N}{\mb{N}}
\newcommand{\R}{\mb{R}}
\newcommand{\E}{\mb{E}}
\newcommand{\e}{\varepsilon}
\newcommand{\d}{\mathrm{d}}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\par}[1]{\left(#1\right)}
\newcommand{\tup}{\par}
\newcommand{\brak}[1]{\left[#1\right]}
\newcommand{\vtup}[1]{\left\langle#1\right\rangle}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
$$

Gaussian processes have been on my backlog of ML topics to understand better. I refreshed myself on the basics by reading Bishop, section 6.4, and Rasmussen & Williams. Based on my reading, here is a brief account of what Gaussian processes are.

Two Views

There are two ways to understand what a GP is, respectively called the weight-space view and the function-space view.

Weight-Space

In the weight-space view, we have a parametrized function space $f(x;\ \t)$ where $x$ is the input and $\t$ is/are the parameter(s) (e.g. both real vectors). We put a prior $p(\t)$ on the parameters and define $p(y \mid x, \t) = \mc{N}(y \mid f(x;\ \t), \s I)$, a multivariate Gaussian centered at $f(x;\ \t)$ with diagonal covariance and variance $\s$ in each dimension of $y$. This induces a marginal distribution $p(y \mid x)$, which is what we get if we first draw $\hat{\t}\sim p(\t)$ and then draw $\hat{y} \sim p(y \mid x, \hat{\t})$. This turns out to be Gaussian when $f(x;\ \t)=\t\cdot\Phi(x)$ is a linear function space (is this true for arbitrary parametrized function spaces with a Gaussian parameter prior and Gaussian output noise?), where $\Phi(x)$ defines the features of $x$, i.e. is some fixed function from $x$-space to $\t$-space, so that $\t\cdot\Phi(x)$ is the dot product of two vectors.
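
To make this concrete, here is a minimal numpy sketch of sampling from the weight-space prior, assuming a 1-D input, an illustrative sinusoidal feature map, a standard-normal prior on $\t$, and Gaussian output noise with variance $\s$. The names (`phi`, `sample_function`, `sigma`) are my own, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1       # output-noise variance (assumed)
n_features = 20   # dimensionality of theta and Phi(x)

# Fixed feature map Phi: x-space -> theta-space (here, random sinusoids).
freqs = rng.normal(size=n_features)
phases = rng.uniform(0, 2 * np.pi, size=n_features)

def phi(x):
    """Feature vector Phi(x) for each scalar input in x."""
    x = np.atleast_1d(x)
    return np.cos(np.outer(x, freqs) + phases)   # shape (len(x), n_features)

def sample_function(xs):
    """Draw theta ~ p(theta), then y ~ N(f(x; theta), sigma) at each x."""
    theta = rng.normal(size=n_features)           # prior draw of the parameters
    f = phi(xs) @ theta                           # f(x; theta) = theta . Phi(x)
    return f + rng.normal(scale=np.sqrt(sigma), size=f.shape)

xs = np.linspace(-3, 3, 100)
ys = sample_function(xs)   # one random function (plus noise) from the prior
```

Since $\t$ is drawn once per call, each call to `sample_function` gives one joint draw of the outputs at all the inputs in `xs`.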

Importantly, the outputs $y_1$ and $y_2$, given inputs $x_1$ and $x_2$ respectively, are not independent under the joint distribution. We have

$$
p(y_1,y_2 \mid x_1,x_2) = \int p(y_1 \mid x_1, \t)p(y_2 \mid x_2, \t)p(\t)\ \d \t
$$

where integrating over $\t$ couples the individual $p(y \mid x, \t)$ distributions, which are independent only conditionally on $\t$. This generalizes to a vector $X=(x_1,\dots,x_n)$ of inputs in the same way.
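
To make this concrete in the linear case: assuming scalar outputs, a zero-mean Gaussian prior $\t\sim\mc{N}(0,\Sigma_\t)$, and independent output noise of variance $\s$, the $y_i$ are jointly Gaussian with zero mean and

$$
\text{Cov}[y_i, y_j] = \Phi(x_i)^\top \Sigma_\t\, \Phi(x_j) + \s\,\delta_{ij}\,,
$$

so the dependence between outputs is carried entirely by the feature map; this quantity reappears as the kernel in the function-space view below.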

This is how “learning” takes place in a GP: by conditioning the prediction distribution $p(y'\mid x', D)$ on a dataset of observed input-output pairs, $D=((x_1,y_1),\dots,(x_n,y_n))$.

$$
p(y'\mid x', D) = \frac{p(y',y_1,\dots,y_n \mid x',x_1,\dots,x_n)}{p(y_1,\dots,y_n \mid x',x_1,\dots,x_n)}
$$

where $p(y_1,\dots,y_n \mid x',x_1,\dots,x_n)=p(y_1,\dots,y_n \mid x_1,\dots,x_n)$, since the observed outputs do not depend on the new input $x'$.
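
In practice this conditioning is computed with the standard formulas for conditioning a joint Gaussian. Here is a sketch, assuming zero-mean jointly Gaussian outputs with a known covariance function `k` plus output-noise variance `sigma` on the diagonal (e.g. the weight-space covariance above); `gp_predict` and the kernel in the usage line are placeholders of mine.

```python
import numpy as np

def gp_predict(k, sigma, X_train, y_train, x_new):
    """Mean and variance of p(y' | x', D) by conditioning the joint Gaussian."""
    n = len(X_train)
    K = np.array([[k(a, b) for b in X_train] for a in X_train]) + sigma * np.eye(n)
    k_star = np.array([k(x_new, b) for b in X_train])    # Cov(y', Y)
    k_ss = k(x_new, x_new) + sigma                        # Var(y'), including noise
    mean = k_star @ np.linalg.solve(K, y_train)           # E[y' | D]
    var = k_ss - k_star @ np.linalg.solve(K, k_star)      # Var[y' | D]
    return mean, var

# Usage with a squared-exponential covariance (the example kernel below):
mean, var = gp_predict(
    k=lambda a, b: np.exp(-0.5 * (a - b) ** 2),
    sigma=0.1,
    X_train=np.array([-1.0, 0.0, 1.0]),
    y_train=np.array([0.2, 0.9, 0.1]),
    x_new=0.5,
)
```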

Function-Space

We define a probability distribution over all functions, say $f:\R^k\to\R^r$, by basically treating such functions as vectors (with an uncountable infinity of dimensions), where each input $x$ labels a dimension and the value of the vector in dimension $x$ is $f(x)$, and then defining an infinite-dimensional Gaussian distribution over these vectors.

This implies that the mean “vector” of the Gaussian is represented as a function, $m(x)$, and the covariance “matrix” of the Gaussian as a function of two inputs, $\k(x,x')$.

E.g.,

$$
\k(x,x') = \exp(-\tfrac{1}{2}\abs{x-x'}^2)
$$

Then we have that for fixed inputs $x,x'$,

$$
\E_f[f(x)] = m(x)
$$

and

$$
\text{Cov}_f[f(x),f(x')] = \E_f[(f(x)-\E_f[f(x)])(f(x')-\E_f[f(x')])] = \k(x,x')\,.
$$

From context, I am guessing that for a single input $x$, the marginal output distribution is

$$
p(y \mid x) = \mc{N}(y \mid m(x), \k(x,x))
$$

where $\k(x,x) = \text{Cov}_f[f(x),f(x)] = \text{Var}_f[f(x)]$.

And likewise, in the multivariate case where $X=(x_1,\dots,x_n)$ and $Y=(y_1,\dots,y_n)$ are vectors of inputs and outputs, $p(Y\mid X) = \mc{N}(Y \mid m(X), \k(X,X))$, where $m(X)=(m(x_1),\dots,m(x_n))$ is $m$ applied element-wise to $X$, and $\k(X,X)_{i,j} = \k(x_i,x_j)$ is $\k$ applied to the Cartesian product $X\times X$, producing a matrix.
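
As a sanity check, here is a small numpy sketch of sampling “functions” from this prior at a finite set of inputs, assuming $m(x)=0$ and the squared-exponential kernel above; the variable names and the jitter term are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(x, x_prime):
    """Squared-exponential covariance k(x, x') = exp(-|x - x'|^2 / 2)."""
    return np.exp(-0.5 * np.abs(x - x_prime) ** 2)

X = np.linspace(-5, 5, 200)              # a finite grid of inputs
m_X = np.zeros_like(X)                   # m(X) = 0 (assumed)
K_XX = kernel(X[:, None], X[None, :])    # the n x n matrix k(X, X)

# Each row is one draw of (f(x_1), ..., f(x_n)); the small jitter keeps the
# covariance numerically positive definite.
samples = rng.multivariate_normal(m_X, K_XX + 1e-8 * np.eye(len(X)), size=3)
```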

Questions

Why use a GP? Why is this useful? Where has it been successfully applied?
Why are GPs theoretically interesting? (Yes, GPs are an elegant way of defining distributions over function spaces, and moreover provide a way to tractably sample functions. But I am really asking what the GP formulation allows us to say or know more broadly. What are the consequences of GPs for our understanding of learning from data?)
Which GPs are universal function approximators?
What makes us think that a GP would generalize well on test inputs?
Is there a corresponding parametrized function for every function-space view with a given kernel?
Are GPs equivalent to Bayesian ML?
What happens in the weight-space view if we don’t put noise on the output of $f(x;\ \t)$?
