18.1. Introduction to Gaussian Processes

In many cases, machine learning amounts to estimating parameters from data. These parameters are often numerous and relatively uninterpretable, such as the weights of a neural network. Gaussian processes, by contrast, provide a mechanism for directly reasoning about the high-level properties of functions that could fit our data. For example, we may have a sense of whether these functions are quickly varying, periodic, involve conditional independencies, or exhibit translation invariance. Gaussian processes enable us to easily incorporate these properties into our model, by directly specifying a Gaussian distribution over the function values that could fit our data.

Let’s get a feel for how Gaussian processes operate, by starting with some examples.

Suppose we observe the following dataset, of regression targets (outputs), \(y\), indexed by inputs, \(x\). As an example, the targets could be changes in carbon dioxide concentrations, and the inputs could be the times at which these targets have been recorded. What are some features of the data? How quickly does it seem to vary? Do we have data points collected at regular intervals, or are there missing inputs? How would you imagine filling in the missing regions, or forecasting up until \(x=25\)?


Fig. 18.1.1 Observed data.

In order to fit the data with a Gaussian process, we start by specifying a prior distribution over what types of functions we might believe to be reasonable. Here we show several sample functions from a Gaussian process. Does this prior look reasonable? Note that here we are not looking for functions that fit our dataset, but instead specifying reasonable high-level properties of the solutions, such as how quickly they vary with inputs. Note that we will see code for reproducing all of the plots in this notebook in the next notebooks on priors and inference.


Fig. 18.1.2 Sample prior functions that we may want to represent with our model.

Once we condition on data, we can use this prior to infer a posterior distribution over functions that could fit the data. Here we show sample posterior functions.


Fig. 18.1.3 Sample posterior functions, once we have observed the data.

We see that each of these functions is entirely consistent with our data, perfectly running through each observation. In order to use these posterior samples to make predictions, we can average the values of every possible sample function from the posterior, to create the curve below, in thick blue. Note that we do not actually have to take an infinite number of samples to compute this expectation; as we will see later, we can compute the expectation in closed form.


Fig. 18.1.4 Posterior samples, alongside posterior mean, which can be used for point predictions, in blue.

We may also want a representation of uncertainty, so we know how confident we should be in our predictions. Intuitively, we should have more uncertainty where there is more variability in the sample posterior functions, as this tells us there are many more possible values the true function could take. This type of uncertainty is called epistemic uncertainty, which is the reducible uncertainty associated with lack of information. As we acquire more data, this type of uncertainty disappears, as there will be increasingly fewer solutions consistent with what we observe. As with the posterior mean, we can compute the posterior variance (the variability of these functions in the posterior) in closed form. The shaded region shows two times the posterior standard deviation on either side of the mean, creating a credible interval that has a 95% probability of containing the true value of the function for any input \(x\).


Fig. 18.1.5 Posterior samples, including 95% credible set.

The plot looks somewhat cleaner if we remove the posterior samples, simply visualizing the data, posterior mean, and 95% credible set. Notice how the uncertainty grows away from the data, a property of epistemic uncertainty.


Fig. 18.1.6 Point predictions and credible set.

The properties of the Gaussian process that we used to fit the data are strongly controlled by what’s called a covariance function, also known as a kernel. The covariance function we used is called the RBF (Radial Basis Function) kernel, which has the form

(18.1.1)\[k_{\textrm{RBF}}(x,x') = \textrm{Cov}(f(x),f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2}||x-x'||^2\right)\]

The hyperparameters of this kernel are interpretable. The amplitude parameter \(a\) controls the vertical scale over which the function is varying, and the length-scale parameter \(\ell\) controls the rate of variation (the wiggliness) of the function. Larger \(a\) means larger function values, and larger \(\ell\) means more slowly varying functions. Let’s see what happens to our sample prior and posterior functions as we vary \(a\) and \(\ell\).
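To make (18.1.1) concrete, here is a minimal NumPy sketch of the RBF kernel; the function name rbf_kernel and the example inputs are our own illustrative choices rather than code from the following notebooks.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    """RBF covariance a^2 * exp(-||x - x'||^2 / (2 * lengthscale^2)) for 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2  # pairwise squared distances
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Covariance between f(0) and f(x') decays with the distance |0 - x'|.
x = np.array([0.0])
x_prime = np.array([0.0, 2.0, 4.0, 10.0])
print(rbf_kernel(x, x_prime, amplitude=1.0, lengthscale=2.0))
# At a distance equal to the length-scale (here 2), the covariance is
# a^2 * exp(-0.5), roughly 0.61 * a^2.
```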

The length-scale has a particularly pronounced effect on the predictions and uncertainty of a GP. At \(||x-x'|| = \ell\), the covariance between a pair of function values is \(a^2\exp(-0.5)\). At distances larger than \(\ell\), the function values become nearly uncorrelated. This means that if we want to make a prediction at a point \(x_*\), then function values with inputs \(x\) such that \(||x-x_*||>\ell\) will not have a strong effect on our predictions.

Let’s see how changing the length-scale affects sample prior and posterior functions, and credible sets. The above fits use a length-scale of \(2\). Let’s now consider \(\ell = 0.1, 0.5, 2, 5, 10\). A length-scale of \(0.1\) is very small relative to the range of the input domain we are considering, \(25\). For example, the values of the function at \(x=5\) and \(x=10\) will have essentially no correlation at such a length-scale. On the other hand, for a length-scale of \(10\), the function values at these inputs will be highly correlated. Note that the vertical scale changes in the following figures.

(Figures: sample prior and posterior functions, with credible sets, for the different length-scales.)

Notice that as the length-scale increases, the ‘wiggliness’ of the functions decreases, and our uncertainty decreases. If the length-scale is small, the uncertainty will quickly increase as we move away from the data, as the data points become less informative about the function values.
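As a rough numerical illustration of this effect, the sketch below draws sample prior functions from a zero-mean GP with the RBF kernel for a few length-scales and summarizes how much the samples jump between neighboring grid points; the grid, seed, and summary statistic are our own choices.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

x = np.linspace(0, 25, 200)
rng = np.random.default_rng(0)
for lengthscale in [0.1, 2.0, 10.0]:
    # Prior covariance over the grid; a small jitter keeps it positive definite.
    K = rbf_kernel(x, x, lengthscale=lengthscale) + 1e-6 * np.eye(len(x))
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    # Larger length-scales give smaller jumps between neighboring grid points.
    avg_step = np.abs(np.diff(samples, axis=1)).mean()
    print(f"lengthscale={lengthscale}: average step between grid points = {avg_step:.3f}")
```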

Now, let’s vary the amplitude parameter, holding the length-scale fixed at \(2\). Note the vertical scale is held fixed for the prior samples, and varies for the posterior samples, so you can clearly see both the increasing scale of the function, and the fits to the data.

(Figures: sample prior and posterior functions, with credible sets, for the different amplitudes.)

We see the amplitude parameter affects the scale of the function, but not the rate of variation. At this point, we also have the sense that the generalization performance of our procedure will depend on having reasonable values for these hyperparameters. Values of \(\ell=2\) and \(a=1\) appeared to provide reasonable fits, while some of the other values did not. Fortunately, there is a robust and automatic way to specify these hyperparameters, using what is called the marginal likelihood, which we will return to in the notebook on inference.
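As a preview of that notebook, and only as a sketch, assuming a zero-mean GP whose covariance matrix \(K\) over the observed targets \(y\) already includes any observation noise, the log marginal likelihood that one maximizes over \(a\) and \(\ell\) is \(\log p(y) = -\tfrac{1}{2} y^{\top} K^{-1} y - \tfrac{1}{2}\log |K| - \tfrac{n}{2}\log 2\pi\). The helper below is our own illustration, not the book’s implementation.

```python
import numpy as np

def log_marginal_likelihood(y, K):
    """log N(y | 0, K); maximizing this over kernel hyperparameters selects a and ell."""
    n = len(y)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return (-0.5 * y @ alpha                             # data-fit term
            - np.log(np.diag(L)).sum()                   # 0.5 * log|K|, a complexity penalty
            - 0.5 * n * np.log(2 * np.pi))
```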

So what is a GP, really? As we described at the start, a GP simply says that any collection of function values \(f(x_1),\dots,f(x_n)\), indexed by any collection of inputs \(x_1,\dots,x_n\), has a joint multivariate Gaussian distribution. The mean vector \(\mu\) of this distribution is given by a mean function, which is typically taken to be a constant or zero. The covariance matrix of this distribution is given by the kernel evaluated at all pairs of the inputs \(x\).

(18.1.2)\[\begin{split}\begin{bmatrix}f(x) \\f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}\sim \mathcal{N}\left(\mu, \begin{bmatrix}k(x,x) & k(x, x_1) & \dots & k(x,x_n) \\ k(x_1,x) & k(x_1,x_1) & \dots & k(x_1,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x) & k(x_n, x_1) & \dots & k(x_n,x_n) \end{bmatrix}\right)\end{split}\]

Equation (18.1.2) specifies a GP prior. We can compute the conditional distribution of \(f(x)\) for any \(x\) given \(f(x_1), \dots, f(x_n)\), the function values we have observed. This conditional distribution is called the posterior, and it is what we use to make predictions.

In particular,

(18.1.3)\[f(x) | f(x_1), \dots, f(x_n) \sim \mathcal{N}(m,s^2)\]

where

(18.1.4)\[m = k(x,x_{1:n}) k(x_{1:n},x_{1:n})^{-1} f(x_{1:n})\]

(18.1.5)\[s^2 = k(x,x) - k(x,x_{1:n})k(x_{1:n},x_{1:n})^{-1}k(x_{1:n},x)\]

where \(k(x,x_{1:n})\) is a \(1 \times n\) vector formed by evaluating \(k(x,x_{i})\) for \(i=1,\dots,n\), and \(k(x_{1:n},x_{1:n})\) is an \(n \times n\) matrix formed by evaluating \(k(x_i,x_j)\) for \(i,j = 1,\dots,n\). \(m\) is what we can use as a point predictor for any \(x\), and \(s^2\) is what we use for uncertainty: if we want to create an interval with a 95% probability that \(f(x)\) is in the interval, we would use \(m \pm 2s\). The predictive means and uncertainties for all the above figures were created using these equations. The observed data points were given by \(f(x_1), \dots, f(x_n)\), and we chose a fine-grained set of \(x\) points at which to make predictions.
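Here is a minimal NumPy sketch of (18.1.4) and (18.1.5); the training inputs and targets below are made-up values for illustration, not the dataset from the figures.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# Hypothetical noise-free observations f(x_1), ..., f(x_n).
x_train = np.array([2.0, 6.0, 9.0, 13.0])
f_train = np.array([0.5, 1.3, -0.2, 0.8])
x_test = np.linspace(0, 25, 200)       # fine-grained points at which to predict

K_xX = rbf_kernel(x_test, x_train)     # k(x, x_{1:n}), one row per test point
K_XX = rbf_kernel(x_train, x_train)    # k(x_{1:n}, x_{1:n})
K_inv = np.linalg.inv(K_XX + 1e-8 * np.eye(len(x_train)))  # jitter for stability

mean = K_xX @ K_inv @ f_train                                # Eq. (18.1.4)
var = (rbf_kernel(x_test, x_test).diagonal()
       - np.einsum('ij,jk,ik->i', K_xX, K_inv, K_xX))        # Eq. (18.1.5)
std = np.sqrt(np.clip(var, 0.0, None))
lower, upper = mean - 2 * std, mean + 2 * std                # 95% credible interval
```

Plotting mean against x_test, together with the band between lower and upper, produces the kind of figure shown above.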

Let’s suppose we observe a single datapoint, \(f(x_1)\), and we want to determine the value of \(f(x)\) at some \(x\). Because \(f(x)\) is described by a Gaussian process, we know the joint distribution over \((f(x), f(x_1))\) is Gaussian:

(18.1.6)\[\begin{split}\begin{bmatrix}f(x) \\f(x_1) \\\end{bmatrix}\sim\mathcal{N}\left(\mu,\begin{bmatrix}k(x,x) & k(x, x_1) \\k(x_1,x) & k(x_1,x_1)\end{bmatrix}\right)\end{split}\]

The off-diagonal expression \(k(x,x_1) = k(x_1,x)\) tells us how correlated the function values will be: how strongly \(f(x)\) will be determined by \(f(x_1)\). We have seen already that if we use a large length-scale, relative to the distance between \(x\) and \(x_1\), \(||x-x_1||\), then the function values will be highly correlated. We can visualize the process of determining \(f(x)\) from \(f(x_1)\) both in the space of functions, and in the joint distribution over \(f(x_1), f(x)\). Let’s initially consider an \(x\) such that \(k(x,x_1) = 0.9\), and \(k(x,x)=1\), meaning that the value of \(f(x)\) is moderately correlated with the value of \(f(x_1)\). In the joint distribution, the contours of constant probability will be relatively narrow ellipses.

Suppose we observe \(f(x_1) = 1.2\). To condition on this value of \(f(x_1)\), we can draw a horizontal line at \(1.2\) on our plot of the density, and see that the value of \(f(x)\) is mostly constrained to \([0.64,1.52]\). We have also drawn this plot in function space, showing the observed point \(f(x_1)\) in orange, and 1 standard deviation of the Gaussian process predictive distribution for \(f(x)\) in blue, about the mean value of \(1.08\).

(Figures: joint density over \(f(x_1), f(x)\) for \(k(x,x_1) = 0.9\), and the corresponding function-space view.)

Now suppose we have a stronger correlation, \(k(x,x_1) = 0.95\). Now the ellipses have narrowed further, and the value of \(f(x)\) is even more strongly determined by \(f(x_1)\). Drawing a horizontal line at \(1.2\), we see the contours for \(f(x)\) support values mostly within \([0.83, 1.45]\). Again, we also show the plot in function space, with one standard deviation about the mean predictive value of \(1.14\).

(Figures: joint density over \(f(x_1), f(x)\) for \(k(x,x_1) = 0.95\), and the corresponding function-space view.)

We see that the posterior mean predictor of our Gaussian process is closer to \(1.2\), because there is now a stronger correlation. We also see that our uncertainty (the error bars) has somewhat decreased. Despite the strong correlation between these function values, our uncertainty is still rightly quite large, because we have only observed a single data point!
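These numbers follow directly from (18.1.4) and (18.1.5) with a single observation and \(k(x,x) = k(x_1,x_1) = 1\); the short check below is our own sketch of that calculation.

```python
import numpy as np

def condition_on_one_point(k_xx1, f_x1, k_xx=1.0, k_x1x1=1.0):
    """Posterior mean and standard deviation of f(x) given one noise-free observation f(x1)."""
    mean = k_xx1 / k_x1x1 * f_x1           # Eq. (18.1.4) with n = 1
    var = k_xx - k_xx1 ** 2 / k_x1x1       # Eq. (18.1.5) with n = 1
    return mean, np.sqrt(var)

for k_xx1 in [0.9, 0.95]:
    m, s = condition_on_one_point(k_xx1, f_x1=1.2)
    print(f"k(x,x1)={k_xx1}: mean={m:.2f}, one-std interval=[{m - s:.2f}, {m + s:.2f}]")
# Prints mean 1.08 with interval [0.64, 1.52], and mean 1.14 with interval [0.83, 1.45],
# matching the values quoted above.
```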

This procedure can give us a posterior on \(f(x)\) for any \(x\), for any number of points we have observed. Suppose we observe \(f(x_1), f(x_2)\). We now visualize the posterior for \(f(x)\) at a particular \(x=x'\) in function space. The exact distribution for \(f(x)\) is given by the above equations. \(f(x)\) is Gaussian distributed, with mean

(18.1.7)\[m = k(x,x_{1:2}) k(x_{1:2},x_{1:2})^{-1} f(x_{1:2})\]

and variance

(18.1.8)\[s^2 = k(x,x) - k(x,x_{1:2})k(x_{1:2},x_{1:2})^{-1}k(x_{1:2},x)\]

In this introductory notebook, we have been considering noise-free observations. As we will see, it is easy to include observation noise. If we assume that the data are generated from a latent noise-free function \(f(x)\) plus iid Gaussian noise \(\epsilon(x) \sim \mathcal{N}(0,\sigma^2)\) with variance \(\sigma^2\), then our covariance function simply becomes \(k(x_i,x_j) \to k(x_i,x_j) + \delta_{ij}\sigma^2\), where \(\delta_{ij} = 1\) if \(i=j\) and \(0\) otherwise.
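In code, this amounts to adding \(\sigma^2\) to the diagonal of the covariance matrix over the observed inputs. The sketch below reuses the hypothetical rbf_kernel and training inputs from the earlier snippets; the noise variance is an assumed value.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / lengthscale ** 2)

x_train = np.array([2.0, 6.0, 9.0, 13.0])
noise_variance = 0.1  # sigma^2, an assumed value for illustration

# k(x_i, x_j) -> k(x_i, x_j) + delta_ij * sigma^2: the Kronecker delta only
# affects the covariance of an observation with itself, i.e. the diagonal.
K_XX_noisy = rbf_kernel(x_train, x_train) + noise_variance * np.eye(len(x_train))
```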

We have already started getting some intuition about how we can use a Gaussian process to specify a prior and posterior over solutions, and how the kernel function affects the properties of these solutions. In the following notebooks, we will precisely show how to specify a Gaussian process prior, introduce and derive various kernel functions, and then go through the mechanics of how to automatically learn kernel hyperparameters, and form a Gaussian process posterior to make predictions. While it takes time and practice to get used to concepts such as a “distribution over functions”, the actual mechanics of finding the GP predictive equations are quite simple, making it easy to gain practice and form an intuitive understanding of these concepts.

18.1.1. Summary

In typical machine learning, we specify a function with some free parameters (such as a neural network and its weights), and we focus on estimating those parameters, which may not be interpretable. With a Gaussian process, we instead reason about distributions over functions directly, which enables us to reason about the high-level properties of the solutions. These properties are controlled by a covariance function (kernel), which often has a few highly interpretable hyperparameters. These hyperparameters include the length-scale, which controls how rapidly the functions vary (how wiggly they are). Another hyperparameter is the amplitude, which controls the vertical scale over which our functions are varying. Representing many different functions that can fit the data, and combining them all together into a predictive distribution, is a distinctive feature of Bayesian methods. Because there is a greater amount of variability between possible solutions far away from the data, our uncertainty intuitively grows as we move away from the data.

A Gaussian process represents a distribution over functions by specifying a multivariate normal (Gaussian) distribution over all possible function values. It is possible to easily manipulate Gaussian distributions to find the distribution of one function value based on the values of any set of other values. In other words, if we observe a set of points, then we can condition on these points and infer a distribution over what the value of the function might look like at any other input. How we model the correlations between these points is determined by the covariance function and is what defines the generalization properties of the Gaussian process. While it takes time to get used to Gaussian processes, they are easy to work with, have many applications, and help us understand and develop other model classes, like neural networks.

18.1.2. Exercises

  1. What is the difference between epistemic uncertainty and observation uncertainty?

  2. Besides rate of variation and amplitude, what other properties of functions might we want to consider, and what would be real-world examples of functions that have those properties?

  3. The RBF covariance function we considered says that covariances (and correlations) between observations decrease with their distance in the input space (times, spatial locations, etc.). Is this a reasonable assumption? Why or why not?

  4. Is a sum of two Gaussian variables Gaussian? Is a product of two Gaussian variables Gaussian? If \((a,b)\) have a joint Gaussian distribution, is \(a \mid b\) (\(a\) given \(b\)) Gaussian? Is \(a\) marginally Gaussian?

  5. Repeat the exercise where we observe a data point at \(f(x_1) = 1.2\), but now suppose we additionally observe \(f(x_2) = 1.4\). Let \(k(x,x_1) = 0.9\), and \(k(x,x_2) = 0.8\). Will we be more or less certain about the value of \(f(x)\), than when we had only observed \(f(x_1)\)? What is the mean and 95% credible set for our value of \(f(x)\) now?

  6. Do you think increasing our estimate of observation noise would increase or decrease our estimate of the length-scale of the ground-truth function?

  7. As we move away from the data, suppose the uncertainty in our predictive distribution increases to a point, then stops increasing. Why might that happen?

Discussions
