Deep Probabilistic Modelling with Gaussian Processes #NIPS2017
I wrote these notes in December 2017 after attending my first NeurIPS and never published them. Eight years later, I’m hitting publish with no edits — just this note as context. Reading them back is strange. The five takeaways from that week — Bayesian deep learning, fairness and bias, the need for theory, deep RL, GANs — all became defining research arcs of the decade that followed. The Rahimi & Recht “alchemy” talk, which was controversial at the time, looks prophetic now. And the throwaway concern about “generating realistic fake videos with geopolitical consequences” landed harder than I think anyone in that room expected.
The corporate circus section is its own kind of time capsule: an Intel Flo Rida concert, Nvidia handing out $3k GPUs to the audience, and an invite-only Tesla party with Elon Musk and Andrej Karpathy. A different era.
What strikes me most, though, is how optimistic and open everything felt. The field was moving fast but still felt legible — you could go to one conference and get your arms around the major themes. That’s long gone. Anyway — notes from the before-times, published from the future.

Deep Probabilistic Modelling with Gaussian Processes
Neil D. Lawrence
ML = data + models → prediction. But predictions alone aren’t enough — we need to make decisions. To combine data and model we need: (1) a prediction function, and (2) an objective function. Sources of uncertainty: scarcity of training data, mismatch of prediction functions, uncertainty in the objective/cost function.
Following MacKay (1992) and Neal (1994): take a probabilistic approach.
1. Neural Networks as Probabilistic Models
A neural network computes:

$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{V}\mathbf{x}),$$

where $\boldsymbol{\phi}(\cdot)$ is a nonlinear activation. This is linear in the parameters $\mathbf{w}$ but nonlinear in the inputs $\mathbf{x}$: adaptive basis functions. The basis functions are fixed for a given analysis; in ML we optimize both $\mathbf{w}$ and $\mathbf{V}$.
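As a concrete sketch of the prediction function (the dimensions and the choice of `tanh` as the activation are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

d, h, n = 2, 5, 4                  # input dim, hidden units, data points (arbitrary)
V = rng.standard_normal((h, d))    # input-to-hidden weights: basis-function parameters
w = rng.standard_normal(h)         # hidden-to-output weights

def phi(X, V):
    """Adaptive basis functions: nonlinear activation of linear projections."""
    return np.tanh(X @ V.T)        # shape (n, h)

X = rng.standard_normal((n, d))
f = phi(X, V) @ w                  # f(x) = w^T phi(V x): linear in w, nonlinear in x
```

Note that doubling `w` doubles `f` exactly (linearity in the parameters), while no such relation holds for `V` (nonlinearity in the inputs).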
Probabilistic inference:
- $\mathbf{y}$ ← data
- $p(\mathbf{y}, \mathbf{y}^*)$ ← model (joint distribution over world)
- $p(\mathbf{y}^* \mid \mathbf{y})$ ← prediction (posterior)

The goal: $p(y^* \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$, the predictive distribution at a new point $\mathbf{x}^*$.
The likelihood of a data point: $p(y_i \mid \mathbf{x}_i)$. Under iid noise (the iid assumption is about the noise, not the underlying function):

$$p(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i).$$
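A small numerical sketch of the factorization (Gaussian noise and the particular data are my choices for illustration): under iid noise the joint log-likelihood is just a sum of per-point log-densities.

```python
import numpy as np

def log_lik(y, f, sigma2):
    """Per-point Gaussian log-densities log N(y_i | f_i, sigma2)."""
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - f) ** 2 / sigma2

rng = np.random.default_rng(6)
f = np.sin(np.linspace(0, 3, 10))          # underlying function values
y = f + 0.1 * rng.standard_normal(10)      # noisy observations

# iid noise => the joint likelihood factorizes into a product,
# so the log-likelihood is a sum over data points.
per_point = log_lik(y, f, sigma2=0.01)
total = np.sum(per_point)
```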
Commonly Gaussian likelihood; MLE for supervised learning. With priors over latents, you get unsupervised learning.
Graphical models represent joint distributions through conditional dependencies (e.g., Markov chains). Performing inference is easy to write down but computationally challenging — high-dimensional integrals.
2. From Neural Networks to Gaussian Processes
Fix $\mathbf{V}$. Place a Gaussian prior over $\mathbf{w}$:

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha \mathbf{I}).$$

Since sums and scalings of Gaussians are Gaussian, marginalizing out $\mathbf{w}$ gives:

$$\mathbf{f} = \boldsymbol{\Phi}\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top),$$

so $\mathbf{f}$ is distributed with zero mean and covariance $\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$, where $\boldsymbol{\Phi}$ is the design matrix of activations, $\Phi_{ij} = \phi(\mathbf{v}_j^\top \mathbf{x}_i)$. A neural network with a Gaussian prior over its output weights is already a Gaussian process, but a degenerate one.
Degeneracy: the rank of $\mathbf{K} = \alpha \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ is at most $h$ (the number of hidden units). Once $n > h$, the covariance matrix is not full rank: $\operatorname{rank}(\mathbf{K}) \le h < n$. The model can't respond to new data as it comes in; it's parametric.
Neal's insight (1994): take $h \to \infty$. Sample infinitely many hidden units in the kernel function. The prior doesn't need to be Gaussian. Scale the output variance down as $h$ increases (e.g., $\alpha \propto 1/h$). You get a non-degenerate Gaussian process.
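The degeneracy is easy to see numerically. A sketch (random `tanh` features and the particular sizes are illustrative assumptions): with $h$ hidden units and $n > h$ data points, the implied covariance has rank at most $h$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 50, 10                    # n data points, h hidden units (h < n)
alpha = 1.0 / h                  # scale output variance down with width (Neal's trick)

X = rng.uniform(-3, 3, size=(n, 1))
V = rng.standard_normal((h, 1))
Phi = np.tanh(X @ V.T)           # design matrix of activations, shape (n, h)

K = alpha * Phi @ Phi.T          # implied GP covariance over f at the inputs X
rank = np.linalg.matrix_rank(K)  # at most h, even though K is n x n
```

With `n = 50` and `h = 10`, `rank` cannot exceed 10: the 50-dimensional Gaussian over function values is confined to a 10-dimensional subspace, which is the precise sense in which the finite network is parametric.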
3. Gaussian Processes
A GP is a distribution over functions: any finite collection of function values is jointly Gaussian. Fully specified by:
- Mean function: $m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$
- Covariance (kernel) function: $k(\mathbf{x}, \mathbf{x}') = \operatorname{cov}(f(\mathbf{x}), f(\mathbf{x}'))$

The kernel matrix: $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
Posterior: given observations $\mathbf{X}$, $\mathbf{y}$ with noise variance $\sigma^2$, the posterior at a test point $\mathbf{x}^*$ is Gaussian with

$$\mu(\mathbf{x}^*) = \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}, \qquad v(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_*,$$

where $\mathbf{k}_* = \big(k(\mathbf{x}^*, \mathbf{x}_1), \dots, k(\mathbf{x}^*, \mathbf{x}_n)\big)^\top$.
GPs let you analytically compute the posterior mean and variance at all points. The exponentiated quadratic (RBF) kernel gives infinite smoothness — not always desirable (Brownian motion is also a GP, with very different smoothness properties).
Sparse GPs: full GP inference is $O(n^3)$ in time and $O(n^2)$ in storage (due to the matrix inversion $(\mathbf{K} + \sigma^2\mathbf{I})^{-1}$). In practice, use a sparse GP with $m \ll n$ inducing variables to get a low-rank approximation of the full covariance, bringing the cost down to $O(nm^2)$.
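One simple instance of the inducing-variable idea is a Nyström-style low-rank approximation of the kernel matrix (the grid of inducing inputs and the sizes below are illustrative assumptions, and real sparse GP methods go further by learning the inducing inputs variationally):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(5)
n, m = 200, 15                        # n data points, m inducing points (m << n)
X = rng.uniform(-3, 3, (n, 1))
Z = np.linspace(-3, 3, m)[:, None]    # inducing inputs (here: a fixed grid)

Kmm = rbf(Z, Z) + 1e-8 * np.eye(m)    # m x m kernel on inducing points (+ jitter)
Knm = rbf(X, Z)                       # n x m cross-covariances
K_nystrom = Knm @ np.linalg.solve(Kmm, Knm.T)   # rank-m approximation of the n x n kernel
```

The $n \times n$ matrix is never inverted directly; everything routes through the $m \times m$ block, which is where the $O(nm^2)$ cost comes from.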
4. Deep Neural Networks and Bottleneck Layers
A weight matrix between two 1000-unit layers has $10^6$ parameters, making it prone to overfitting. One fix: parametrize $\mathbf{W}$ via its SVD, $\mathbf{W} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, to create bottleneck layers. Stacking neural networks gives a composite function.
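A sketch of the parameter saving (the bottleneck width `k = 50` is an arbitrary illustrative choice): truncating the SVD at rank $k$ replaces one $1000 \times 1000$ matrix with two skinny factors.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((1000, 1000))      # full matrix: 10^6 parameters

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 50                                     # bottleneck width (illustrative)
W_k = (U[:, :k] * s[:k]) @ Vt[:k]          # rank-k factorization of W

params_full = W.size                       # 1,000,000
params_bottleneck = 2 * 1000 * k + k       # two 1000 x k factors plus k scales
```

The layer then computes a 1000-dimensional output through a 50-dimensional intermediate representation, roughly a tenfold reduction in parameters here.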
If you want to eliminate NN parameters entirely: replace each layer with a GP and integrate them out. Taking each layer to infinitely many units gives a vector-valued GP. Bottleneck layers are critical in this construction.
5. Deep Gaussian Processes
A deep GP is a composition of GPs:

$$\mathbf{y} = \mathbf{f}_L(\mathbf{f}_{L-1}(\cdots \mathbf{f}_1(\mathbf{x}) \cdots)),$$

where each $\mathbf{f}_\ell$ is a GP. This is equivalent to a Markov chain under the Markov condition:

$$p(\mathbf{y} \mid \mathbf{x}) = \int p(\mathbf{y} \mid \mathbf{f}_L)\, p(\mathbf{f}_L \mid \mathbf{f}_{L-1}) \cdots p(\mathbf{f}_1 \mid \mathbf{x})\, \mathrm{d}\mathbf{f}_1 \cdots \mathrm{d}\mathbf{f}_L.$$
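Sampling from the composition makes the construction concrete: draw a sample path from one GP, then use its values as the inputs to the next layer's kernel. A minimal sketch, assuming an RBF kernel, 1-D inputs, and three layers (all illustrative choices):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 200)

h = x
for layer in range(3):                    # three stacked GP layers
    K = rbf(h, h) + 1e-6 * np.eye(len(h)) # kernel at current inputs (+ jitter)
    L = np.linalg.cholesky(K)
    h = L @ rng.standard_normal(len(h))   # one sample from GP(0, K), fed onward
```

Each layer warps the input space before the next kernel is evaluated, which is how the composition produces samples far less smooth and stationary than any single RBF layer.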
Why go deep?
- GPs give priors over functions
- Derivatives of a GP are a GP (when they exist)
- Some kernels are universal approximators
- Depth enables abstraction of features and handles non-Gaussian derivative distributions
Caveat: Gaussian derivatives can be problematic — many functions (jump functions, heavy-tailed) don’t have Gaussian derivatives. Depth helps encode these via process composition.
Difficulties:
- Propagating probability distributions through nonlinearities
- Normalization of the distribution becomes intractable
Solution: use a variational approach to stack GP models [Damianou & Lawrence, 2013]. As depth increases, the derivative distribution becomes heavy-tailed [Duvenaud et al., 2014] — which is actually desirable for modeling complex functions.
Deep GPs handle heteroskedasticity well (e.g., Olympic marathon running times where length scales change over time). Can be extended with a shared latent variable model (LVM) for multi-output settings.