A Primer on Optimal Transport #NIPS2017
I wrote these notes in December 2017 after attending my first NeurIPS and never published them. Eight years later, I’m hitting publish with no edits — just this note as context. Reading them back is strange. The five takeaways from that week — Bayesian deep learning, fairness and bias, the need for theory, deep RL, GANs — all became defining research arcs of the decade that followed. The Rahimi & Recht “alchemy” talk, which was controversial at the time, looks prophetic now. And the throwaway concern about “generating realistic fake videos with geopolitical consequences” landed harder than I think anyone in that room expected.
The corporate circus section is its own kind of time capsule: an Intel Flo Rida concert, Nvidia handing out $3k GPUs to the audience, and an invite-only Tesla party with Elon Musk and Andrej Karpathy. A different era.
What strikes me most, though, is how optimistic and open everything felt. The field was moving fast but still felt legible — you could go to one conference and get your arms around the major themes. That’s long gone. Anyway — notes from the before-times, published from the future.

A Primer on Optimal Transport
Marco Cuturi | Justin M Solomon
What is optimal transport? The natural geometry for probability measures — it puts a distance on the space of distributions, enabling comparisons of “bags of features.” Given a model distribution $\mu_\theta$ and a data distribution $\nu$, OT gives us a notion of divergence to measure how well we’re doing.
Outline: intro → algorithms → apps ($W$ as a loss) → apps ($W$ for estimation). Slides at optimaltransport.github.io.
1. The Monge Problem
Monge (1781): move a pile of dirt into a hole (with shovels).
- $\mu(x)$: height of pile at $x$
- $T(x)$: destination for point $x$ — a map from $\mu$ to $\nu$
- $d(x, T(x))$: distance traveled
- $\mu(x)\,d(x, T(x))$: work done at $x$
$T$ must satisfy the pushforward constraint $T_\#\mu = \nu$, meaning $\mu(T^{-1}(B)) = \nu(B)$ for preimages $T^{-1}(B)$ mapping into any measurable set $B$.
The Monge problem: find the map $T$ that minimizes total work:
$$\min_{T \,:\, T_\#\mu = \nu} \int d(x, T(x))\, \mathrm{d}\mu(x)$$
Caveat: an optimal Monge map does not always exist (e.g., when $\mu$ is a Dirac mass and $\nu$ is not).
1a. The Kantorovich Relaxation
Kantorovich’s insight: instead of a deterministic map, allow measure couplings (joint distributions) — i.e., $\pi \in \Pi(\mu, \nu)$, the set of joint measures with marginals $\mu$ and $\nu$. This is just a linear program.
Primal (Kantorovich):
$$\min_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, \mathrm{d}\pi(x, y)$$
Dual (potential functions):
$$\max_{\phi(x) + \psi(y) \,\le\, c(x, y)} \int \phi\, \mathrm{d}\mu + \int \psi\, \mathrm{d}\nu$$
The dual is elegant: $\phi$ and $\psi$ are the potential functions, and the dual tells you which points are most “expensive” to transport.
Proposition: for well-behaved cost functions, if $\mu$ has a density then an optimal Monge map between $\mu$ and $\nu$ exists.
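For discrete measures the Kantorovich primal is literally a small LP and can be handed to any solver. A minimal sketch with `scipy.optimize.linprog` (the supports, weights, and cost here are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Two discrete measures on the real line (weights sum to 1).
x = np.array([0.0, 1.0, 2.0])   # support of mu
y = np.array([0.5, 2.5])        # support of nu
a = np.array([0.5, 0.3, 0.2])   # weights of mu
b = np.array([0.6, 0.4])        # weights of nu

# Ground cost C[i, j] = |x_i - y_j|.
C = np.abs(x[:, None] - y[None, :])

# Flatten the coupling pi (3x2) into a vector; marginal constraints:
# sum_j pi[i, j] = a[i] and sum_i pi[i, j] = b[j].
n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums = a
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums = b

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
W1 = res.fun                           # optimal transport cost
```

With cost $|x - y|$ this optimum equals the 1-Wasserstein distance between the two measures.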
1b. $p$-Wasserstein Distance
The $p$-Wasserstein distance between probability measures $\mu$ and $\nu$:
$$W_p(\mu, \nu) = \left( \min_{\pi \in \Pi(\mu, \nu)} \int d(x, y)^p\, \mathrm{d}\pi(x, y) \right)^{1/p}$$
This is a true metric on the space of probability measures. The geometry it induces is very different from information-theoretic metrics like KL divergence. McCann (1995) showed it gives rise to displacement interpolation — geodesics in the space of measures. (Solomon ‘15 has nice applications.)
2. How to Compute OT
Four cases:
- discrete → discrete
- discrete → continuous
- continuous → continuous
- continuous → discrete (open)
Cases 2–3 are largely “up for grabs.” Easy special cases:
- Univariate: compute CDFs and quantile functions. $W_p$ has a closed form:
  $$W_p(\mu, \nu)^p = \int_0^1 \big| F_\mu^{-1}(t) - F_\nu^{-1}(t) \big|^p\, \mathrm{d}t$$
- Gaussians: closed form; the optimal map $T$ is affine. For $W_2$: $W_2^2(\mathcal{N}(m_1, \Sigma_1), \mathcal{N}(m_2, \Sigma_2)) = \|m_1 - m_2\|^2 + \operatorname{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\big)$.
- Dirac masses: $W_p(\delta_x, \delta_y) = d(x, y)$ — Wasserstein distance between point masses equals the ground distance.
- Equal number of points: reduces to the Monge problem (an assignment problem).
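The equal-points case can be sketched with SciPy’s assignment solver; the coupling is a permutation, which is exactly a Monge map (point clouds here are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # 5 source points in R^2
Y = rng.normal(size=(5, 2)) + 3.0    # 5 target points, shifted

# Squared-Euclidean cost between every source/target pair.
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# Optimal assignment = optimal Monge map for uniform weights.
rows, cols = linear_sum_assignment(C)
T = dict(zip(rows, cols))            # source i -> target T[i]
W2_sq = C[rows, cols].mean()         # squared 2-Wasserstein cost
```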
Complexity of the LP: $O(n^3 \log n)$ via min-cost flow. Ouch.
Entropic Regularization and Sinkhorn
Optimal solutions to the LP are vertices of a polytope — unstable, non-unique, and non-differentiable. We want something faster, scalable, and differentiable.
Entropic regularization (Shannon entropy $H(\pi) = -\sum_{ij} \pi_{ij} (\log \pi_{ij} - 1)$):
$$\min_{\pi \in \Pi(\mu, \nu)} \langle C, \pi \rangle - \varepsilon H(\pi)$$
As $\varepsilon \to \infty$, the solution approaches the independent coupling $\mu \otimes \nu$; as $\varepsilon \to 0$, it recovers the Monge solution. The regularization makes the problem strictly convex and smooth. [Wilson ‘62]
This can be solved with simple Lagrangian duality — leading to Sinkhorn’s algorithm: alternately rescale rows and columns of the kernel matrix $K = e^{-C/\varepsilon}$ to match the marginals $\mu$ and $\nu$.
Sinkhorn = block coordinate ascent on the dual. [Altschuler et al. ‘17]
- Convergence: linear in general; on gridded spaces, each iteration reduces to convolutions, which are fast.
- Sinkhorn interpolates between hard OT ($\varepsilon \to 0$) and MMD ($\varepsilon \to \infty$; the kernel two-sample test).
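The alternating row/column rescaling is a few lines of NumPy. A minimal sketch ($\varepsilon$, iteration count, and the toy measures are arbitrary choices; no log-domain stabilization, which real implementations need for small $\varepsilon$):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=500):
    """Alternately rescale rows and columns of the Gibbs kernel
    K = exp(-C / eps) until the coupling matches marginals a and b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)            # fix column marginals
        u = a / (K @ v)              # fix row marginals
    P = u[:, None] * K * v[None, :]  # regularized coupling
    return P, float((P * C).sum())   # coupling and its transport cost

# Toy example: two uniform measures on two points each.
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P, cost = sinkhorn(a, b, C, eps=0.05)  # cost ~ 0: mass stays put
```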
Sample complexity caveat [Hashimoto ‘16, Bonneel ‘16, Shalit ‘16]: error in the empirical estimate of $W$ decreases very slowly in the number of samples $n$ — on the order of $n^{-1/d}$ in dimension $d$ — bad sample complexity. The Wasserstein LP is not well-suited for high-dimensional data directly.
3. Applications
Retrieval: [Kusner ‘15] Word Mover’s Distance — document similarity via OT over word embeddings.
Barycenters: averaging measures under Euclidean vs. Wasserstein geometry gives very different results. The Wasserstein barycenter of distributions $\mu_1, \dots, \mu_N$ with weights $\lambda_1, \dots, \lambda_N$:
$$\min_\mu \sum_{i=1}^N \lambda_i\, W_2^2(\mu, \mu_i)$$
Averaging histograms is an LP; or use primal descent on the entropically regularized objective [Cuturi ‘14]. Application: brain imaging, finding smooth interpolations between distributions.
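The difference is easiest to see in 1-D, where the $W_2$ barycenter averages quantile functions; for equal-size empirical samples, that means averaging sorted samples (a toy sketch, not the general barycenter algorithm — distributions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
s1 = np.sort(rng.normal(loc=-2.0, size=1000))  # samples from N(-2, 1)
s2 = np.sort(rng.normal(loc=+2.0, size=1000))  # samples from N(+2, 1)

# Averaging sorted samples = averaging quantile functions,
# which is the 1-D W2 barycenter with equal weights (1/2, 1/2).
bary = 0.5 * s1 + 0.5 * s2
# The Euclidean average of the two densities is bimodal; the
# Wasserstein barycenter is a single bump centered near 0.
```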
Wasserstein Posterior (WASP): aggregate distributed posteriors using Wasserstein barycenters [Srivastava ‘15].
Wasserstein Propagation [Solomon ‘14]: semi-supervised learning on graphs — propagate label distributions via OT. Could fix label noise or handle missing data.
Dictionary learning / topic models [Rolet ‘16]: represent documents as mixtures of dictionary elements under a Wasserstein loss.
Wasserstein PCA: generalized principal geodesics in the space of measures (negative curvature space — worth investigating further).
Distributionally robust optimization [Esfahani ‘17]: learning with Wasserstein ambiguity — robust to perturbations of the training distribution (minimax formulation).
Domain adaptation:
- Estimate transport map from source to target domain
- Transport labeled source samples to target domain
- Train classifier on transported samples
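The three steps above can be sketched with a plain Sinkhorn coupling followed by a barycentric projection of the source samples (data, $\varepsilon$, and iteration counts here are illustrative, and practical methods add class-aware regularization):

```python
import numpy as np

rng = np.random.default_rng(2)
Xs = rng.normal(size=(20, 2))          # labeled source samples
Xt = rng.normal(size=(20, 2)) + 2.0    # unlabeled target samples (shifted domain)
a = np.full(20, 1 / 20)                # uniform source weights
b = np.full(20, 1 / 20)                # uniform target weights

# Step 1: estimate an entropic OT coupling between the domains.
eps = 1.0
C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
K = np.exp(-C / eps)
u = np.ones(20)
for _ in range(200):                   # plain Sinkhorn iterations
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]

# Step 2: barycentric projection — each source point moves to the
# P-weighted average of target points.
Xs_mapped = (P @ Xt) / P.sum(axis=1, keepdims=True)
# Step 3: train any off-the-shelf classifier on (Xs_mapped, source labels).
```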
Generative models: density fitting via maximum likelihood is just minimizing $\mathrm{KL}(\nu_{\text{data}} \,\|\, \mu_\theta)$. Instead, use a low-dimensional latent space with a pushforward generator $g_{\theta\#}\zeta$:
$$\min_\theta W\big(g_{\theta\#}\zeta,\ \nu_{\text{data}}\big)$$
This is the Wasserstein GAN formulation [Arjovsky et al. ‘17] — use $W_1$ as the loss between data and model rather than JS divergence.