Markov Chain Monte Carlo (MCMC)
The objective is to draw samples from a probability distribution. Doing that for an arbitrary distribution is not easy. So, instead of generating a sequence of i.i.d. random variables with the desired distribution, we generate a sequence of random variables whose stationary distribution is the desired distribution. These random variables are not independent but rather are states of a Markov chain. The class of algorithms that do this are Markov Chain Monte Carlo algorithms.
Algorithms:
- Metropolis–Hastings (MH): Propose a new sample, accept/reject based on acceptance ratio to ensure detailed balance.
- Gibbs Sampling: Special case of MH where we sample each variable from its conditional distribution given the others.
- Hamiltonian Monte Carlo (HMC): Uses gradient information and simulated physics to explore the space more efficiently.
- No-U-Turn Sampler (NUTS): Adaptive version of HMC that avoids manual tuning of path length.
1. Uses
1.1. For computing expectations
The objective is to compute:
\begin{align*} \theta = E[h(X)] = \sum_{j=1}^{\infty} h(x_j) P(X = x_j) \end{align*}where \(X\) is a discrete random variable taking values \(x_j, j \geq 1\), and \(h\) is some function of \(X\).
Computing this sum directly is often infeasible because \(X\) can take a very large (or infinite) number of values.
One technique is to sample \(X\) from its probability distribution and then use the samples to compute the sample mean.
\begin{align*} \theta = \lim_{n \to \infty} \sum_{i=1}^n \frac {h(X_i)} n \end{align*}But it is not easy to sample i.i.d. \(X_i\) from an arbitrary distribution.
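As a quick illustration, here is a minimal sketch of this sample-mean estimator in a case where \(X\) is easy to sample directly (a Poisson(3) variable and \(h(x) = x^2\) are purely illustrative choices); MCMC is needed precisely when such direct sampling is not available.

```python
# Minimal sketch: estimate E[h(X)] by a sample mean when X can be sampled directly.
# The Poisson(3) distribution and h(x) = x^2 are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    return x ** 2

n = 100_000
samples = rng.poisson(lam=3.0, size=n)   # i.i.d. draws X_1, ..., X_n
theta_hat = h(samples).mean()            # sample mean approximates E[h(X)]

print(theta_hat)  # close to E[X^2] = Var(X) + E[X]^2 = 3 + 9 = 12
```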
1.2. In Bayesian Inference
Finding the posterior distribution of parameters \(z\) (aka latent variables) given the observed data \(x\) is called Bayesian Inference.
\[ P(z | x) = \frac {P(x, z)} {P(x)} \]
But computing \(P(x) = \int P(x,z) dz\) is usually intractable.
So we instead draw samples from \(P(z|x)\) without computing \(P(x)\) directly. We can evaluate the unnormalized posterior \(P(x,z)\) at any given value of \(z\), and this is sufficient for algorithms like Metropolis–Hastings. Once we have the samples, we can approximate expectations:
\[ \mathbb{E}_p[f(z)] \approx \frac 1 N \sum_{i=1}^N f(z^i) \]
The key idea is to construct a Markov chain whose stationary distribution is the target distribution \(P(z | x)\), run the chain long enough that it is well mixed, and then use the generated samples to compute quantities of interest.
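To make the two ingredients concrete, here is a minimal sketch for a made-up model (a normal likelihood with a standard normal prior on its mean \(z\)); the data values are illustrative assumptions, not from the text.

```python
# Minimal sketch of the two ingredients: an unnormalized posterior we can
# evaluate pointwise, and a sample mean over MCMC draws. The model (x_i ~
# Normal(z, 1) with prior z ~ Normal(0, 1)) and the data are illustrative.
import numpy as np

x_obs = np.array([1.2, 0.7, 1.9, 1.4])  # hypothetical observed data x

def log_unnormalized_posterior(z):
    """log P(x, z) = log P(x | z) + log P(z); P(x) is never needed."""
    log_prior = -0.5 * z ** 2                      # z ~ Normal(0, 1), up to a constant
    log_lik = -0.5 * np.sum((x_obs - z) ** 2)      # x_i ~ Normal(z, 1), up to a constant
    return log_prior + log_lik

# Given draws z_samples from P(z | x) produced by an MCMC sampler (such as the
# Metropolis-Hastings sketch below), a posterior expectation is a sample mean:
#   E_p[f(z)] ~ np.mean(f(z_samples))
```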
2. Metropolis Hastings
The Metropolis–Hastings (MH) algorithm generates a time-reversible Markov chain whose stationary probabilities are:
\begin{align*} \pi(j) = b(j) / B \end{align*}where \(B\) is a normalizing constant and \(b(j)\) is an arbitrary function we supply. For Metropolis–Hastings we don't need to know \(B\).
Algorithm:
- Start at some state \(i\)
- Sample a candidate next state \(j\) with probability \(q(i,j)\), where \(q\) is the transition probability of any other irreducible Markov chain on the same state space.
- Go to state \(j\) with probability \(\alpha(i,j)\) and stay in state \(i\) with probability \(1-\alpha(i,j)\).
This results in a different Markov chain with transition probabilities:
\begin{align*} P_{ij} &= q(i,j) \alpha(i,j) \\ P_{ii} &= q(i,i) + \sum_{k \neq i} q(i,k) (1 - \alpha(i, k)) \end{align*}This new Markov chain is time-reversible and has stationary probabilities \(\pi_i = b(i) / B\) if
\begin{align*} \alpha(i,j) = \min \left( \frac{b(j) q(j,i)} {b(i) q(i,j)} , 1 \right) \end{align*}
See Example 4.39 for how it can be used to sample from the set of all permutations of the numbers \((1, \dots, n)\).
For MCMC we only need to be able to evaluate the probability of the target distribution \(p\) up to a proportionality constant at candidate points, and in return it gives us samples from that distribution. Think of \(q\) as a proposal function that proposes points and \(\alpha\) as an acceptance function that accepts or rejects them. The acceptance function only needs to evaluate the unnormalized probability \(b(x) \propto p(x)\) of the target distribution and the proposal probability \(q(x, x')\).
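Here is a minimal Metropolis–Hastings sketch in Python. It uses a symmetric random-walk proposal, so the \(q(j,i)/q(i,j)\) factor cancels in \(\alpha\); the standard normal target is an illustrative choice and \(b\) can be any unnormalized density.

```python
# Minimal Metropolis-Hastings sketch with a symmetric (random-walk) proposal,
# so alpha reduces to min(b(proposal) / b(current), 1). The standard normal
# target is an illustrative choice; b only needs to be known up to a constant.
import numpy as np

rng = np.random.default_rng(0)

def b(x):
    """Unnormalized target density, b(x) proportional to p(x)."""
    return np.exp(-0.5 * x ** 2)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()      # propose a candidate state
        alpha = min(b(x_prop) / b(x), 1.0)    # acceptance probability
        if rng.random() < alpha:              # accept with probability alpha,
            x = x_prop                        # otherwise stay in the current state
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(50_000)
print(samples[10_000:].mean(), samples[10_000:].var())  # roughly 0 and 1 after burn-in
```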
3. Gibbs Sampling
A special case of the Metropolis–Hastings algorithm.
Objective:
Let \(X = (X_1, ..., X_n)\) be a random vector with probability mass function \(p(x)\). We want to sample \(X\).
Requirement:
The Gibbs sampler assumes that, for each \(i\) and any fixed values \(x_j, j \neq i\), we can generate a random variable \(X\) having probability mass:
\begin{align*} P(X=x) = P(X_i = x | X_j = x_j, j \neq i) \end{align*}Algorithm:
- Take \(x = (x_1, ..., x_n)\)
- Choose one of the coordinates uniformly at random, say \(i\).
- Sample a new value for the \(i\)-th coordinate from its conditional distribution given the values of all the other coordinates, \(P(X_i = x | X_j = x_j, j \neq i)\).
- The candidate state \(y\) is the same as the current state \(x\), but with the \(i\)-th coordinate replaced by the sampled value. The candidate state is always accepted.
Why? From the MH algorithm, we either stay in \(x\) or transition to \(y\) based on \(\alpha(x,y)\), where
\begin{align*} \alpha(x, y) &= \min \left( \frac {p(y) q(y,x)} { p(x) q(x, y)} , 1 \right ) \\ &= 1 \end{align*}because
\begin{align*} q(x, y) &= \frac 1 n P(X_i = y_i | X_j = x_j, j \neq i) \\ &= \frac {p(y)} {n P(X_j = x_j, j \neq i)} \end{align*}and, since \(x\) and \(y\) agree on all coordinates except the \(i\)-th, the same argument gives \(q(y, x) = \frac{p(x)}{n P(X_j = x_j, j \neq i)}\), so the ratio \(\frac{p(y) q(y,x)}{p(x) q(x,y)} = 1\).
See Example 4.40 for how to use Gibbs sampling to generate a set of \(n\) points that are not within a distance \(d\) of each other.
The theory of the Gibbs sampler extends in exactly the same way to continuous distributions.
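As an illustration of the continuous case, here is a minimal Gibbs sampling sketch for a bivariate normal target with correlation \(\rho\), whose full conditionals are the known normals \(X_1 | X_2 = x_2 \sim N(\rho x_2, 1 - \rho^2)\), and symmetrically for \(X_2\). It sweeps the coordinates in order rather than choosing one at random (a common variant); the target and \(\rho\) are illustrative choices.

```python
# Minimal Gibbs sampling sketch for a continuous target: a standard bivariate
# normal with correlation rho. Each coordinate is resampled from its exact
# conditional distribution given the other, so every move is accepted.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
cond_sd = np.sqrt(1 - rho ** 2)  # conditional std dev of each coordinate

def gibbs(n_samples, x1=0.0, x2=0.0):
    samples = []
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, cond_sd)   # sample X1 | X2 = x2
        x2 = rng.normal(rho * x1, cond_sd)   # sample X2 | X1 = x1
        samples.append((x1, x2))
    return np.array(samples)

draws = gibbs(50_000)[10_000:]          # drop burn-in
print(np.corrcoef(draws.T)[0, 1])       # roughly rho
```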