Markov Chain Monte Carlo


Lecture 14

March 18, 2024

Last Class

Bayesian Computation

  • Goal: Sample from posterior distribution to:
    • Capture parametric uncertainty;
    • Compute MAP estimates.
  • Challenging because posterior distributions can have arbitrary, non-standard forms (no closed form in general).

Markov Chain Strategy

  • Generate an appropriate Markov chain whose stationary distribution is the target distribution \(\pi\);
  • Run its dynamics long enough to converge to the stationary distribution;
  • Use the resulting ensemble of states as Monte Carlo samples from \(\pi\) .

Markov Chain Convergence

Given a Markov chain \(\{X_t\}_{t=1, \ldots, T}\) constructed by this procedure with stationary distribution \(\pi\):

  • \(\mathbb{P}(X_t = y) \to \pi(y)\) as \(t \to \infty\)
  • This means the chain can be considered a dependent sample approximately distributed from \(\pi\).
  • The first values (the transient portion) of the chain are highly dependent on the initial value.
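
In practice the transient portion (often called "burn-in" or "warm-up") is discarded before computing summaries. A minimal sketch, assuming a vector of draws (the names samples and n_burnin are illustrative):

samples = randn(10_000)            # stand-in for MCMC output
n_burnin = 1_000                   # transient length to discard (problem-dependent)
post_samples = samples[(n_burnin + 1):end]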

Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm:

  • The foundational MCMC algorithm (named one of the top ten algorithms of the 20th century).
  • Builds a Markov chain based on transitions by:
    • generating proposals for new samples from a conditional proposal distribution \(q(y | x)\);
    • accepting or rejecting those proposals.

The Metropolis-Hastings Algorithm

Given \(X_t = x_t\):

  1. Generate \(Y_t \sim q(y | x_t)\);
  2. Set \(X_{t+1} = Y_t\) with probability \(\rho(x_t, Y_t)\), where \[\rho(x, y) = \min \left\{\frac{\pi(y)}{\pi(x)}\frac{q(x | y)}{q(y | x)}, 1\right\},\] else set \(X_{t+1} = x_t\).
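
In practice this ratio is computed on the log scale for numerical stability. Taking logs of the acceptance rule (a direct restatement, not a new algorithm) gives \[\log \rho(x, y) = \min \left\{\log \pi(y) - \log \pi(x) + \log q(x | y) - \log q(y | x),\ 0\right\},\] and a proposal is accepted when \(\log u < \log \rho(x_t, Y_t)\) for \(u \sim \text{Uniform}(0, 1)\).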

How Simple Is That?

The devil is in the details: performance and efficiency are highly dependent on the choice of \(q\).

Key: There is a tradeoff between exploration and acceptance.

  • Wide proposal: Can make bigger jumps, may be more likely to reject proposals.
  • Narrow proposal: More likely to accept proposals, may not “mix” efficiently.

Proposal Distribution Choice

The original Metropolis et al. (1953) algorithm used symmetric distributions (\(q(y | x) = q(x | y)\)).

Then the acceptance probability reduces to \[\rho = \min \left\{\frac{\pi(y)}{\pi(x)}, 1\right\}.\]

A common choice: \(y \sim \text{Normal}(X_t, \sigma)\) centered around the current point \(X_t\).
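
For example, the Normal random-walk proposal is symmetric because the density of proposing \(y\) from \(x\) equals that of proposing \(x\) from \(y\). A quick illustrative check in Julia (the values are arbitrary):

using Distributions
x, y, σ = 0.5, 1.2, 0.3
pdf(Normal(x, σ), y) ≈ pdf(Normal(y, σ), x)   # true: q(y | x) = q(x | y)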

Julia Implementation

function mh_transition(x_current, σ)
    # generate new proposal from a Normal random walk centered at the current value
    x_proposal = rand(Normal(x_current, σ))
    u = rand()
    # log acceptance ratio (proposal densities cancel for a symmetric proposal)
    log_ρ = log(target_density(x_proposal)) - log(target_density(x_current))
    if log(u) < min(log_ρ, 0)
        y = x_proposal # accept the proposal
    else
        y = x_current # reject and keep the current value
    end
    return y, log(target_density(y))
end
mh_transition (generic function with 1 method)
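
This function assumes Distributions.jl is loaded and that a target_density function is defined elsewhere. A hedged usage sketch with a hypothetical standard normal target:

using Distributions
target_density(x) = pdf(Normal(0, 1), x)    # hypothetical target for illustration
x_new, logp_new = mh_transition(0.0, 0.5)   # one transition from x = 0 with σ = 0.5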

Julia Implementation

function mh_algorithm(n_iter, σ, x₀)
    # initialize storage for samples and log-target values
    samples = zeros(n_iter)
    log_target = zeros(n_iter)
    samples[1] = x₀ # start the chain at the initial value
    log_target[1] = log(target_density(x₀))
    accept_count = 0
    for i = 2:n_iter # iterate the Metropolis-Hastings transition
        samples[i], log_target[i] = mh_transition(samples[i-1], σ)
        if samples[i] != samples[i-1] # the proposal was accepted
            accept_count += 1
        end
    end
    accept_rate = accept_count / (n_iter - 1) # n_iter - 1 transitions attempted
    return samples, log_target, accept_rate
end
mh_algorithm (generic function with 1 method)
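
A usage sketch, again with the hypothetical standard normal target_density from above (the settings are illustrative, not tuned):

using Statistics
samples, log_target, accept_rate = mh_algorithm(10_000, 0.5, 0.0)
mean(samples), accept_rate   # sample mean should be near 0 for this target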

Linear Regression Example

\[ \begin{gather} y = 5 + 2x + \varepsilon \\ \varepsilon \sim \text{Normal}(0, 3) \end{gather} \]

Code
# create trend for data
x = rand(Uniform(0, 20), 20)
y = 5 .+ 2 * x
# sample and add noise
ε = rand(Normal(0, 3), 20)
y .+= ε

p = scatter(x, y, label="Data", xlabel=L"$x$", ylabel=L"$y$", markershape=:o, markersize=10, tickfontsize=16, guidefontsize=18, legendfontsize=16, bottom_margin=10mm, left_margin=5mm, right_margin=5mm) 
plot!(p, size=(600, 550))
Figure 1: Regression data plot

Model Specification

\[ \begin{gather} y = a + bx + \varepsilon \\ \varepsilon \sim \text{Normal}(0, \sigma). \end{gather} \]

This makes the likelihood: \[ y \sim \text{Normal}(a+bx, \sigma). \]

Prior Selection

\[ \begin{gather} a \sim \text{Normal}(0, 2) \\ b \sim \text{Normal}(0, 2) \\ \sigma \sim \text{Half-Normal}(0, 1) \end{gather} \]

Code
# generate data for samples
function gen_data(a, b, σ)
    x = collect(0:20)
    y = a .+ b * x
    # sample and add noise
    ε = rand(Normal(0, σ), length(x))
    y .+= ε
    return y
end

# sample and plot
n_samples = 1000
a = rand(Normal(0, 2), n_samples)
b = rand(Normal(0, 2), n_samples)
σ = rand(truncated(Normal(0, 1), lower=0), n_samples)
y_prior = [gen_data(a[i], b[i], σ[i]) for i in 1:n_samples]
# convert y to a Matrix by vcatting each vector
y_prior = mapreduce(permutedims, vcat, y_prior) 
plt_prior_1 = plot(; ylabel=L"$y$", xlabel=L"$x$",
    tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm, legend=false)
for x in [0, 5, 10, 15, 20]
    boxplot!(plt_prior_1, [x], y_prior[:, x+1], color=:blue)
end
plot!(plt_prior_1, size=(600, 550))
plt_prior_1
Figure 2: Prior predictive plot for regression example.
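
The regression target itself is not shown on these slides; a minimal, unnormalized log-posterior sketch under the likelihood and priors above (the name log_posterior is illustrative, and x, y are the data vectors generated earlier):

using Distributions
function log_posterior(θ, x, y)
    a, b, σ = θ
    σ ≤ 0 && return -Inf                                # enforce σ > 0
    lp = logpdf(Normal(0, 2), a)                        # prior on a
    lp += logpdf(Normal(0, 2), b)                       # prior on b
    lp += logpdf(truncated(Normal(0, 1), lower=0), σ)   # half-normal prior on σ
    lp += sum(logpdf.(Normal.(a .+ b .* x, σ), y))      # Gaussian likelihood
    return lp
end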

Proposal Distribution and Initial Value

To illustrate how the M-H algorithm works, let’s use a proposal \[\mathcal{N}(x_t, 0.01I_3).\]

And let’s start at \[x_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\].
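
A sketch of drawing from this multivariate random-walk proposal (assuming Distributions.jl and LinearAlgebra):

using Distributions, LinearAlgebra
x_current = [1.0, 1.0, 1.0]                              # x₀ from above
proposal = MvNormal(x_current, 0.01 * Matrix(I, 3, 3))   # N(xₜ, 0.01 I₃)
x_proposal = rand(proposal)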

First Proposal

Current:

\[X_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}0.94 \\ 1.07 \\ 0.82\end{pmatrix}\]

\[\text{log-posterior} = -3433\]

\[\rho \approx 0 \Rightarrow X_1 = X_0\]
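
For a symmetric proposal the acceptance probability follows directly from the difference in log-posteriors: \[\rho = \min\left\{\exp(-3433 - (-2851)),\ 1\right\} = \min\left\{e^{-582},\ 1\right\} \approx 0,\] so this proposal is essentially never accepted and the chain stays put.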

Another Proposal

Current:

\[X_1 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}1.24 \\ 1.05 \\ 1.04\end{pmatrix}\]

\[\text{log-posterior} = -2165\]

\[\rho = 1 \Rightarrow X_2 = y\]

1,000 Iterations

Code
# run the M-H sampler on the regression posterior; this call assumes a multivariate
# version of mh_algorithm (not shown above) with proposal N(xₜ, 0.01 I₃)
samples, lpost, α = mh_algorithm(100000, [1; 1; 1], x, y)

p = plot(samples[1:1000, :], layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, ylabel="Value", xlabel="Iteration", left_margin=10mm)
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value")
plot!(p, size=(1300, 400))
Figure 3: First 1,000 iterations of the MCMC example

100,000 Iterations

Code
p = plot(samples, layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, xlabel="Iteration", left_margin=10mm, xticks=0:50000:100000, right_margin=15mm, ylabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :bottomright :false])
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
plot!(p, size=(1300, 400))
Figure 4: 100,000 iterations of the MCMC example

Acceptance rate: 0.50435

Marginal Distributions

Code
p = histogram(samples[10001:end, :], layout=(1, 3), label=false, guidefontsize=18, tickfontsize=16, legendfontsize=14, bottom_margin=15mm, ylabel="Count", left_margin=10mm, xlabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :false :topright], color=:lightblue)
vline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
vline!(p, mapslices(mean, samples[10001:end, :], dims=1), color=:purple, linestyle=:dash, label="Posterior Mean", linewidth=3)
plot!(p, size=(1300, 450))
Figure 5: Marginal posterior distributions from 100,000 iterations of the MCMC example

Considerations for Implementation

Proposal Distribution Example

Code
# target density: modified Normal(0, 1) PDF
function target_distribution(x) 
    return sin(x)^2 * sin(2x)^2 * pdf(Normal(0, 1), x)
end

# compute normalizing constant by numerical integration
marg_dens, error = quadgk(x -> target_distribution(x), -Inf, Inf)
# plot target density
x = -π:0.01:π
p_base = plot(x, target_distribution.(x) ./ marg_dens, linewidth=3, label="Target Density", tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm)
plot!(xlabel=L"x", ylabel="Density")
# pick current value
x_current = 0.5
vline!([x_current], color=:black, linewidth=2, label=L"$x_t$")
Figure 6: Example target density for Metropolis-Hastings

Proposal Distribution Examples

Code
# plot proposal distributions
p1 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.1), x) ./ 10, color=:purple, label="Scaled Narrow Proposal", linewidth=2, linestyle=:dash)
p2 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.5), x) ./ 2, color=:red, label="Scaled Wide Proposal", linewidth=2, linestyle=:dash)
plot!(p1, size=(600, 500))
plot!(p2, size=(600, 500))
display(p1)
display(p2)
Figure 7: Example target density for Metropolis-Hastings with (a) a scaled narrow proposal and (b) a scaled wide proposal.

Sampling Efficiency

Two common measures of sampling efficiency:

  • Acceptance Rate: Rate at which proposals are accepted
    • “Optimally” 30-45% (depending on number of parameters)
  • Effective Sample Size (ESS): Accounts for autocorrelation \(\rho_t\) across samples \[N_\text{eff} = \frac{N}{1+2\sum_{t=1}^\infty \rho_t}\]
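
A rough ESS estimate can be computed from the empirical autocorrelations. A minimal sketch using StatsBase.autocor (truncating the sum at the first negative autocorrelation is a common heuristic, and ess_estimate is an illustrative name):

using StatsBase
function ess_estimate(samples; max_lag=min(length(samples) - 1, 250))
    ρ = autocor(samples, 1:max_lag)          # empirical autocorrelations at lags 1..max_lag
    cutoff = findfirst(<(0), ρ)              # first negative autocorrelation (may be nothing)
    ρ_sum = sum(ρ[1:(isnothing(cutoff) ? max_lag : cutoff - 1)])
    return length(samples) / (1 + 2 * ρ_sum)
end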

Sampling Efficiency Example

Figure: MCMC sampling for various proposals.

Autocorrelation of Chains

Figure: MCMC sampling for various proposals.

ESS by Proposal Variance for Example

Figure: MCMC sampling for various proposals.

Key Points, Upcoming Schedule, and References

Key Points (Metropolis-Hastings)

  • Construct ergodic and reversible Markov chains with the posterior as their stationary distribution.
  • Metropolis-Hastings: conceptually simple algorithm, but implementation plays a major role.
  • Proposal distribution plays a large role in acceptance rate and effective sample size.

Next Classes

Wednesday: MCMC Examples

Monday: MCMC Lab (No exercises these weeks)

Next Wednesday: Literature Presentations (email slides by 9pm Tuesday night).

Assessments

  • Homework 3: Due 3/22

References

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21, 1087–1092. https://doi.org/10.1063/1.1699114