Markov Chain Monte Carlo


Lecture 14

March 18, 2024

Last Class

Bayesian Computation

  • Goal: Sample from posterior distribution to:
    • Capture parametric uncertainty;
    • Compute MAP estimates.
  • Challenging because posterior distributions can have arbitrary, non-standard forms (no closed form in general).

Markov Chain Strategy

  • Generate an appropriate Markov chain whose stationary distribution is the target distribution \(\pi\);
  • Run its dynamics long enough to converge to the stationary distribution;
  • Use the resulting ensemble of states as Monte Carlo samples from \(\pi\) .

Markov Chain Convergence

Given a Markov chain \(\{X_t\}_{t=1, \ldots, T}\) constructed by this procedure with stationary distribution \(\pi\):

  • \(\mathbb{P}(X_t = y) \to \pi(y)\) as \(t \to \infty\)
  • This means the chain can be considered a dependent sample approximately distributed from \(\pi\).
  • The first values (the transient portion) of the chain are highly dependent on the initial value.
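
In practice the transient portion (often called "burn-in" or "warm-up") is discarded before computing summaries. A minimal sketch, assuming a vector of draws (the names samples and n_burnin are illustrative):

samples = randn(10_000)            # stand-in for MCMC output
n_burnin = 1_000                   # transient length to discard (problem-dependent)
post_samples = samples[(n_burnin + 1):end]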

Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm:

  • The foundational MCMC algorithm (named one of the top ten algorithms of the 20th century).
  • Builds a Markov chain based on transitions by:
    • generating proposals for new samples from a conditional proposal distribution \(q(y | x)\);
    • accepting or rejecting those proposals.

The Metropolis-Hastings Algorithm

Given \(X_t = x_t\):

  1. Generate \(Y_t \sim q(y | x_t)\);
  2. Set \(X_{t+1} = Y_t\) with probability \(\rho(x_t, Y_t)\), where \[\rho(x, y) = \min \left\{\frac{\pi(y)}{\pi(x)}\frac{q(x | y)}{q(y | x)}, 1\right\},\] else set \(X_{t+1} = x_t\).
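
In practice this ratio is computed on the log scale for numerical stability. Taking logs of the acceptance rule (a direct restatement, not a new algorithm) gives \[\log \rho(x, y) = \min \left\{\log \pi(y) - \log \pi(x) + \log q(x | y) - \log q(y | x),\ 0\right\},\] and a proposal is accepted when \(\log u < \log \rho(x_t, Y_t)\) for \(u \sim \text{Uniform}(0, 1)\).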

How Simple Is That?

The devil is in the details: performance and efficiency are highly dependent on the choice of \(q\).

Key: There is a tradeoff between exploration and acceptance.

  • Wide proposal: Can make bigger jumps, may be more likely to reject proposals.
  • Narrow proposal: More likely to accept proposals, may not “mix” efficiently.

Proposal Distribution Choice

The original Metropolis et al. (1953) algorithm used symmetric distributions (\(q(y | x) = q(x | y)\)).

Then the acceptance probability reduces to \[\rho = \min \left\{\frac{\pi(y)}{\pi(x)}, 1\right\}.\]

A common choice: \(y \sim \text{Normal}(X_t, \sigma)\) centered around the current point \(X_t\).
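
For example, the Normal random-walk proposal is symmetric because the density of proposing \(y\) from \(x\) equals that of proposing \(x\) from \(y\). A quick illustrative check in Julia (the values are arbitrary):

using Distributions
x, y, σ = 0.5, 1.2, 0.3
pdf(Normal(x, σ), y) ≈ pdf(Normal(y, σ), x)   # true: q(y | x) = q(x | y)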

Julia Implementation

function mh_transition(x_current, σ)
    # generate new proposal from a Normal random walk centered at the current value
    x_proposal = rand(Normal(x_current, σ))
    u = rand()
    # log acceptance ratio (proposal densities cancel for a symmetric proposal)
    log_ρ = log(target_density(x_proposal)) - log(target_density(x_current))
    if log(u) < min(log_ρ, 0)
        y = x_proposal # accept the proposal
    else
        y = x_current # reject and keep the current value
    end
    return y, log(target_density(y))
end
mh_transition (generic function with 1 method)
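
This function assumes Distributions.jl is loaded and that a target_density function is defined elsewhere. A hedged usage sketch with a hypothetical standard normal target:

using Distributions
target_density(x) = pdf(Normal(0, 1), x)    # hypothetical target for illustration
x_new, logp_new = mh_transition(0.0, 0.5)   # one transition from x = 0 with σ = 0.5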

Julia Implementation

function mh_algorithm(n_iter, σ, x₀)
    # initialize storage for samples and log-target values
    samples = zeros(n_iter)
    log_target = zeros(n_iter)
    samples[1] = x₀ # start the chain at the initial value
    log_target[1] = log(target_density(x₀))
    accept_count = 0
    for i = 2:n_iter # iterate the Metropolis-Hastings transition
        samples[i], log_target[i] = mh_transition(samples[i-1], σ)
        if samples[i] != samples[i-1] # the proposal was accepted
            accept_count += 1
        end
    end
    accept_rate = accept_count / (n_iter - 1) # n_iter - 1 transitions attempted
    return samples, log_target, accept_rate
end
mh_algorithm (generic function with 1 method)
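
A usage sketch, again with the hypothetical standard normal target_density from above (the settings are illustrative, not tuned):

using Statistics
samples, log_target, accept_rate = mh_algorithm(10_000, 0.5, 0.0)
mean(samples), accept_rate   # sample mean should be near 0 for this target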

Linear Regression Example

\[ \begin{gather} y = 5 + 2x + \varepsilon \\ \varepsilon \sim \text{Normal}(0, 3) \end{gather} \]

Code
# create trend for data
x = rand(Uniform(0, 20), 20)
y = 5 .+ 2 * x
# sample and add noise
ε = rand(Normal(0, 3), 20)
y .+= ε

p = scatter(x, y, label="Data", xlabel=L"$x$", ylabel=L"$y$", markershape=:o, markersize=10, tickfontsize=16, guidefontsize=18, legendfontsize=16, bottom_margin=10mm, left_margin=5mm, right_margin=5mm) 
plot!(p, size=(600, 550))
Figure 1: Regression data plot

Model Specification

\[ \begin{gather} y = a + bx + \varepsilon \\ \varepsilon \sim \text{Normal}(0, \sigma). \end{gather} \]

This makes the likelihood: \[ y \sim \text{Normal}(a+bx, \sigma). \]

Prior Selection

\[ \begin{gather} a \sim \text{Normal}(0, 2) \\ b \sim \text{Normal}(0, 2) \\ \sigma \sim \text{Half-Normal}(0, 1) \end{gather} \]

Code
# generate data for samples
function gen_data(a, b, σ)
    x = collect(0:20)
    y = a .+ b * x
    # sample and add noise
    ε = rand(Normal(0, σ), length(x))
    y .+= ε
    return y
end

# sample and plot
n_samples = 1000
a = rand(Normal(0, 2), n_samples)
b = rand(Normal(0, 2), n_samples)
σ = rand(truncated(Normal(0, 1), lower=0), n_samples)
y_prior = [gen_data(a[i], b[i], σ[i]) for i in 1:n_samples]
# convert y to a Matrix by vcatting each vector
y_prior = mapreduce(permutedims, vcat, y_prior) 
plt_prior_1 = plot(; ylabel=L"$y$", xlabel=L"$x$",
    tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm, legend=false)
for x in [0, 5, 10, 15, 20]
    boxplot!(plt_prior_1, [x], y_prior[:, x+1], color=:blue)
end
plot!(plt_prior_1, size=(600, 550))
plt_prior_1
Figure 2: Prior predictive plot for regression example.
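
The regression target itself is not shown on these slides; a minimal, unnormalized log-posterior sketch under the likelihood and priors above (the name log_posterior is illustrative, and x, y are the data vectors generated earlier):

using Distributions
function log_posterior(θ, x, y)
    a, b, σ = θ
    σ ≤ 0 && return -Inf                                # enforce σ > 0
    lp = logpdf(Normal(0, 2), a)                        # prior on a
    lp += logpdf(Normal(0, 2), b)                       # prior on b
    lp += logpdf(truncated(Normal(0, 1), lower=0), σ)   # half-normal prior on σ
    lp += sum(logpdf.(Normal.(a .+ b .* x, σ), y))      # Gaussian likelihood
    return lp
end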

Proposal Distribution and Initial Value

To illustrate how the M-H algorithm works, let’s use a proposal \[\mathcal{N}(x_t, 0.01I_3).\]

And let’s start at \[x_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\].
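
A sketch of drawing from this multivariate random-walk proposal (assuming Distributions.jl and LinearAlgebra):

using Distributions, LinearAlgebra
x_current = [1.0, 1.0, 1.0]                              # x₀ from above
proposal = MvNormal(x_current, 0.01 * Matrix(I, 3, 3))   # N(xₜ, 0.01 I₃)
x_proposal = rand(proposal)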

First Proposal

Current:

\[X_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}0.94 \\ 1.07 \\ 0.82\end{pmatrix}\]

\[\text{log-posterior} = -3433\]

\[\rho \approx 0 \Rightarrow X_1 = X_0\]
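
For a symmetric proposal the acceptance probability follows directly from the difference in log-posteriors: \[\rho = \min\left\{\exp(-3433 - (-2851)),\ 1\right\} = \min\left\{e^{-582},\ 1\right\} \approx 0,\] so this proposal is essentially never accepted and the chain stays put.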

Another Proposal

Current:

\[X_1 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}1.24 \\ 1.05 \\ 1.04\end{pmatrix}\]

\[\text{log-posterior} = -2165\]

\[\rho = 1 \Rightarrow X_2 = y\]

1,000 Iterations

Code
# run the M-H sampler on the regression posterior; this call assumes a multivariate
# version of mh_algorithm (not shown above) with proposal N(xₜ, 0.01 I₃)
samples, lpost, α = mh_algorithm(100000, [1; 1; 1], x, y)

p = plot(samples[1:1000, :], layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, ylabel="Value", xlabel="Iteration", left_margin=10mm)
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value")
plot!(p, size=(1300, 400))
Figure 3: First 1,000 iterations of the MCMC example

100,000 Iterations

Code
p = plot(samples, layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, xlabel="Iteration", left_margin=10mm, xticks=0:50000:100000, right_margin=15mm, ylabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :bottomright :false])
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
plot!(p, size=(1300, 400))
Figure 4: 100,000 iterations of the MCMC example

Acceptance rate: 0.50435

Marginal Distributions

Code
p = histogram(samples[10001:end, :], layout=(1, 3), label=false, guidefontsize=18, tickfontsize=16, legendfontsize=14, bottom_margin=15mm, ylabel="Count", left_margin=10mm, xlabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :false :topright], color=:lightblue)
vline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
vline!(p, mapslices(mean, samples[10001:end, :], dims=1), color=:purple, linestyle=:dash, label="Posterior Mean", linewidth=3)
plot!(p, size=(1300, 450))
Figure 5: Marginal posterior distributions from 100,000 iterations of the MCMC example

Considerations for Implementation

Proposal Distribution Example

Code
# target density: modified Normal(0, 1) PDF
function target_distribution(x) 
    return sin(x)^2 * sin(2x)^2 * pdf(Normal(0, 1), x)
end

# compute normalizing constant by numerical integration
marg_dens, error = quadgk(x -> target_distribution(x), -Inf, Inf)
# plot target density
x = -π:0.01:π
p_base = plot(x, target_distribution.(x) ./ marg_dens, linewidth=3, label="Target Density", tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm)
plot!(xlabel=L"x", ylabel="Density")
# pick current value
x_current = 0.5
vline!([x_current], color=:black, linewidth=2, label=L"$x_t$")
Figure 6: Example target density for Metropolis-Hastings

Proposal Distribution Examples

Code
# plot proposal distributions
p1 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.1), x) ./ 10, color=:purple, label="Scaled Narrow Proposal", linewidth=2, linestyle=:dash)
p2 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.5), x) ./ 2, color=:red, label="Scaled Wide Proposal", linewidth=2, linestyle=:dash)
plot!(p1, size=(600, 500))
plot!(p2, size=(600, 500))
display(p1)
display(p2)
Figure 7: Example target density for Metropolis-Hastings with (a) a scaled narrow proposal and (b) a scaled wide proposal.

Sampling Efficiency

Two common measures of sampling efficiency:

  • Acceptance Rate: Rate at which proposals are accepted
    • “Optimally” 30-45% (depending on number of parameters)
  • Effective Sample Size (ESS): Accounts for autocorrelation \(\rho_t\) across samples \[N_\text{eff} = \frac{N}{1+2\sum_{t=1}^\infty \rho_t}\]
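
A rough ESS estimate can be computed from the empirical autocorrelations. A minimal sketch using StatsBase.autocor (truncating the sum at the first negative autocorrelation is a common heuristic, and ess_estimate is an illustrative name):

using StatsBase
function ess_estimate(samples; max_lag=min(length(samples) - 1, 250))
    ρ = autocor(samples, 1:max_lag)          # empirical autocorrelations at lags 1..max_lag
    cutoff = findfirst(<(0), ρ)              # first negative autocorrelation (may be nothing)
    ρ_sum = sum(ρ[1:(isnothing(cutoff) ? max_lag : cutoff - 1)])
    return length(samples) / (1 + 2 * ρ_sum)
end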

Sampling Efficiency Example

Figure: MCMC sampling for various proposals.

Autocorrelation of Chains

Figure: MCMC sampling for various proposals.

ESS by Proposal Variance for Example

Figure: MCMC sampling for various proposals.

Key Points, Upcoming Schedule, and References

Key Points (Metropolis-Hastings)

  • Construct ergodic and reversible Markov chains with the posterior as their stationary distribution.
  • Metropolis-Hastings: conceptually simple algorithm, but implementation plays a major role.
  • Proposal distribution plays a large role in acceptance rate and effective sample size.

Next Classes

Wednesday: MCMC Examples

Monday: MCMC Lab (No exercises these weeks)

Next Wednesday: Literature Presentations (email slides by 9pm Tuesday night).

Assessments

  • Homework 3: Due 3/22

References

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21, 1087–1092. https://doi.org/10.1063/1.1699114