Generate an appropriate Markov chain so that its stationary distribution is the target distribution \(\pi\);
Run its dynamics long enough to converge to the stationary distribution;
Use the resulting ensemble of states as Monte Carlo samples from \(\pi\) .
Markov Chain Convergence
Given a Markov chain \(\{X_t\}_{t=1, \ldots, T}\) produced by this procedure to sample from distribution \(\pi\):
\(\mathbb{P}(X_t = y) \to \pi(y)\) as \(t \to \infty\)
This means the chain can be considered a dependent sample approximately distributed from \(\pi\).
The first values (the transient portion) of the chain are highly dependent on the initial value.
The Metropolis-Hastings Algorithm
Given \(X_t = x_t\):
Generate \(Y_t \sim q(y | x_t)\);
Set \(X_{t+1} = Y_t\) with probability \(\rho(x_t, Y_t)\), where \[\rho(x, y) = \min \left\{\frac{\pi(y)}{\pi(x)}\frac{q(x | y)}{q(y | x)}, 1\right\},\] else set \(X_{t+1} = x_t\).
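The two steps above can be sketched in a few lines. A minimal Python sketch (the deck's own examples use Julia) with a Gaussian random-walk proposal, which is symmetric so the Hastings ratio \(q(x|y)/q(y|x)\) cancels to 1; the target and step size here are illustrative choices:

```python
import math
import random

def metropolis_hastings(log_pi, x0, n_steps, step=1.0, seed=42):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the
    Hastings correction q(x|y)/q(y|x) equals 1."""
    rng = random.Random(seed)
    chain, x = [x0], x0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)                     # propose Y_t ~ q(. | x_t)
        rho = math.exp(min(log_pi(y) - log_pi(x), 0.0))  # acceptance probability
        if rng.random() < rho:
            x = y                                        # accept: X_{t+1} = Y_t
        chain.append(x)                                  # else: X_{t+1} = x_t
    return chain

# target: standard normal; an unnormalized log-density is all we need
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20_000)
post = chain[5_000:]  # discard a generous burn-in
mean = sum(post) / len(post)
```

Note that only the ratio \(\pi(y)/\pi(x)\) appears, so the normalizing constant of \(\pi\) is never needed.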
Proposals
“Goldilocks” proposal: acceptance rate 30-45%.
The proposal distribution \(q\) plays a big role in the effective sample size (ESS): \[N_\text{eff} = \frac{N}{1+2\sum_{t=1}^\infty \rho_t},\] where \(\rho_t\) is the lag-\(t\) autocorrelation of the chain (not the acceptance probability above).
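The ESS formula can be estimated from the empirical autocorrelations. A minimal Python sketch that truncates the sum at the first negative lag, a simple heuristic (MCMC packages use more careful estimators):

```python
def effective_sample_size(x):
    """Estimate N_eff = N / (1 + 2 * sum of lag-t autocorrelations)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return float(n)
    acf_sum = 0.0
    for t in range(1, n):
        # lag-t empirical autocorrelation rho_t
        rho = sum((x[i] - mean) * (x[i + t] - mean) for i in range(n - t)) / (n * var)
        if rho < 0:  # truncate the sum at the first negative lag
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)
```

A sticky, highly correlated chain gives \(N_\text{eff} \ll N\), while a nearly independent chain gives \(N_\text{eff} \approx N\).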
Sampling Efficiency Example
MCMC Sampling for Various Proposals
MCMC Convergence
Transient Chain Portion
What do we do with the transient portion of the chain?
Discard as burn-in;
Just run the chain longer.
How To Identify Convergence?
Short answer: there is no guarantee! Judgment is based on an accumulation of evidence from various heuristics.
The good news — getting the precise “right” end of the transient chain doesn’t matter.
If a few transient iterations remain, the effect will be washed out with a large enough post-convergence chain.
Heuristics for Convergence
Compare distribution (histogram/kernel density plot) after half of the chain to full chain.
Run multiple chains from “overdispersed” starting points
Compare intra-chain and inter-chain variances
Summarized as \(\hat{R}\) statistic: closer to 1 implies better convergence.
Can also check distributions across multiple chains vs. the half-chain check.
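The intra-/inter-chain variance comparison behind \(\hat{R}\) can be sketched as the classic Gelman-Rubin computation; a minimal Python sketch (modern practice adds chain splitting and rank normalization, omitted here):

```python
def gelman_rubin(chains):
    """Potential scale reduction factor R-hat from m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # between-chain variance of the per-chain means
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # average within-chain sample variance
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled variance estimate
    return (var_hat / W) ** 0.5

# chains exploring the same distribution -> R-hat near 1
same = [[((i * 7 + k) % 10) / 10 for i in range(100)] for k in range(4)]
# chains stuck around different means -> R-hat well above 1
diff = [[k + ((i * 7) % 10) / 10 for i in range(100)] for k in range(4)]
```

When the chains have mixed, the between-chain variance is small relative to the within-chain variance and \(\hat{R} \to 1\).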
On Multiple Chains
Unless a specific scheme is used, multiple chains are not a solution to convergence issues: each individual chain still needs to converge, with its burn-in discarded or diluted by a long enough run.
This means multiple chains are more useful for diagnostics, but once they’ve all been run long enough, can mix samples freely.
Heuristics for Convergence
If you’re more interested in the mean estimate, you can also look at its stability by iteration or at the Monte Carlo standard error.
Look at traceplots; do you see sudden “jumps”?
When in doubt, run the chain longer.
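The Monte Carlo standard error mentioned above can be estimated with batch means, which accounts for autocorrelation without estimating it directly; a minimal Python sketch (the batch count is an illustrative choice):

```python
def mcse_batch_means(x, n_batches=20):
    """Monte Carlo standard error of the mean via non-overlapping batch means:
    batches longer than the correlation time have nearly independent means."""
    n = len(x)
    b = n // n_batches  # batch size
    means = [sum(x[i * b:(i + 1) * b]) / b for i in range(n_batches)]
    grand = sum(means) / n_batches
    var_batch = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return (b * var_batch / n) ** 0.5
```

If this error is small relative to the scale of the estimate, running longer buys little for the mean.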
Increasing Efficiency
Adaptive Metropolis-Hastings
Adjust proposal density to hit target acceptance rate.
Need to be cautious about detailed balance.
Typical strategy is to adapt for a portion of the initial chain (part of the burn-in), then run longer with that proposal.
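That strategy might look like the following Robbins-Monro-style Python sketch (the target rate, gain schedule, and standard-normal target are illustrative assumptions, not from the source): adapt the proposal scale with a decaying gain during burn-in, then freeze it, since adapting forever would violate detailed balance.

```python
import math
import random

def adapt_step(log_pi, x0, n_adapt=3000, target=0.4, seed=1):
    """Tune a random-walk proposal scale toward a target acceptance rate.
    The decaying-gain update is confined to burn-in; the returned scale
    is then frozen for the main run to preserve detailed balance."""
    rng = random.Random(seed)
    x, log_step = x0, 0.0
    recent = []  # acceptance indicators
    for i in range(1, n_adapt + 1):
        y = x + rng.gauss(0.0, math.exp(log_step))
        accepted = rng.random() < math.exp(min(log_pi(y) - log_pi(x), 0.0))
        if accepted:
            x = y
        recent.append(1.0 if accepted else 0.0)
        # grow the step if accepting too often, shrink if too rarely
        log_step += (recent[-1] - target) / i ** 0.6
    return math.exp(log_step), sum(recent[-500:]) / 500

step, recent_rate = adapt_step(lambda z: -0.5 * z * z, 0.0)
```

The recent acceptance rate settles near the target, and the frozen `step` is then used in a standard (non-adaptive) Metropolis-Hastings run.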
Hamiltonian Monte Carlo
Idea: Use proposals which steer towards “typical set” without collapsing towards the mode (based on Hamiltonian vector field);
Requires gradient information: can be obtained through autodifferentiation; challenging for external models;
Can be very efficient due to potential for anti-correlated samples, but very sensitive to parameterization.
Same principles for evaluating convergence apply.
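The Hamiltonian dynamics behind these proposals are simulated with the leapfrog integrator. A minimal Python sketch for a 1-D standard normal target, where the gradient is available in closed form (a full HMC sampler would add momentum resampling and a Metropolis accept/reject correction):

```python
def leapfrog(q, p, grad_log_pi, eps, n_steps):
    """Leapfrog integration of Hamiltonian dynamics for position q and
    momentum p. It is symplectic and reversible, which keeps the energy
    nearly conserved and makes the HMC proposal valid."""
    p = p + 0.5 * eps * grad_log_pi(q)      # half-step momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                     # full-step position
        p = p + eps * grad_log_pi(q)        # full-step momentum
    q = q + eps * p
    p = p + 0.5 * eps * grad_log_pi(q)      # final half-step momentum
    return q, p

# standard normal target: log pi(q) = -q^2/2, gradient -q,
# Hamiltonian H = q^2/2 + p^2/2 (unit mass)
q, p = leapfrog(1.0, 0.5, lambda q: -q, eps=0.1, n_steps=10)
energy_drift = abs((q * q + p * p) / 2 - (1.0 ** 2 + 0.5 ** 2) / 2)
```

The near-zero energy drift is what allows HMC to make long, distant moves that are still accepted with high probability.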
MCMC Example: Modeling Storm Surge Extremes
Data
Code
```julia
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end
dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset],
    residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :],
    groupby(DataFrames.transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])

# make plots
p1 = plot(dat_annmax.Year, dat_annmax.residual;
    xlabel="Year", ylabel="Annual Max Tide Level (mm)", label=false,
    marker=:circle, markersize=5, tickfontsize=16, guidefontsize=18)
p2 = histogram(dat_annmax.residual;
    normalize=:pdf, orientation=:horizontal, label=false, xlabel="PDF",
    xlims=(0, 0.006), ylabel="", yticks=[], xticks=[], tickfontsize=16, guidefontsize=18)
l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1000, 1700), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))
```
Figure 2: Annual maxima surge data from the San Francisco, CA tide gauge.
```julia
surge_chain = let # variables defined in a let...end block are temporary
    model = sf_surge(dat_annmax.residual) # initialize model with data
    sampler = NUTS() # use the No-U-Turn Sampler; there are other options
    nsamples = 10_000
    sample(model, sampler, nsamples; drop_warmup=true)
end
summarystats(surge_chain)
```
```julia
surge_chain = let # variables defined in a let...end block are temporary
    model = sf_surge(dat_annmax.residual) # initialize model with data
    sampler = NUTS() # use the No-U-Turn Sampler; there are other options
    nsamples = 10_000
    nchains = 4
    sample(model, sampler, MCMCThreads(), nsamples, nchains; drop_warmup=true)
end
gelmandiag(surge_chain)
```
Gelman, Rubin, and Brooks diagnostic

  parameters     psrf   psrfci
      Symbol  Float64  Float64

           μ   1.0000   1.0000
           σ   1.0000   1.0001
           ξ   1.0000   1.0000
Plotting Multiple Chains
Code
```julia
plot(surge_chain)
plot!(size=(1200, 500))
```
Figure 8: Sampler visualization for multiple surge chains
Key Points and Upcoming Schedule
Key Points (Convergence)
Must rely on “accumulation of evidence” from heuristics for determination about convergence to stationary distribution.
Transient portion of chain: Meh. Some people worry about this too much. Discard or run the chain longer.
Parallelizing solves few problems, but running multiple chains can be useful for diagnostics.
Next Classes
Monday: MCMC Lab (No exercises these weeks)
Next Wednesday: Literature Presentations (email slides by 9pm Tuesday night).
Assessments
Homework 3: Due 3/22
References
References
Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science, 7(4), 457–472. https://doi.org/10.1214/ss/1177011136