Model-Data Discrepancy and Bayesian Statistics


Lecture 06

February 7, 2024

Last Class(es)

Model-Data Discrepancy

Systematic mismatch between model and system state (not observation error!)

\[\mathbf{y} = \underbrace{F(\mathbf{x}; \mathbf{\theta})}_\text{simulation} + \underbrace{\zeta(\mathbf{x}; \mathbf{\theta})}_\text{discrepancy} + \underbrace{\mathbf{\varepsilon}}_\text{observation error}.\]

Non-Identifiability

Inability to distinguish between statistical parameters, e.g.

\[\zeta_t + \varepsilon_t \sim \mathcal{N}(0, \sqrt{\omega^2 + \sigma^2})\]
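
To see the non-identifiability concretely, here is a minimal sketch (with made-up residuals and parameter values): only the sum \(\omega^2 + \sigma^2\) enters the likelihood, so different \((\sigma, \omega)\) pairs with the same total variance are indistinguishable.

using Distributions

# made-up residuals standing in for y .- F(x; θ)
residuals = randn(100)
# iid likelihood in which only the total variance σ² + ω² appears
loglik(σ, ω) = sum(logpdf.(Normal(0, sqrt(σ^2 + ω^2)), residuals))
loglik(0.3, 0.4) ≈ loglik(0.5, 0.0) # true: both imply a total standard deviation of 0.5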

Discrepancy Example

EBM With AR(1) Discrepancy

\[ \begin{gather*} y_t = \text{EBM}(F_t; \theta) + \zeta_t + \varepsilon_t \\ \zeta_t = \rho \zeta_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma)\\ \varepsilon_t \sim \mathcal{N}(0, \omega_t) \end{gather*} \]

where \(\omega_t\) is estimated as half the width of the 95% confidence interval of the temperature data.
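
In code, the observation-error standard deviations used below (`temp_sd`) can be built directly from this definition; a sketch, assuming hypothetical vectors `temp_lower_95` and `temp_upper_95` holding the interval endpoints:

# hypothetical names for the reported 95% interval bounds of the temperature data
temp_sd = (temp_upper_95 .- temp_lower_95) ./ 2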

Covariance Matrix

The likelihood treats the total residual as multivariate normal, \(\zeta + \varepsilon \sim \mathcal{N}(\mathbf{0}, \Sigma)\), where

\[ \begin{gather*} \Sigma = V + D \\ D = \text{diag}(\omega_t^2), \quad V = \frac{\sigma^2}{1-\rho^2}\begin{pmatrix}1 & \rho & \ldots & \rho^{T-1} \\ \rho & 1 & \ldots & \rho^{T-2} \\ \rho^2 & \rho & \ldots & \rho^{T-3} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \ldots & 1\end{pmatrix} \end{gather*} \]

In Code…

# σ is the standard deviation of the AR(1) innovations, ρ the autocorrelation coefficient, and y_err the vector of observation-error standard deviations
function ar_covariance_mat(σ, ρ, y_err)
    H = abs.((1:length(y_err)) .- (1:length(y_err))') # matrix of lags |i - j|
    ζ_var = σ^2 / (1-ρ^2) # stationary variance of the AR(1) process
    Σ = ζ_var * ρ.^H # AR(1) covariance matrix V
    for i in 1:length(y_err)
        Σ[i, i] += y_err[i]^2 # add the observation-error variances D on the diagonal
    end
    return Σ
end

Maximize Likelihood

# p are the model parameters, σ and ρ the AR(1) parameters, y the data, m the model function, and y_err the observation-error standard deviations
function ar_discrep_log_likelihood(p, σ, ρ, y, m, y_err)
    y_pred = m(p)
    residuals = y_pred .- y
    Σ = ar_covariance_mat(σ, ρ, y_err) # combined discrepancy + observation-error covariance
    ll = logpdf(MvNormal(zeros(length(y)), Σ), residuals)
    return ll
end

ebm_wrap(params) = ebm(forcing_non_aerosol_85[idx], forcing_aerosol_85[idx], p = params)

# maximize log-likelihood within some range
# important to make everything a Float instead of an Int 
lower = [1.0, 50.0, 0.0, 0.0, -1.0]
upper = [4.0, 150.0, 2.0, 10.0, 1.0]
p0 = [2.0, 100.0, 1.0, 1.0, 0.0]
result = Optim.optimize(params -> -ar_discrep_log_likelihood(params[1:end-2], params[end-1], params[end], temp_obs, ebm_wrap, temp_sd), lower, upper, p0)
θ_discrep = result.minimizer

Comparison of Model Fits

Normal IID:

4-element Vector{Float64}:
  1.943746862749113
 86.39077197089858
  0.789333067313891
  0.12104289634011653

Total-Residual AR(1):

5-element Vector{Float64}:
   2.0068089111198297
 109.38026355047445
   0.7915448981319909
   0.10196962494870278
   0.5421594991244063

Discrepancy AR(1):

5-element Vector{Float64}:
  1.9063093045190942
 88.29298074263143
  0.7492605985605157
  0.07978677011247291
  0.537951050697555

Hindcast

Figure 1: Hindcasts from different EBM fits.

Projections (RCP 2.6)

Figure 2: Projections from different EBM fits.

Bayesian Statistics

Prior Information

So far: no way to use prior information about parameters (other than bounds on MLE optimization).

Bayes’ Rule

Original version (Bayes, 1763):

\[P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \quad \text{if} \quad P(B) \neq 0.\]
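
As a quick numerical illustration (with hypothetical numbers): suppose a disease has prevalence \(P(D) = 0.01\) and a test has detection probability \(P(+ \mid D) = 0.95\) and false-positive probability \(P(+ \mid \neg D) = 0.05\). Then

\[P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \neg D)\, P(\neg D)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.16.\]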

Bayes’ Rule

“Modern” version (Laplace, 1774):

\[p(\theta | y) = \frac{p(y | \theta)}{p(y)} p(\theta)\]

Bayes’ Rule

“Modern” version (Laplace, 1774):

\[\underbrace{{p(\theta | y)}}_{\text{posterior}} = \frac{\overbrace{p(y | \theta)}^{\text{likelihood}}}{\underbrace{p(y)}_\text{normalization}} \overbrace{p(\theta)}^\text{prior}\]

On The Normalizing Constant

The normalizing constant (also called the marginal likelihood) is the integral \[p(y) = \int_\Theta p(y | \theta) p(\theta) d\theta.\]

Since this does not depend on \(\theta\), it can often be ignored: dropping it does not change the relative probabilities of different parameter values.
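
As a small worked example: for a single coin flip that comes up heads, with likelihood \(p(y \mid \theta) = \theta\) and a uniform prior \(p(\theta) = 1\) on \([0, 1]\),

\[p(y) = \int_0^1 \theta \, d\theta = \frac{1}{2}, \qquad p(\theta \mid y) = \frac{\theta \cdot 1}{1/2} = 2\theta.\]

The constant \(1/2\) just rescales the curve so it integrates to one; the most probable value of \(\theta\) is the same either way.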

Bayes’ Rule (Ignoring Normalizing Constants)

The version of Bayes’ rule that matters most for (approximately) 95% of Bayesian statistics:

\[p(\theta | y) \propto p(y | \theta) \times p(\theta)\]

“The posterior is the prior times the likelihood…”

How To Choose A Prior?

One perspective: Priors should reflect “actual knowledge” independent of the analysis (Jaynes, 2003)

Another: Priors are part of the probability model, and can be specified/changed accordingly based on predictive skill (Gelman et al., 2017; Gelman & Shalizi, 2013)

What Makes A Good Prior?

  • Reflects level of understanding (informative vs. weakly informative vs. non-informative).
  • Does not zero out probability of plausible values.
  • Regularization (extreme values should be less probable)

What Makes A Bad Prior?

  • Assigns probability zero to plausible values;
  • Weights implausible values as heavily as more plausible ones;
  • Double counts information (e.g. fitting a prior to data which is also used in the likelihood)
  • Chosen based on vibes.

A Coin Flipping Example

We would like to understand if a coin-flipping game is fair. We’ve observed the following sequence of flips:

flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]

Coin Flipping Likelihood

The data-generating process here is straightforward: we can represent a coin flip with a heads-probability of \(\theta\) as a sample from a Bernoulli distribution,

\[y_i \sim \text{Bernoulli}(\theta).\]

flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
θ_mle = Optim.optimize(θ -> -flip_ll(θ), 0.0, 1.0).minimizer
round(θ_mle, digits=2)
0.89

Coin Flipping Prior

Suppose that we spoke to a friend who knows something about coins, and she tells us that it is extremely difficult to make a passable weighted coin which comes up heads more than 75% of the time.

Coin Flipping Prior

Since \(\theta\) is bounded between 0 and 1, we’ll use a Beta distribution for our prior, specifically \(\text{Beta}(5,5)\).

Code
prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3, tickfontsize=16, guidefontsize=18)
plot!(size=(500, 500))
Figure 3: Beta prior for coin flipping example

Maximum A Posteriori Estimate

Combining using Bayes’ rule lets us calculate the maximum a posteriori (MAP) estimate:

flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
flip_lprior(θ) = logpdf(Beta(5, 5), θ)
flip_lposterior(θ) = flip_ll(θ) + flip_lprior(θ)
θ_map = Optim.optimize(θ -> -flip_lposterior(θ), 0.0, 1.0).minimizer
round(θ_map, digits=2)
0.71

Coin Flipping Posterior Distribution

Code
θ_range = 0:0.01:1
plot(θ_range, exp.(flip_lposterior.(θ_range)), color=:black, label="Posterior", linewidth=3, tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm)
vline!([θ_map], color=:red, label="MAP", linewidth=2)
vline!([θ_mle], color=:blue, label="MLE", linewidth=2)
xlabel!(L"$\theta$")
ylabel!("Posterior Density")
plot!(size=(1000, 450))
Figure 4: Posterior distribution for the coin-flipping example

Bayes and Parametric Uncertainty

Frequentist: Parametric uncertainty is purely the result of sampling variability

Bayesian: Parameters have probabilities based on consistency with data and priors.

Bayesian Updating

  • The posterior is a “compromise” between the prior and the data.
  • The posterior mean is a weighted combination of the data and the prior mean (see the normal-normal formula after this list).
  • The weights depend on the prior and the likelihood variances.
  • More data usually makes the posterior more confident.
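
A concrete conjugate illustration of these points: for \(n\) observations \(y_i \sim \mathcal{N}(\mu, \sigma^2)\) with \(\sigma\) known and a prior \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\), the posterior is

\[\mu \mid y \sim \mathcal{N}\left(\frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2},\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right),\]

so the posterior mean is a precision-weighted average of the prior mean and the sample mean, and the posterior variance shrinks as \(n\) grows.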

Example: Local Sea Level Extremes

San Francisco Tide Gauge Data

Code
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18
)
p2 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="PDF",
    ylabel="",
    yticks=[],
    tickfontsize=16,
    guidefontsize=18
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))
Figure 5: Annual maxima surge data from the San Francisco, CA tide gauge.

Proposed Probability Model

\[ \begin{align*} & y \sim LogNormal(\mu, \sigma) \tag{likelihood}\\ & \left. \begin{aligned} & \mu \sim Normal(0, 1) \\ & \sigma \sim HalfNormal(0, 5) \end{aligned} \right\} \tag{priors} \end{align*} \]

Want to find:

\[p(\mu, \sigma | y) \propto p(y | \mu, \sigma) p(\mu)p(\sigma)\]

Are Our Priors Reasonable?

Hard to tell! Let’s simulate data to see if we get plausible outcomes.

This is called a prior predictive check.

Prior Predictive Check

Code
# sample from priors
μ_sample = rand(Normal(0, 1), 1_000)
σ_sample = rand(truncated(Normal(0, 5), 0, +Inf), 1_000)

# define return periods and compute return levels for the sampled parameters
return_periods = 2:100
return_levels = zeros(1_000, length(return_periods))
for i in 1:1_000
    return_levels[i, :] = quantile.(LogNormal(μ_sample[i], σ_sample[i]), 1 .- (1 ./ return_periods))
end

plt_prior_1 = plot(; yscale=:log10, yticks=10 .^ collect(0:2:16), ylabel="Return Level (m)", xlabel="Return Period (yrs)",
    tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=10mm, left_margin=10mm, legend=:topleft)
for idx in 1:1_000
    label = idx == 1 ? "Prior" : false
    plot!(plt_prior_1, return_periods, return_levels[idx, :]; color=:black, alpha=0.1, label=label)
end
plt_prior_1
Figure 6: Prior predictive check of return periods with the original priors

Let’s Revise the Prior

\[ \begin{align*} & y \sim LogNormal(\mu, \sigma) \tag{likelihood}\\ & \left. \begin{aligned} & \mu \sim Normal(0, 0.5) \\ & \sigma \sim HalfNormal(0, 0.1) \end{aligned} \right\} \tag{priors} \end{align*} \]

Prior Predictive Check 2

Code
# sample from priors
μ_sample = rand(Normal(0, 0.5), 1_000)
σ_sample = rand(truncated(Normal(0, 0.1), 0, +Inf), 1_000)

return_periods = 2:100
return_levels = zeros(1_000, length(return_periods))
for i in 1:1_000
    return_levels[i, :] = quantile.(LogNormal(μ_sample[i], σ_sample[i]), 1 .- (1 ./ return_periods))
end

plt_prior_2 = plot(; ylabel="Return Level (m)", xlabel="Return Period (yrs)", tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=10mm, left_margin=10mm)
for idx in 1:1_000
    label = idx == 1 ? "Prior" : false
    plot!(plt_prior_2, return_periods, return_levels[idx, :]; color=:black, alpha=0.1, label=label)
end
plt_prior_2
Figure 7: Prior predictive check of return periods with revised model

Compute Posterior

Code
ll(μ, σ) = sum(logpdf.(LogNormal(μ, σ), dat_annmax.residual))
lprior(μ, σ) = logpdf(Normal(0, 0.5), μ) + logpdf(truncated(Normal(0, 0.1), 0, Inf), σ)
lposterior(μ, σ) = ll(μ, σ) + lprior(μ, σ)

p_map = optimize(p -> -lposterior(p[1], p[2]), [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]).minimizer

μ = 0.15:0.005:0.35
σ = 0.04:0.01:0.1
posterior_vals = @. lposterior(μ', σ)

contour(μ, σ, posterior_vals, 
    levels=100, 
    clabels=false, 
    cbar=true, lw=1, 
    fill=(true,cgrad(:grays,[0,0.1,1.0])),
    tickfontsize=16,
    guidefontsize=18,
    legendfontsize=18,
    right_margin=20mm,
    bottom_margin=5mm,
    left_margin=5mm
)
scatter!([p_map[1]], [p_map[2]], label="MAP", markersize=10, marker=:star)
xlabel!(L"$\mu$")
ylabel!(L"$\sigma$")
plot!(size=(900, 400))
Figure 8: Posterior samples from surge model.
2-element Vector{Float64}:
 0.2546874075930782
 0.055422136861588964

Assess MAP Fit

Code
p1 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    legend=:false,
    ylabel="PDF",
    xlabel="Annual Max Tide Level (m)",
    tickfontsize=16,
    guidefontsize=18,
    bottom_margin=5mm, left_margin=5mm
)
plot!(p1, LogNormal(p_map[1], p_map[2]),
    linewidth=3,
    color=:red)
xlims!(p1, (1, 1.7))
plot!(p1, size=(600, 450))

return_periods = 2:500
return_levels = quantile.(LogNormal(p_map[1], p_map[2]), 1 .- (1 ./ return_periods))

# function to calculate exceedance probability and plot positions based on data quantile
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y; rev=false) # sorted values of y
    nxp = [r / (N + 1) for r in 1:N] # non-exceedance probabilities (Weibull plotting positions)
    xp = 1 .- nxp # exceedance probabilities
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(return_periods, return_levels, linewidth=3, color=:blue, label="Model Fit", tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=5mm, left_margin=5mm, right_margin=10mm, legend=:bottomright)
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=5)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
Figure 9: Checks of the MAP fit: (a) histogram of annual maxima with the fitted lognormal density; (b) return level curve compared with the observed annual maxima.

What About The Posterior Distribution?

One of the key points of Bayesian statistics is that we get a distribution over parameters, not just a point estimate.

Conjugate Priors

When the mathematical forms of the likelihood and the prior(s) are conjugate, the posterior is a nice closed-form distribution.

Examples:

  • Normal \(p(y | \mu)\), Normal \(p(\mu)\) ⇒ Normal \(p(\mu | y)\)
  • Binomial \(p(y | \theta)\), Beta \(p(\theta)\) ⇒ Beta \(p(\theta | y)\) (see the coin-flip sketch below)
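
For the coin-flipping example above, Beta-Bernoulli conjugacy gives the posterior in closed form; a sketch reusing the `flips` data and \(\text{Beta}(5,5)\) prior from earlier:

using Distributions

# Beta(α, β) prior + k heads out of n flips ⇒ Beta(α + k, β + n - k) posterior
k = sum(flips .== "H")               # 8 heads
n = length(flips)                    # 9 flips
posterior = Beta(5 + k, 5 + (n - k)) # Beta(13, 6)

# closed-form posterior mode matches the numerically-optimized MAP (≈ 0.71)
α, β = params(posterior)
θ_mode = (α - 1) / (α + β - 2)       # 12/17 ≈ 0.71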

But In General…

Conjugate priors are often convenient, but may be poor choices.

We will talk about how to sample more generally with Monte Carlo later.

Key Points and Upcoming Schedule

Key Points: Discrepancy

  • Modeling discrepancy can separate system state-relevant errors from observation errors.
  • Further decomposition of data generating process.
  • When hindcasting, include observation errors; do not when projecting (see the sketch below)!
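
A sketch of how this plays out with the covariance function defined earlier: the observation-error variances only enter \(\Sigma\) when comparing against observations (here `n_proj` is a hypothetical number of projection time steps).

# fitted AR(1) parameters from the optimization above
σ_fit, ρ_fit = θ_discrep[end-1], θ_discrep[end]

# hindcast: comparing to observations, so include both discrepancy and observation error
Σ_hindcast = ar_covariance_mat(σ_fit, ρ_fit, temp_sd)

# projection: simulating the future system state, so only the AR(1) discrepancy applies
Σ_projection = ar_covariance_mat(σ_fit, ρ_fit, zeros(n_proj))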

Key Points: Bayesian Statistics

  • Bayesian probability: parameters have probabilities conditional on data
  • Need to specify prior distribution (think generatively!).
  • Be transparent and principled about prior choices (sensitivity analyses?).
  • Maximum a posteriori gives “most probable” parameter values
  • Will talk more about general sampling later.

Next Class(es)

Monday: Extreme Value Theory and Models

Wednesday: No class! Develop project proposals.

Assessments

Friday: Exercise 1 due by 9pm.

HW2 available! Due 2/23 by 9pm.

References

References

Bayes, T. (1763). An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol., 66, 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19, 555. https://doi.org/10.3390/e19100555
Jaynes, E. T. (2003). Probability Theory: The Logic of Science (G. L. Bretthorst, Ed.). Cambridge, UK; New York, NY: Cambridge University Press.
Laplace, P. S. (1774). Mémoire sur la Probabilité des Causes par les évènemens. In Mémoires de Mathematique et de Physique, Presentés à l’Académie Royale des Sciences, Par Divers Savans & Lus Dans ses Assemblées (pp. 621–656).