Model Assessment


Lecture 17

April 15, 2024

Review of Last Class

Hypothesis Testing

Standard null hypothesis significance testing framework:

  • Frame the problem with a null hypothesis \(\mathcal{H}_0\) and an alternative hypothesis \(\mathcal{H}_1\).
  • Set a significance level \(\alpha\).
  • Find the \(p\)-value \[\mathbb{P}\left(y_{\mathcal{H}_0} \geq y\right),\] where \(y_{\mathcal{H}_0}\) is a draw from the sampling distribution under \(\mathcal{H}_0\) and \(y\) is the observed value.

What is a \(p\)-value?

The \(p\)-value is the probability that, assuming the null hypothesis is true, you would observe results at least as extreme as those actually observed.

Code
using Distributions, Plots, Plots.PlotMeasures, LaTeXStrings  # packages assumed for this figure

test_dist = Normal(0, 3)
x = -10:0.01:10
plot(x, pdf.(test_dist, x), linewidth=3, legend=false, color=:black, xticks=false,  yticks=false, ylabel="Probability", xlabel=L"y", bottom_margin=10mm, left_margin=10mm, guidefontsize=16)
vline!([5], linestyle=:dash, color=:purple, linewidth=3, )
areaplot!(5:0.01:10, pdf.(test_dist, 5:0.01:10), color=:green, alpha=0.4)
quiver!([-4.5], [0.095], quiver=([1], [-0.02]), color=:black, linewidth=2)
annotate!([-5], [0.11], text("Null\nSampling\nDistribution", color=:black))
quiver!([6.5], [0.03], quiver=([-1], [-0.015]), color=:green, linewidth=2)
annotate!([6.85], [0.035], text("p-value", :green))
quiver!([3.5], [0.02], quiver=([1.5], [0]), color=:purple, linewidth=2)
annotate!([0.9], [0.02], text("Observation", :purple))
plot!(size=(650, 500))
Figure 1: Illustration of a p-value
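
The shaded area in Figure 1 can also be computed directly from the null sampling distribution. A minimal sketch, reusing the hypothetical Normal(0, 3) null and observed value of 5 from the figure code:

Code
using Distributions

test_dist = Normal(0, 3)             # null sampling distribution from the figure
y_obs = 5                            # hypothetical observed test statistic
p_value = 1 - cdf(test_dist, y_obs)  # one-sided p-value: P(Y ≥ y_obs | H₀) ≈ 0.048

With \(\alpha = 0.05\), this hypothetical observation would just barely count as “statistically significant.”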

Rejecting the Null

If \(p\text{-value} < \alpha\), “reject” the null hypothesis in favor of the alternative and say that the effect is “statistically significant.”

Otherwise, do not reject the null.

The goal is to strike a balance between Type I and Type II errors.

Error Types

| Decision About Null Hypothesis | Null Hypothesis Is True                                | Null Hypothesis Is False                               |
|--------------------------------|--------------------------------------------------------|--------------------------------------------------------|
| Don’t reject                   | True negative (probability \(1-\alpha\))               | Type II error (false negative, probability \(\beta\))  |
| Reject                         | Type I error (false positive, probability \(\alpha\))  | True positive (probability \(1-\beta\))                |
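
The error probabilities in the table above can also be estimated by Monte Carlo simulation. A minimal sketch, assuming a hypothetical one-sided z-test of a mean (the sample size, effect size, and test are illustrative, not from the lecture):

Code
using Distributions, Statistics, Random

Random.seed!(1)
α = 0.05
n = 30                                  # hypothetical sample size
effect = 0.5                            # hypothetical true effect under the alternative
crit = quantile(Normal(), 1 - α)        # critical value for a one-sided z-test

# reject H₀ if the standardized sample mean exceeds the critical value
reject(μ_true) = mean(rand(Normal(μ_true, 1), n)) * sqrt(n) > crit

type_I_rate  = mean(reject(0.0) for _ in 1:10_000)      # should be ≈ α
type_II_rate = mean(!reject(effect) for _ in 1:10_000)  # ≈ β for this effect size and n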

But What Is Statistical Significance?

A “statistically significant” result doesn’t mean:

  • That the null is “wrong”;
  • That the alternative is a better descriptor of the data-generating process;
  • That the effect size of the hypothesized mechanism is practically “significant”.

What a \(p\)-value is Not

  1. Probability that the null hypothesis is true;
  2. Probability that the effect was produced by chance alone;
  3. An indication of the effect size.

How Might We Do Better?

  • Consideration of multiple plausible (possibly more nuanced) hypotheses.
  • Assessment/quantification of evidence consistent with different hypotheses.
  • Insight into the effect size.

Model Assessment

Fundamental Data Analysis Challenge

Goal (often): Explain data and/or make predictions about unobserved data.

Challenges: Environmental systems are:

  • high-dimensional
  • multi-scale
  • nonlinear
  • subject to many uncertainties

Multiplicities of Models

In general, we are in an \(\mathcal{M}\)-open setting: no model is the “true” data-generating model, so we want to pick a model which performs well enough for the intended purpose.

The contrast to this is \(\mathcal{M}\)-closed, in which one of the models under consideration is the “true” data-generating model, and we would like to recover it.

What Is Any Statistical Test Doing?

If we think about what a test like Mann-Kendall is doing:

  1. Assume the null hypothesis \(H_0\);
  2. Obtain the sampling distribution of a test statistic \(S\) which captures the property of interest under \(H_0\);
  3. Compute the test statistic \(\hat{S}\) on the data.
  4. Calculate the probability of a value of \(S\) at least as extreme as \(\hat{S}\) (the \(p\)-value).

None of this requires a NHST framework!

Simulation for Statistical Testing

Instead, if we have a model which permits simulation (see the sketch after this list):

  1. Calibrate models under different assumptions (e.g. stationary vs. nonstationary based on different covariates);
  2. Simulate realizations from those models;
  3. Compute the distribution of the relevant statistic \(S\) from these realizations;
  4. Assess which distribution is most consistent with the observed quantity.
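
A minimal sketch of this workflow, with hypothetical stand-in models (a stationary Normal model vs. one with a linear trend in the mean) and a simple trend statistic; the real analysis below uses calibrated GEV models instead:

Code
using Distributions, Statistics, Random

Random.seed!(1)
T = 50                                           # length of the hypothetical record
trend_stat(y) = cor(collect(1:length(y)), y)     # simple trend statistic: correlation with time

# steps 1-2: simulate realizations from each calibrated model (stand-ins here)
sim_stationary() = rand(Normal(0, 1), T)
sim_trend() = [rand(Normal(0.02 * t, 1)) for t in 1:T]

# step 3: distribution of the statistic under each model
S_stationary = [trend_stat(sim_stationary()) for _ in 1:5_000]
S_trend      = [trend_stat(sim_trend()) for _ in 1:5_000]

# step 4: compare against the statistic computed from the observed series
y_obs = sim_trend()                              # stand-in for the real observations
Ŝ = trend_stat(y_obs)
(frac_stationary = mean(S_stationary .≥ Ŝ), frac_trend = mean(S_trend .≥ Ŝ))

The model whose simulated statistics place \(\hat{S}\) near the bulk of their distribution, rather than in a tail, is more consistent with the observed behavior.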

Model Assessment Criteria

How do we assess models?

Two general categories:

  1. How well do we explain the data?
  2. How well do we predict new data?

Explanatory Criteria

Generally based on the error (RMSE, MAE) or probability of the data \(p(y | M)\).

To select a model:

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} p(y | M_i)\]
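
A minimal sketch of this selection rule, with synthetic data and two stand-in models (the surge example below applies the same idea with calibrated GEV models):

Code
using Distributions, Random

Random.seed!(1)
y = 0.3 .+ 1.2 .* randn(100)           # synthetic data for illustration

# two candidate models: a fixed standard normal vs. a Normal fit by MLE
model_names = ["fixed N(0, 1)", "fitted Normal"]
model_dists = [Normal(0, 1), fit_mle(Normal, y)]

# log-probability of the data under each model: log p(y | Mᵢ)
log_probs = [sum(logpdf.(d, y)) for d in model_dists]

best_model = model_names[argmax(log_probs)]   # model maximizing p(y | Mᵢ)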

Example: Are Storm Surges Nonstationary?

SF Tide Gauge Data

Code
# load SF tide gauge data
using CSV, DataFrames, DataFramesMeta, Dates, Statistics, Plots, Plots.PlotMeasures  # packages assumed loaded

# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(DataFrames.transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (mm)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)
Figure 2: Annual maxima surge data from the San Francisco, CA tide gauge.

Models Under Consideration

Three hypotheses (there could be many more!):

  1. Stationary (“null”) model, \(y_t \sim \text{GEV}(\mu, \sigma, \xi);\)
  2. Time nonstationary (“null-ish”) model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 t, \sigma, \xi);\)
  3. PDO nonstationary model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 \text{PDO}_t, \sigma, \xi)\)

Stationary Model

using Turing, Distributions  # Distributions provides GeneralizedExtremeValue

@model function sf_stat(y)
    μ ~ truncated(Normal(1500, 200), lower=0)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    for i in 1:length(y)
        y[i] ~ GeneralizedExtremeValue(μ, σ, ξ)
    end
end

stat_chain = sample(sf_stat(dat_annmax.residual), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(stat_chain)

Stationary Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           μ   1259.4746    5.7075    0.0322   31390.4794   28777.5313    1.00 ⋯
           σ     58.7560    4.3683    0.0284   23773.8975   26901.9065    1.00 ⋯
           ξ      0.0224    0.0639    0.0004   22280.1320   23863.3699    1.00 ⋯
                                                               2 columns omitted

Nonstationary (Time) Model

@model function sf_nonstat(y)
    a ~ truncated(Normal(1500, 200), lower=0)
    b ~ Normal(0, 5)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    T = length(y)
    for i in 1:T
        y[i] ~ GeneralizedExtremeValue(a + b * i, σ, ξ)
    end
end

nonstat_chain = sample(sf_nonstat(dat_annmax.residual), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(nonstat_chain)

Nonstationary (Time) Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           a   1233.2116    9.9739    0.0751   17727.8760   19471.6475    1.00 ⋯
           b      0.4061    0.1276    0.0009   18531.5660   20869.7920    1.00 ⋯
           σ     55.0806    4.4145    0.0278   25122.5084   22639.0411    1.00 ⋯
           ξ      0.0714    0.0759    0.0005   23417.8043   21499.4490    1.00 ⋯
                                                               2 columns omitted

Nonstationary (PDO) Model

@model function sf_pdo(y, pdo)
    a ~ truncated(Normal(1500, 200), lower=0)
    b ~ Normal(0, 5)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    for i in 1:length(y)
        y[i] ~ GeneralizedExtremeValue(a + b * pdo[i], σ, ξ)
    end
end

# `pdo` is a DataFrame with one PDO index value per year of the surge record, loaded separately (not shown here)
pdo_chain = sample(sf_pdo(dat_annmax.residual, pdo.PDO), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(pdo_chain)

Nonstationary (PDO) Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           a   1258.3519    5.7546    0.0311   34292.0031   30986.1245    1.00 ⋯
           b     -5.2030    3.8235    0.0219   30416.6923   28764.2555    1.00 ⋯
           σ     57.9180    4.3708    0.0265   27268.6473   28192.5174    1.00 ⋯
           ξ      0.0306    0.0644    0.0004   28244.0215   25382.5032    1.00 ⋯
                                                               2 columns omitted

Model Simulations

Code
# make predictions from each model
stat_sim = predict(sf_stat(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual))), stat_chain)
nonstat_sim = predict(sf_nonstat(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual))), nonstat_chain)
pdo_sim = predict(sf_pdo(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual)), pdo.PDO), pdo_chain)

# get quantiles
stat_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), stat_sim.value.data[:, :, 1], dims=1)
nonstat_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), nonstat_sim.value.data[:, :, 1], dims=1)
pdo_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), pdo_sim.value.data[:, :, 1], dims=1)

# plot
p = plot(; xlabel="Year", ylabel="Storm Tide Annual Maximum (mm)", left_margin=5mm, bottom_margin=5mm, right_margin=5mm, tickfontsize=16, guidefontsize=18, legendfontsize=16, legend=:outerright)
plot!(p, dat_annmax.Year, stat_q[2, :], ribbon=(stat_q[2, :] - stat_q[1, :], stat_q[3, :] - stat_q[2, :]), color=:sienna, label="Stationary", fillalpha=0.2, linewidth=3)
plot!(p, dat_annmax.Year, nonstat_q[2, :], ribbon=(nonstat_q[2, :] - nonstat_q[1, :], nonstat_q[3, :] - nonstat_q[2, :]), color=:red, label="Nonstationary (Time)", fillalpha=0.2, linewidth=3)
plot!(p, dat_annmax.Year, pdo_q[2, :], ribbon=(pdo_q[2, :] - pdo_q[1, :], pdo_q[3, :] - pdo_q[2, :]), color=:teal, label="Nonstationary (PDO)", fillalpha=0.2, linewidth=3)
scatter!(p, dat_annmax.Year, dat_annmax.residual, color=:black, markersize=5, label="Observations", alpha=0.5)
Figure 3: Model simulations for the three storm surge models.

Point Estimates vs. Posteriors

Can calculate these metrics using some point estimate \(\hat{\theta}\) (MLE, MAP, etc.):

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} p(y | \hat{\theta}, M_i)\]

or, if we have it, by integrating over the posterior distribution of \(\theta\):

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} \int_{\theta \in \Theta} p(y | \theta, M_i) \, p(\theta | y, M_i) \, d\theta\]

What do we think the differences are?
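
A rough sketch of the difference using the stationary surge model above (reusing `stat_chain` and `dat_annmax` from earlier; the point estimate here is the posterior mean, and the integral is approximated by a Monte Carlo average of the likelihood over posterior draws):

Code
# point estimate: plug a single θ̂ into the likelihood
μ̂, σ̂, ξ̂ = mean(stat_chain[:μ]), mean(stat_chain[:σ]), mean(stat_chain[:ξ])
logp_point = sum(logpdf.(GeneralizedExtremeValue(μ̂, σ̂, ξ̂), dat_annmax.residual))

# posterior draws: evaluate the data likelihood at each sampled θ, then average
μs, σs, ξs = vec(Array(stat_chain[:μ])), vec(Array(stat_chain[:σ])), vec(Array(stat_chain[:ξ]))
logp_draws = [sum(logpdf.(GeneralizedExtremeValue(μs[j], σs[j], ξs[j]), dat_annmax.residual))
              for j in eachindex(μs)]
m = maximum(logp_draws)
logp_avg = m + log(mean(exp.(logp_draws .- m)))   # log of the average likelihood over draws

Comparing `logp_point` and `logp_avg` shows how much a model's apparent fit depends on ignoring parameter uncertainty.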

Relative Evidence for Models

Simplest approach: What is the log-posterior probability of the data \(\log p(y | M_i)\)?

Can convert to a relative probability:

\[\mathbb{P}(M_i | \mathcal{M}) = \frac{p(y | M_i)}{\sum_j p(y | M_j)},\]

assuming all models in \(\mathcal{M}\) are a priori equally plausible.

Relative Evidence for Models

Code
# mean log-posterior density of each model's chain as a (rough) measure of fit
model_evidence = DataFrame(Model=["Stationary", "Time", "PDO"], LogPost=[mean(stat_chain[:lp]), mean(nonstat_chain[:lp]), mean(pdo_chain[:lp])])
# normalize to relative model probabilities; subtract the maximum before
# exponentiating so the very negative log-probabilities don't underflow
rel = exp.(model_evidence.LogPost .- maximum(model_evidence.LogPost))
model_evidence.ModelProb = round.(rel ./ sum(rel); digits=2)
model_evidence.LogPost = round.(model_evidence.LogPost; digits=0)
model_evidence
3×3 DataFrame
 Row │ Model       LogPost   ModelProb
     │ String      Float64   Float64
─────┼─────────────────────────────────
   1 │ Stationary   -711.0        0.17
   2 │ Time         -710.0        0.81
   3 │ PDO          -714.0        0.02

Predictive Evaluation Criteria

Problems with Explanatory Criteria

  • Risk of overfitting
  • What if historical dynamics are similar, but out-of-sample predictions are distinct?

The alternative is predictive performance: how probable is unobserved data?

Cross-Validation

Gold standard of predictive criteria: hold out some data, fit the model on the remaining data, and see how well it predicts the held-out data (a simple sketch follows the list below).

But:

  • What to do for data with structure, e.g. temporal or spatial data?
  • Also can be computationally expensive.
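
A minimal sketch of a time-ordered holdout, reusing `dat_annmax`; the 20% split and the stand-in Normal model are illustrative (a proper version would refit the GEV models above to the training years):

Code
using Distributions, Statistics

# hold out the most recent 20% of annual maxima as a pseudo-"future" test set
n_test = round(Int, 0.2 * nrow(dat_annmax))
y_train = dat_annmax.residual[1:(end - n_test)]
y_test  = dat_annmax.residual[(end - n_test + 1):end]

# fit a stand-in model on the training years and score the held-out years
fitted = fit_mle(Normal, y_train)
heldout_logp = sum(logpdf.(fitted, y_test))   # predictive score on unseen data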

Posterior Predictive Distributions

Next class: will build up approximations to cross-validation. Idea: Consider a new realization \(y^{\text{rep}}\) simulated from

\[p(y^{\text{rep}} | y) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta | y) d\theta.\]

Can sample:

\[p(\theta | y) \xrightarrow{\hat{\theta}} \mathcal{M}(\hat{\theta}) \rightarrow y^{\text{rep}}\]
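
This two-step recipe is what the `predict` calls in the simulation code above implement; a minimal manual sketch for the stationary model (reusing `stat_chain` and `dat_annmax`):

Code
# (1) draw θ from the posterior samples; (2) simulate a replicate series from the model at θ
μs, σs, ξs = vec(Array(stat_chain[:μ])), vec(Array(stat_chain[:σ])), vec(Array(stat_chain[:ξ]))
n_years = length(dat_annmax.residual)
y_rep = [rand(GeneralizedExtremeValue(μs[j], σs[j], ξs[j]), n_years)
         for j in rand(1:length(μs), 1_000)]   # 1,000 posterior predictive replicates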

Key Points

Key Points

  • Simulation lets us generalize hypothesis testing.
  • Advantage: Can look at probabilities of multiple candidate models, not just “rejecting” a null hypothesis.
  • Bayesian updating applies to these models: new data updates the relative evidence for each candidate model.
  • Explanatory vs. Predictive model evaluation.

Upcoming Schedule

Next Classes

Wednesday: Cross-Validation and Information Criteria

Assessments

Homework 4: Released End of Week, Due 5/3 (Last One!)