Model Assessment


Lecture 17

April 15, 2024

Review of Last Class

Hypothesis Testing

Standard null hypothesis significance testing framework:

  • Frame the problem with a null hypothesis \(\mathcal{H}_0\) and an alternative hypothesis \(\mathcal{H}_1\).
  • Set a significance level \(\alpha\).
  • Find the \(p\)-value \[\mathbb{P}\left(y_{\mathcal{H}_0} \geq y\right),\] where \(y_{\mathcal{H}_0}\) is a draw from the sampling distribution under \(\mathcal{H}_0\) and \(y\) is the observed value.

What is a \(p\)-value?

The \(p\)-value is the probability that, assuming the null hypothesis is true, you would observe results at least as extreme as those actually observed.

Code
using Distributions, Plots, Plots.PlotMeasures, LaTeXStrings  # packages assumed for this figure

test_dist = Normal(0, 3)
x = -10:0.01:10
plot(x, pdf.(test_dist, x), linewidth=3, legend=false, color=:black, xticks=false,  yticks=false, ylabel="Probability", xlabel=L"y", bottom_margin=10mm, left_margin=10mm, guidefontsize=16)
vline!([5], linestyle=:dash, color=:purple, linewidth=3, )
areaplot!(5:0.01:10, pdf.(test_dist, 5:0.01:10), color=:green, alpha=0.4)
quiver!([-4.5], [0.095], quiver=([1], [-0.02]), color=:black, linewidth=2)
annotate!([-5], [0.11], text("Null\nSampling\nDistribution", color=:black))
quiver!([6.5], [0.03], quiver=([-1], [-0.015]), color=:green, linewidth=2)
annotate!([6.85], [0.035], text("p-value", :green))
quiver!([3.5], [0.02], quiver=([1.5], [0]), color=:purple, linewidth=2)
annotate!([0.9], [0.02], text("Observation", :purple))
plot!(size=(650, 500))
Figure 1: Illustration of a p-value
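
The shaded area in Figure 1 can also be computed directly from the null sampling distribution. A minimal sketch, reusing the hypothetical Normal(0, 3) null and observed value of 5 from the figure code:

Code
using Distributions

test_dist = Normal(0, 3)             # null sampling distribution from the figure
y_obs = 5                            # hypothetical observed test statistic
p_value = 1 - cdf(test_dist, y_obs)  # one-sided p-value: P(Y ≥ y_obs | H₀) ≈ 0.048

With \(\alpha = 0.05\), this hypothetical observation would just barely count as “statistically significant.”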

Rejecting the Null

If \(p\text{-value} < \alpha\), “reject” the null hypothesis in favor of the alternative and say that the effect is “statistically significant.”

Otherwise, do not reject the null.

The goal is to strike a balance between Type I and Type II errors.

Error Types

| Decision About Null Hypothesis | Null Hypothesis Is True                                | Null Hypothesis Is False                               |
|--------------------------------|--------------------------------------------------------|--------------------------------------------------------|
| Don’t reject                   | True negative (probability \(1-\alpha\))               | Type II error (false negative, probability \(\beta\))  |
| Reject                         | Type I error (false positive, probability \(\alpha\))  | True positive (probability \(1-\beta\))                |
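
The error probabilities in the table above can also be estimated by Monte Carlo simulation. A minimal sketch, assuming a hypothetical one-sided z-test of a mean (the sample size, effect size, and test are illustrative, not from the lecture):

Code
using Distributions, Statistics, Random

Random.seed!(1)
α = 0.05
n = 30                                  # hypothetical sample size
effect = 0.5                            # hypothetical true effect under the alternative
crit = quantile(Normal(), 1 - α)        # critical value for a one-sided z-test

# reject H₀ if the standardized sample mean exceeds the critical value
reject(μ_true) = mean(rand(Normal(μ_true, 1), n)) * sqrt(n) > crit

type_I_rate  = mean(reject(0.0) for _ in 1:10_000)      # should be ≈ α
type_II_rate = mean(!reject(effect) for _ in 1:10_000)  # ≈ β for this effect size and n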

But What Is Statistical Significance?

A “statistically significant” result doesn’t mean:

  • That the null is “wrong”;
  • That the alternative is a better descriptor of the data-generating process;
  • That the effect size of the hypothesized mechanism is practically “significant”.

What a \(p\)-value is Not

  1. Probability that the null hypothesis is true;
  2. Probability that the effect was produced by chance alone;
  3. An indication of the effect size.

How Might We Do Better?

  • Consideration of multiple plausible (possibly more nuanced) hypotheses.
  • Assessment/quantification of evidence consistent with different hypotheses.
  • Insight into the effect size.

Model Assessment

Fundamental Data Analysis Challenge

Goal (often): Explain data and/or make predictions about unobserved data.

Challenges: Environmental systems are:

  • high-dimensional
  • multi-scale
  • nonlinear
  • subject to many uncertainties

Multiplicities of Models

In general, we are in an \(\mathcal{M}\)-open setting: no model is the “true” data-generating model, so we want to pick a model which performs well enough for the intended purpose.

The contrast to this is \(\mathcal{M}\)-closed, in which one of the models under consideration is the “true” data-generating model, and we would like to recover it.

What Is Any Statistical Test Doing?

If we think about what a test like Mann-Kendall is doing:

  1. Assume the null hypothesis \(H_0\);
  2. Obtain the sampling distribution of a test statistic \(S\) which captures the property of interest under \(H_0\);
  3. Compute the test statistic \(\hat{S}\) on the data.
  4. Calculate the probability of a value of \(S\) at least as extreme as \(\hat{S}\) (the \(p\)-value).

None of this requires a NHST framework!

Simulation for Statistical Testing

Instead, if we have a model which permits simulation (see the sketch after this list):

  1. Calibrate models under different assumptions (e.g. stationary vs. nonstationary based on different covariates);
  2. Simulate realizations from those models;
  3. Compute the distribution of the relevant statistic \(S\) from these realizations;
  4. Assess which distribution is most consistent with the observed quantity.
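
A minimal sketch of this workflow, with hypothetical stand-in models (a stationary Normal model vs. one with a linear trend in the mean) and a simple trend statistic; the real analysis below uses calibrated GEV models instead:

Code
using Distributions, Statistics, Random

Random.seed!(1)
T = 50                                           # length of the hypothetical record
trend_stat(y) = cor(collect(1:length(y)), y)     # simple trend statistic: correlation with time

# steps 1-2: simulate realizations from each calibrated model (stand-ins here)
sim_stationary() = rand(Normal(0, 1), T)
sim_trend() = [rand(Normal(0.02 * t, 1)) for t in 1:T]

# step 3: distribution of the statistic under each model
S_stationary = [trend_stat(sim_stationary()) for _ in 1:5_000]
S_trend      = [trend_stat(sim_trend()) for _ in 1:5_000]

# step 4: compare against the statistic computed from the observed series
y_obs = sim_trend()                              # stand-in for the real observations
Ŝ = trend_stat(y_obs)
(frac_stationary = mean(S_stationary .≥ Ŝ), frac_trend = mean(S_trend .≥ Ŝ))

The model whose simulated statistics place \(\hat{S}\) near the bulk of their distribution, rather than in a tail, is more consistent with the observed behavior.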

Model Assessment Criteria

How do we assess models?

Two general categories:

  1. How well do we explain the data?
  2. How well do we predict new data?

Explanatory Criteria

Generally based on the error (RMSE, MAE) or probability of the data \(p(y | M)\).

To select a model:

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} p(y | M_i)\]
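
A minimal sketch of this selection rule, with synthetic data and two stand-in models (the surge example below applies the same idea with calibrated GEV models):

Code
using Distributions, Random

Random.seed!(1)
y = 0.3 .+ 1.2 .* randn(100)           # synthetic data for illustration

# two candidate models: a fixed standard normal vs. a Normal fit by MLE
model_names = ["fixed N(0, 1)", "fitted Normal"]
model_dists = [Normal(0, 1), fit_mle(Normal, y)]

# log-probability of the data under each model: log p(y | Mᵢ)
log_probs = [sum(logpdf.(d, y)) for d in model_dists]

best_model = model_names[argmax(log_probs)]   # model maximizing p(y | Mᵢ)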

Example: Are Storm Surges Nonstationary?

SF Tide Gauge Data

Code
# load SF tide gauge data
using CSV, DataFrames, DataFramesMeta, Dates, Statistics, Plots, Plots.PlotMeasures  # packages assumed loaded

# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(DataFrames.transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (mm)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)
Figure 2: Annual maxima surge data from the San Francisco, CA tide gauge.

Models Under Consideration

Three hypotheses (there could be many more!):

  1. Stationary (“null”) model, \(y_t \sim \text{GEV}(\mu, \sigma, \xi);\)
  2. Time nonstationary (“null-ish”) model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 t, \sigma, \xi);\)
  3. PDO nonstationary model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 \text{PDO}_t, \sigma, \xi)\)

Stationary Model

using Turing, Distributions  # Distributions provides GeneralizedExtremeValue

@model function sf_stat(y)
    μ ~ truncated(Normal(1500, 200), lower=0)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    for i in 1:length(y)
        y[i] ~ GeneralizedExtremeValue(μ, σ, ξ)
    end
end

stat_chain = sample(sf_stat(dat_annmax.residual), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(stat_chain)

Stationary Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           μ   1259.4746    5.7075    0.0322   31390.4794   28777.5313    1.00 ⋯
           σ     58.7560    4.3683    0.0284   23773.8975   26901.9065    1.00 ⋯
           ξ      0.0224    0.0639    0.0004   22280.1320   23863.3699    1.00 ⋯
                                                               2 columns omitted

Nonstationary (Time) Model

@model function sf_nonstat(y)
    a ~ truncated(Normal(1500, 200), lower=0)
    b ~ Normal(0, 5)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    T = length(y)
    for i in 1:T
        y[i] ~ GeneralizedExtremeValue(a + b * i, σ, ξ)
    end
end

nonstat_chain = sample(sf_nonstat(dat_annmax.residual), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(nonstat_chain)

Nonstationary (Time) Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           a   1233.2116    9.9739    0.0751   17727.8760   19471.6475    1.00 ⋯
           b      0.4061    0.1276    0.0009   18531.5660   20869.7920    1.00 ⋯
           σ     55.0806    4.4145    0.0278   25122.5084   22639.0411    1.00 ⋯
           ξ      0.0714    0.0759    0.0005   23417.8043   21499.4490    1.00 ⋯
                                                               2 columns omitted

Nonstationary (PDO) Model

@model function sf_pdo(y, pdo)
    a ~ truncated(Normal(1500, 200), lower=0)
    b ~ Normal(0, 5)
    σ ~ truncated(Normal(100, 25), lower=0)
    ξ ~ Normal(0, 1)

    for i in 1:length(y)
        y[i] ~ GeneralizedExtremeValue(a + b * pdo[i], σ, ξ)
    end
end

# `pdo` is a DataFrame with one PDO index value per year of the surge record, loaded separately (not shown here)
pdo_chain = sample(sf_pdo(dat_annmax.residual, pdo.PDO), NUTS(), MCMCThreads(), 10_000, 4)
summarystats(pdo_chain)

Nonstationary (PDO) Model

Summary Statistics
  parameters        mean       std      mcse     ess_bulk     ess_tail    rhat
      Symbol     Float64   Float64   Float64      Float64      Float64   Float64

           a   1258.3519    5.7546    0.0311   34292.0031   30986.1245    1.00 ⋯
           b     -5.2030    3.8235    0.0219   30416.6923   28764.2555    1.00 ⋯
           σ     57.9180    4.3708    0.0265   27268.6473   28192.5174    1.00 ⋯
           ξ      0.0306    0.0644    0.0004   28244.0215   25382.5032    1.00 ⋯
                                                               2 columns omitted

Model Simulations

Code
# make predictions from each model
stat_sim = predict(sf_stat(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual))), stat_chain)
nonstat_sim = predict(sf_nonstat(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual))), nonstat_chain)
pdo_sim = predict(sf_pdo(Vector{Union{Missing, Float64}}(undef, length(dat_annmax.residual)), pdo.PDO), pdo_chain)

# get quantiles
stat_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), stat_sim.value.data[:, :, 1], dims=1)
nonstat_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), nonstat_sim.value.data[:, :, 1], dims=1)
pdo_q = mapslices(x -> quantile(x, [0.05, 0.5, 0.95]), pdo_sim.value.data[:, :, 1], dims=1)

# plot
p = plot(; xlabel="Year", ylabel="Storm Tide Annual Maximum (mm)", left_margin=5mm, bottom_margin=5mm, right_margin=5mm, tickfontsize=16, guidefontsize=18, legendfontsize=16, legend=:outerright)
plot!(p, dat_annmax.Year, stat_q[2, :], ribbon=(stat_q[2, :] - stat_q[1, :], stat_q[3, :] - stat_q[2, :]), color=:sienna, label="Stationary", fillalpha=0.2, linewidth=3)
plot!(p, dat_annmax.Year, nonstat_q[2, :], ribbon=(nonstat_q[2, :] - nonstat_q[1, :], nonstat_q[3, :] - nonstat_q[2, :]), color=:red, label="Nonstationary (Time)", fillalpha=0.2, linewidth=3)
plot!(p, dat_annmax.Year, pdo_q[2, :], ribbon=(pdo_q[2, :] - pdo_q[1, :], pdo_q[3, :] - pdo_q[2, :]), color=:teal, label="Nonstationary (PDO)", fillalpha=0.2, linewidth=3)
scatter!(p, dat_annmax.Year, dat_annmax.residual, color=:black, markersize=5, label="Observations", alpha=0.5)
Figure 3: Model simulations for the three storm surge models.

Point Estimates vs. Posteriors

Can calculate these metrics using some point estimate \(\hat{\theta}\) (MLE, MAP, etc.):

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} p(y | \hat{\theta}, M_i)\]

or, if we have it, by integrating over the posterior distribution of \(\theta\):

\[\underset{M_i \in \mathcal{M}}{\operatorname{argmax}} \int_{\theta \in \Theta} p(y | \theta, M_i) \, p(\theta | y, M_i) \, d\theta\]

What do we think the differences are?
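
A rough sketch of the difference using the stationary surge model above (reusing `stat_chain` and `dat_annmax` from earlier; the point estimate here is the posterior mean, and the integral is approximated by a Monte Carlo average of the likelihood over posterior draws):

Code
# point estimate: plug a single θ̂ into the likelihood
μ̂, σ̂, ξ̂ = mean(stat_chain[:μ]), mean(stat_chain[:σ]), mean(stat_chain[:ξ])
logp_point = sum(logpdf.(GeneralizedExtremeValue(μ̂, σ̂, ξ̂), dat_annmax.residual))

# posterior draws: evaluate the data likelihood at each sampled θ, then average
μs, σs, ξs = vec(Array(stat_chain[:μ])), vec(Array(stat_chain[:σ])), vec(Array(stat_chain[:ξ]))
logp_draws = [sum(logpdf.(GeneralizedExtremeValue(μs[j], σs[j], ξs[j]), dat_annmax.residual))
              for j in eachindex(μs)]
m = maximum(logp_draws)
logp_avg = m + log(mean(exp.(logp_draws .- m)))   # log of the average likelihood over draws

Comparing `logp_point` and `logp_avg` shows how much a model's apparent fit depends on ignoring parameter uncertainty.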

Relative Evidence for Models

Simplest approach: What is the log-posterior probability of the data \(\log p(y | M_i)\)?

Can convert to a relative probability:

\[\mathbb{P}(M_i | \mathcal{M}) = \frac{p(y | M_i)}{\sum_j p(y | M_j)},\]

assuming all models in \(\mathcal{M}\) are a priori equally plausible.

Relative Evidence for Models

Code
# mean log-posterior density of each model's chain as a (rough) measure of fit
model_evidence = DataFrame(Model=["Stationary", "Time", "PDO"], LogPost=[mean(stat_chain[:lp]), mean(nonstat_chain[:lp]), mean(pdo_chain[:lp])])
# normalize to relative model probabilities; subtract the maximum before
# exponentiating so the very negative log-probabilities don't underflow
rel = exp.(model_evidence.LogPost .- maximum(model_evidence.LogPost))
model_evidence.ModelProb = round.(rel ./ sum(rel); digits=2)
model_evidence.LogPost = round.(model_evidence.LogPost; digits=0)
model_evidence
3×3 DataFrame
 Row │ Model       LogPost   ModelProb
     │ String      Float64   Float64
─────┼─────────────────────────────────
   1 │ Stationary   -711.0        0.17
   2 │ Time         -710.0        0.81
   3 │ PDO          -714.0        0.02

Predictive Evaluation Criteria

Problems with Explanatory Criteria

  • Risk of overfitting
  • What if historical dynamics are similar, but out-of-sample predictions are distinct?

The alternative is predictive performance: how probable is unobserved data?

Cross-Validation

Gold standard of predictive criteria: hold out some data, fit the model on the remaining data, and see how well it predicts the held-out data (a simple sketch follows the list below).

But:

  • What to do for data with structure, e.g. temporal or spatial data?
  • Also can be computationally expensive.
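
A minimal sketch of a time-ordered holdout, reusing `dat_annmax`; the 20% split and the stand-in Normal model are illustrative (a proper version would refit the GEV models above to the training years):

Code
using Distributions, Statistics

# hold out the most recent 20% of annual maxima as a pseudo-"future" test set
n_test = round(Int, 0.2 * nrow(dat_annmax))
y_train = dat_annmax.residual[1:(end - n_test)]
y_test  = dat_annmax.residual[(end - n_test + 1):end]

# fit a stand-in model on the training years and score the held-out years
fitted = fit_mle(Normal, y_train)
heldout_logp = sum(logpdf.(fitted, y_test))   # predictive score on unseen data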

Posterior Predictive Distributions

Next class: will build up approximations to cross-validation. Idea: Consider a new realization \(y^{\text{rep}}\) simulated from

\[p(y^{\text{rep}} | y) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta | y) d\theta.\]

Can sample:

\[p(\theta | y) \xrightarrow{\hat{\theta}} \mathcal{M}(\hat{\theta}) \rightarrow y^{\text{rep}}\]
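
This two-step recipe is what the `predict` calls in the simulation code above implement; a minimal manual sketch for the stationary model (reusing `stat_chain` and `dat_annmax`):

Code
# (1) draw θ from the posterior samples; (2) simulate a replicate series from the model at θ
μs, σs, ξs = vec(Array(stat_chain[:μ])), vec(Array(stat_chain[:σ])), vec(Array(stat_chain[:ξ]))
n_years = length(dat_annmax.residual)
y_rep = [rand(GeneralizedExtremeValue(μs[j], σs[j], ξs[j]), n_years)
         for j in rand(1:length(μs), 1_000)]   # 1,000 posterior predictive replicates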

Key Points

Key Points

  • Simulation lets us generalize hypothesis testing.
  • Advantage: Can look at probabilities of multiple candidate models, not just “rejecting” a null hypothesis.
  • Bayesian updating applies to these models: new data updates the relative evidence for each candidate model.
  • Explanatory vs. Predictive model evaluation.

Upcoming Schedule

Next Classes

Wednesday: Cross-Validation and Information Criteria

Assessments

Homework 4: Released End of Week, Due 5/3 (Last One!)