Predictive Model Assessment


Lecture 18

April 17, 2024

Review of Last Class

Simulation as Hypothesis Testing

  • Define models for hypotheses (including baseline/null).
  • Fit models to data.
  • Simulate alternative datasets.
  • Compare output/test statistics to data.
  • Maybe: Convert metrics to probabilities (condition on \(\mathcal{M}\)).

Metrics for Model Adequacy/Performance

  • Explanatory Metrics
    • Based on the same data used for calibration/training
    • RMSE, \(R^2\), \(\log p(y | M)\)
  • Predictive Metrics
    • Held-out data/cross-validation

Bias-Variance Tradeoff

What Is The Goal of Model Selection?

Key Idea: Model selection consists of navigating the bias-variance tradeoff.

Model error (e.g. RMSE) is a combination of irreducible error, bias, and variance.

Setting for Model Selection

Suppose we have a data-generating model \[y = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma).\] We want to fit a model \(\hat{y} = \hat{f}(x)\) that approximates \(f\).

Bias

Bias is systematic error from a mismatch between the model's average predictions and the underlying process (\(\text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}(x)] - f(x)\)).

Bias comes from under-fitting meaningful relationships between inputs and outputs.

  • too few degrees of freedom (“too simple”)
  • neglected processes

Variance

Variance is error from over-sensitivity to small fluctuations in the training data (\(\text{Variance} = \text{Var}[\hat{f}(x)]\)).

Variance can come from over-fitting noise in the data.

  • too many degrees of freedom (“too complex”)
  • poor identifiability

Decomposition of MSE

\[ \begin{align*} \text{MSE} &= \mathbb{E}[(y - \hat{f})^2] \\ &= \mathbb{E}[y^2 - 2y\hat{f} + \hat{f}^2] \\ &= \mathbb{E}[y^2] - 2\mathbb{E}[y\hat{f}] + \mathbb{E}[\hat{f}^2] \\ &= \mathbb{E}[(f + \varepsilon)^2] - 2\mathbb{E}[(f + \varepsilon)\hat{f}] + \mathbb{E}[\hat{f}^2] \\ &= \vdots \\ &= \text{Bias}(\hat{f})^2 + \text{Var}(\hat{f}) + \sigma^2 \end{align*} \]
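
For completeness, here are the steps behind the \(\vdots\) (a standard derivation, writing \(f\) for \(f(x)\) and assuming \(\mathbb{E}[\varepsilon] = 0\), \(\mathbb{E}[\varepsilon^2] = \sigma^2\), and \(\varepsilon\) independent of \(\hat{f}\)):

\[ \begin{align*} \mathbb{E}[y^2] &= \mathbb{E}[(f + \varepsilon)^2] = f^2 + \sigma^2 \\ \mathbb{E}[y\hat{f}] &= \mathbb{E}[(f + \varepsilon)\hat{f}] = f\,\mathbb{E}[\hat{f}] \\ \mathbb{E}[\hat{f}^2] &= \text{Var}(\hat{f}) + \mathbb{E}[\hat{f}]^2 \end{align*} \]

so that

\[ \text{MSE} = f^2 + \sigma^2 - 2f\,\mathbb{E}[\hat{f}] + \text{Var}(\hat{f}) + \mathbb{E}[\hat{f}]^2 = \left(f - \mathbb{E}[\hat{f}]\right)^2 + \text{Var}(\hat{f}) + \sigma^2. \]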

Bias-Variance Tradeoff

Upshot: To achieve a fixed error level, you can reduce bias (with a more “complex” model) or reduce variance (with a “simpler” model), but there’s a tradeoff.

Bias-Variance Tradeoff More Generally

This decomposition is for MSE, but the principle holds more generally.

  • Models which perform better “on average” on the training data (low bias) are more likely to overfit (high variance);
  • Models which have less uncertainty on the training data (low variance) will tend to do worse “on average” (higher bias); see the sketch below.
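
As an illustration (not from the lecture), here is a minimal Julia sketch: data are simulated from a hypothetical truth \(f(x) = \sin(2\pi x)\) with Gaussian noise, and polynomials of increasing degree are fit by least squares. The degree-1 fit underfits (high bias); the degree-10 fit has low training RMSE but a worse test RMSE (high variance).

using Random, Statistics

# simulate training and testing data from a hypothetical truth
Random.seed!(1)
f(x) = sin(2π * x)
n = 30
x_train = rand(n); y_train = f.(x_train) .+ 0.2 .* randn(n)
x_test = rand(n); y_test = f.(x_test) .+ 0.2 .* randn(n)

# design matrix for a degree-d polynomial and an RMSE helper
poly(x, d) = [xi^j for xi in x, j in 0:d]
rmse(y, yhat) = sqrt(mean((y .- yhat) .^ 2))

for d in (1, 3, 10)
    β = poly(x_train, d) \ y_train    # least-squares fit of the polynomial coefficients
    println("degree $d: train RMSE = ", round(rmse(y_train, poly(x_train, d) * β), digits=3),
        ", test RMSE = ", round(rmse(y_test, poly(x_test, d) * β), digits=3))
end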

Model Complexity

Model complexity is not necessarily the same as the number of parameters.

Sometimes processes in the model can compensate for each other (“reducing degrees of freedom”).

This can help improve the representation of the dynamics and reduce error/uncertainty even when additional parameters are included.

Bias-Variance and Complexity

Figure: The bias-variance tradeoff as a function of model complexity (source: Wikipedia).

Occam’s Razor

Entities are not to be multiplied without necessity.

Credited to William of Ockham, though the idea appears much earlier in the works of Maimonides, Ptolemy, and Aristotle; it was first formulated in this form by John Punch (1639).

“Zebra Principle”

More colloquially:

When you hear hoofbeats, think of horses, not zebras.

— Theodore Woodward

Cross-Validation and Predictive Fit

Log-Likelihood as Predictive Fit Measure

The measure of predictive fit that we will use:

The log predictive density, or log-likelihood, of a replicated data point or data set: \[\log p(y^{\text{rep}} | \theta).\]
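
For example, a minimal sketch with made-up numbers, assuming a hypothetical Normal sampling model fit elsewhere (using Distributions.jl):

using Distributions

# hypothetical fitted parameters and a single replicated/held-out data point
μ_hat, σ_hat = 1.2, 0.3
y_rep = 1.5

# log predictive density of y_rep under the fitted sampling distribution
log_p = logpdf(Normal(μ_hat, σ_hat), y_rep)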

Why Use Log-Likelihood?

Why use the log-likelihood instead of the log-posterior density?

  • The likelihood captures the data-generating process;
  • The posterior includes the prior, which is only relevant for parameter estimation.

Important: This means the prior is still relevant in predictive model assessment (it shapes the posterior used for prediction), and should be thought of as part of the model structure!

Cross-Validation

The “gold standard” way to test for predictive performance is cross-validation:

  1. Split data into training/testing sets;
  2. Calibrate model to training set;
  3. Check for predictive ability on testing set.

Leave-One-Out Cross-Validation

  1. Drop one value \(y_i\).
  2. Refit model on rest of data \(y_{-i}\).
  3. Evaluate \(\log p(y_i | y_{-i})\).
  4. Repeat for every data point in the data set.

LOO-CV Example

Code
# assumes `dat_annmax` (annual maximum tide levels, with columns :Year and :residual)
# has been loaded earlier in the notes
using Plots, DataFrames   # `mm` for margins comes from Measures, re-exported by Plots

drop_idx = rand(1:nrow(dat_annmax), 1)   # randomly choose one observation to hold out
p1 = plot(
    dat_annmax.Year[setdiff(1:end, drop_idx)],
    dat_annmax.residual[setdiff(1:end, drop_idx)];
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18,
    left_margin=5mm, 
    bottom_margin=5mm
)
# highlight the held-out observation in red
scatter!(p1,
    dat_annmax.Year[drop_idx],
    dat_annmax.residual[drop_idx],
    color=:red,
    label=false,
    markersize=5,
    marker=:circle
)
Figure 1: Annual maximum tide levels with one randomly selected observation (red) held out.

LOO-CV Example

Code
# refit the stationary GEV without the held-out data point
# (assumes `dat_annmax` and `drop_idx` from the previous block)
using Distributions, Optim, StatsPlots   # StatsPlots provides the plot recipe for Distribution objects

# parameter bounds and initial values: (μ, σ, ξ) for the stationary GEV;
# the nonstationary values are not used in this snippet (they feed the model comparison below)
stat_lb = [1000.0, 0.0, -1.0]
stat_ub = [2000.0, 100.0, 1.0]
stat_p0 = [1200.0, 50.0, 0.01]
nonstat_lb = [1000.0, -20.0, 0.0, -5.0]
nonstat_ub = [2000.0, 20.0, 100.0, 5.0]
nonstat_p0 = [1200.0, 0.5, 50.0, 0.05]

# maximize the GEV log-likelihood over the retained data (box-constrained optimization)
stat_cv = Optim.optimize(p -> -sum(logpdf.(GeneralizedExtremeValue(p[1], p[2], p[3]), dat_annmax.residual[setdiff(1:end, drop_idx)])), stat_lb, stat_ub, stat_p0)
stat_cv_mle = stat_cv.minimizer
# log predictive density of the held-out observation under the refit model
stat_cv_logp = sum(logpdf.(GeneralizedExtremeValue(stat_cv_mle[1], stat_cv_mle[2], stat_cv_mle[3]), dat_annmax.residual[drop_idx]))

p = plot(GeneralizedExtremeValue(stat_cv_mle[1], stat_cv_mle[2], stat_cv_mle[3]),
    xlabel="Storm Surge Extreme (mm)",
    ylabel="Probability Density",
    xlims=(1000, 1800),
    linewidth=3,
    label=false,
    tickfontsize=16,
    guidefontsize=18,
    legendfontsize=16,
    left_margin=5mm, 
    bottom_margin=5mm,
    right_margin=10mm
)
# mark the held-out value against the fitted density
vline!(p, dat_annmax.residual[drop_idx], color=:red, linestyle=:dash, label="Held-out Value", linewidth=2)
plot!(p, size=(600, 400))
Figure 2: MLE GEV fit without the held-out data point (red dashed line).

\(\log p(y_i | y_{-i}) = -7.2\).

LOO-CV Example

Model           LOO-CV
Stationary      -711.064
Nonstationary   -706.663
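
A sketch of how totals like these could be computed for the stationary model (assuming `dat_annmax`, `stat_lb`, `stat_ub`, `stat_p0`, Distributions.jl, and Optim.jl from the code above are in scope): refit the GEV once per held-out observation and sum the held-out log predictive densities.

# log p(y_i | y_{-i}) for each observation, refitting the stationary GEV each time
loo_scores = map(1:nrow(dat_annmax)) do i
    y_train = dat_annmax.residual[setdiff(1:nrow(dat_annmax), i)]
    fit = Optim.optimize(
        p -> -sum(logpdf.(GeneralizedExtremeValue(p[1], p[2], p[3]), y_train)),
        stat_lb, stat_ub, stat_p0)
    mle = fit.minimizer
    logpdf(GeneralizedExtremeValue(mle[1], mle[2], mle[3]), dat_annmax.residual[i])
end
loo_total = sum(loo_scores)    # LOO-CV score for the stationary model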

Bayesian LOO-CV

Done naively, Bayesian LOO-CV is extremely expensive:

\[\text{loo-cv} = \sum_{i=1}^n \log p_{\text{post}(-i)}(y_i),\]

which requires refitting the model without \(y_i\) for every data point.
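
Spelling out the notation: each term is the predictive density of \(y_i\) under the posterior conditioned on all of the other data,

\[ p_{\text{post}(-i)}(y_i) = \int p(y_i | \theta)\, p(\theta | y_{-i})\, d\theta, \]

so each of the \(n\) terms requires its own posterior.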

Leave-\(k\)-Out Cross-Validation

Drop \(k\) values, refit model on rest of data, check for predictive skill.

As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]

Challenges with Cross-Validation

  • This can be very computationally expensive!
  • We often don’t have a lot of data for calibration, so holding some back can be a problem.
  • How do we divide data with spatial or temporal structure? This can be addressed by partitioning the data more cleverly, e.g. training on the first \(t\) observations of a time series and testing on the rest, \[y = \{y_{1:t}, y_{(t+1):T}\},\] but this makes the data-scarcity problem worse.

Expected Out-Of-Sample Predictive Accuracy

The out-of-sample predictive fit of a new data point \(\tilde{y}_i\) is

\[ \begin{align} \log p_\text{post}(\tilde{y}_i) &= \log \mathbb{E}_\text{post}\left[p(\tilde{y}_i | \theta)\right] \\ &= \log \int p(\tilde{y}_i | \theta) p_\text{post}(\theta)\,d\theta. \end{align} \]
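
In practice this integral is typically approximated using draws \(\theta^{(1)}, \ldots, \theta^{(S)}\) from the posterior:

\[ p_\text{post}(\tilde{y}_i) \approx \frac{1}{S} \sum_{s=1}^S p(\tilde{y}_i | \theta^{(s)}). \]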

Expected Out-Of-Sample Predictive Accuracy

However, the out-of-sample data \(\tilde{y}_i\) is itself unknown, so we need to compute the expected out-of-sample log-predictive density

\[ \begin{align} \text{elpd} &= \text{expected log-predictive density for } \tilde{y}_i \\ &= \mathbb{E}_P \left[\log p_\text{post}(\tilde{y}_i)\right] \\ &= \int \log\left(p_\text{post}(\tilde{y}_i)\right) P(\tilde{y}_i)\,d\tilde{y}_i. \end{align} \]

Expected Out-Of-Sample Predictive Accuracy

What is the challenge?

We don’t know \(P\) (the distribution of new data)!

We need some measure of the error induced by using an approximating distribution \(Q\) from some model.
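
One standard such measure (and the starting point for the information criteria in the next class) is the Kullback-Leibler divergence from \(Q\) to \(P\),

\[ D_{KL}(P \| Q) = \int \log\left(\frac{p(\tilde{y})}{q(\tilde{y})}\right) p(\tilde{y})\, d\tilde{y}, \]

where \(p\) and \(q\) are the densities of \(P\) and \(Q\).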

Key Takeaways and Upcoming Schedule

Key Takeaways

  • Model selection is a balance between bias (underfitting) and variance (overfitting).
  • For predictive assessment, leave-one-out cross-validation is the ideal, but it is hard to implement in practice (particularly for time series).

An Important Caveat

Model selection can result in significant overfitting when separated from hypothesis-driven model development (Freedman, 1983; Smith, 2018).

An Important Caveat

  • You are better off thinking carefully about the scientific or engineering problem you want to solve and using domain knowledge/checks rather than throwing a large number of possible models into the selection machinery.
  • Regularizing priors reduce the potential for overfitting.
  • Model averaging (Hoeting et al., 1999) and stacking (Yao et al., 2018) can combine multiple models as an alternative to selection.

Next Classes

Monday: Information Criteria

References


Freedman, D. A. (1983). A note on screening regression equations. Am. Stat., 37, 152. https://doi.org/10.2307/2685877
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Stat. Sci., 14, 382–401. https://doi.org/10.1214/ss/1009212519
Smith, G. (2018). Step away from stepwise. J. Big Data, 5, 1–12. https://doi.org/10.1186/s40537-018-0143-6
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Anal., 13, 917–1007. https://doi.org/10.1214/17-BA1091