Information Criteria


Lecture 19

April 22, 2024

Review of Last Class

Bias-Variance Tradeoff

Key Idea: Model selection consists of navigating the bias-variance tradeoff.

Model error (e.g. MSE) is a combination of irreducible error, (squared) bias, and variance.
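Written out for mean squared error (a standard result, stated loosely here in the notation of the next two slides):

\[\mathbb{E}\left[(y - \hat{f})^2\right] = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{\left(\mathbb{E}[\hat{f}] - y\right)^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat{f})}_{\text{variance}}.\]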

Bias

Bias is error from mismatches between the model predictions and the data (\(\text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}] - y\)).

Bias comes from under-fitting meaningful relationships between inputs and outputs.

  • too few degrees of freedom (“too simple”)
  • neglected processes.

Variance

Variance is error from over-sensitivity to small fluctuations in inputs (\(\text{Variance} = \text{Var}(\hat{f})\)).

Variance can come from over-fitting noise in the data.

  • too many degrees of freedom (“too complex”)
  • poor identifiability

Bias-Variance Tradeoff

Upshot: To achieve a fixed error level, you can reduce bias (with a more “complex” model) or reduce variance (with a “simpler” model), but there’s a tradeoff between the two.

Bias-Variance Tradeoff More Generally

This decomposition is for MSE, but the principle holds more generally.

  • Models which perform better “on average” over the training data (low bias) are more likely to overfit (high variance);
  • Models which are less sensitive to the particular training data (low variance) will tend to do worse “on average” (higher bias).

Cross-Validation

The “gold standard” way to test for predictive performance is cross-validation:

  1. Split data into training/testing sets;
  2. Calibrate model to training set;
  3. Check for predictive ability on testing set.

Leave-One-Out Cross-Validation

  1. Drop one value \(y_i\).
  2. Refit model on rest of data \(y_{-i}\).
  3. Evaluate \(\log p(y_i | y_{-i})\).
  4. Repeat for each remaining data point in the data set (a minimal sketch of this loop follows below).
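A minimal sketch of this procedure, assuming a toy normal model with known observation error and a flat prior on the mean (the data, the value of \(\sigma\), and all variable names here are illustrative, not from the lecture’s storm surge example):

using Distributions, Statistics

y = [2.1, 1.8, 2.5, 1.9, 2.3]   # hypothetical observations
σ = 0.5                         # assumed known observation standard deviation

loo_lpd = map(eachindex(y)) do i
    y_train = y[setdiff(eachindex(y), i)]     # drop y_i
    n = length(y_train)
    μ_post = mean(y_train)                    # posterior mean of μ under a flat prior
    # posterior predictive for the held-out point is Normal(μ_post, sqrt(σ^2 + σ^2/n))
    pred = Normal(μ_post, sqrt(σ^2 + σ^2 / n))
    logpdf(pred, y[i])                        # log p(y_i | y_{-i})
end

elpd_loo = sum(loo_lpd)   # leave-one-out estimate of predictive skill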

Leave-\(k\)-Out Cross-Validation

Drop \(k\) values, refit model on rest of data, check for predictive skill.

As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]

Expected Out-Of-Sample Predictive Accuracy

The out-of-sample predictive fit of a new data point \(\tilde{y}_i\) is

\[ \begin{align} \log p_\text{post}(\tilde{y}_i) &= \log \mathbb{E}_\text{post}\left[p(\tilde{y}_i | \theta)\right] \\ &= \log \int p(\tilde{y}_i | \theta) p_\text{post}(\theta)\,d\theta. \end{align} \]
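In practice this integral is typically approximated with Monte Carlo using posterior draws \(\theta_s\); a minimal sketch, assuming a normal observation model and stand-in posterior samples (all names and values here are illustrative):

using Distributions, StatsFuns

θ_draws = 2.0 .+ 0.2 .* randn(1000)   # stand-in posterior samples of a mean parameter
σ = 0.5                               # assumed known observation standard deviation
y_new = 2.4                           # candidate out-of-sample observation

# log p_post(y_new) ≈ log((1/S) Σ_s p(y_new | θ_s)), computed stably with logsumexp
log_lik = [logpdf(Normal(θ, σ), y_new) for θ in θ_draws]
log_p_post = logsumexp(log_lik) - log(length(θ_draws))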

Expected Out-Of-Sample Predictive Accuracy

However, the out-of-sample data \(\tilde{y}_i\) is itself unknown, so we need to compute the expected out-of-sample log-predictive density

\[ \begin{align} \text{elpd} &= \text{expected log-predictive density for } \tilde{y}_i \\ &= \mathbb{E}_P \left[\log p_\text{post}(\tilde{y}_i)\right] \\ &= \int \log\left(p_\text{post}(\tilde{y}_i)\right) P(\tilde{y}_i)\,d\tilde{y}_i. \end{align} \]
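Note that LOO-CV (from earlier) estimates this quantity empirically by treating the observed data as draws from \(P\): summing the leave-one-out log-predictive densities over the data set, \[\widehat{\text{elpd}}_\text{loo} = \sum_{i=1}^n \log p(y_i | y_{-i}),\] which is the quantity that information criteria will try to approximate more cheaply.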

Expected Out-Of-Sample Predictive Accuracy

What is the challenge?

We don’t know \(P\) (the distribution of new data)!

We need some measure of the error induced by using an approximating distribution \(Q\) from some model.

Information Criteria

Information Criteria

“Information criteria” refers to a category of estimators of prediction error.

The idea: estimate predictive error using the fitted model.

Information Criteria Overview

There is a common framework for all of these:

If we compute the log-predictive density for the existing data \(p(y | \theta)\), the result will be overly optimistic (the same data were used to fit the model), so it will overestimate the predictive skill for new data.

Information Criteria Corrections

We can adjust for that bias by correcting for the effective number of parameters, which can be thought of as the expected degrees of freedom in a model contributing to overfitting.

Akaike Information Criterion (AIC)

The “first” information criterion that most people see.

Uses a point estimate (the maximum-likelihood estimate \(\hat{\theta}_\text{MLE}\)) to compute the log-predictive density for the data, corrected by the number of parameters \(k\):

\[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]

AIC Formula

The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).

Due to this convention, lower AICs are better (they correspond to a higher predictive skill).
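As a quick sketch in code (a hypothetical helper, not from the lecture materials), given a maximized log-likelihood and parameter count:

# AIC from a maximized log-likelihood `loglik` and number of parameters `k`
aic(loglik, k) = -2 * (loglik - k)

aic(-707.67, 3)   # ≈ 1421.3 for the stationary storm surge model below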

AIC Correction Term

In the case of a normal model with independent and identically-distributed data and uniform priors, \(k\) is the asymptotically “correct” bias term (there are modified corrections for small sample sizes).

However, with more informative priors and/or hierarchical models, the bias correction \(k\) is no longer appropriate, as there is less “freedom” associated with each parameter.

AIC: Storm Surge Example

Models:

  1. Stationary (“null”) model, \(y_t \sim \text{GEV}(\mu, \sigma, \xi);\)
  2. Time nonstationary (“null-ish”) model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 t, \sigma, \xi);\)
  3. PDO nonstationary model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 \text{PDO}_t, \sigma, \xi)\)

AIC Example

using DataFrames

# MLE parameter estimates and maximized log-likelihoods for each model
stat_mle = [1258.71, 56.27, 0.017]              # (μ, σ, ξ)
stat_ll = -707.67
nonstat_mle = [1231.58, 0.42, 52.07, 0.075]     # (μ₀, μ₁, σ, ξ)
nonstat_ll = -702.45
pdo_mle = [1255.87, -12.39, 54.73, 0.033]       # (μ₀, μ₁, σ, ξ)
pdo_ll = -705.24

# elpd estimates: maximized log-likelihood minus the number of parameters k
stat_aic = stat_ll - 3
nonstat_aic = nonstat_ll - 4
pdo_aic = pdo_ll - 4

# multiply by -2 to get AIC values and collect into a table
model_aic = DataFrame(Model=["Stationary", "Time", "PDO"],
                      LogLik=trunc.(Int64, round.([stat_ll, nonstat_ll, pdo_ll]; digits=0)),
                      AIC=trunc.(Int64, round.(-2 * [stat_aic, nonstat_aic, pdo_aic]; digits=0)))
3×3 DataFrame
 Row │ Model       LogLik  AIC
     │ String      Int64   Int64
─────┼───────────────────────────
   1 │ Stationary    -708   1421
   2 │ Time          -702   1413
   3 │ PDO           -705   1418

AIC Interpretation

Absolute AIC values have no meaning, only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\).

Some basic rules of thumb (from Burnham & Anderson (2004)):

  • \(\Delta_i < 2\) means the model has “strong” support within the candidate set \(\mathcal{M}\);
  • \(4 < \Delta_i < 7\) suggests “less” support;
  • \(\Delta_i > 10\) suggests “weak” or “no” support (a worked example for the storm surge models follows this list).
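Applying these to the storm surge example (a quick check using the rounded AIC values from the table above):

# AIC differences relative to the best (lowest-AIC) model
aics = [1421, 1413, 1418]    # Stationary, Time, PDO
Δ = aics .- minimum(aics)    # [8, 0, 5]

By these rules of thumb, the time-nonstationary model has the most support, the PDO model has less support (\(\Delta = 5\)), and the stationary model (\(\Delta = 8\)) falls between the “less” and “weak” ranges.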

AIC and Model Evidence

\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).

The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).

AIC and Model Averaging

This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]

Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
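Continuing the storm surge example (again using the rounded AIC values, so the weights are only approximate):

# Akaike weights for the Stationary, Time, and PDO models
aics = [1421, 1413, 1418]
Δ = aics .- minimum(aics)
w = exp.(-Δ ./ 2) ./ sum(exp.(-Δ ./ 2))   # ≈ [0.02, 0.91, 0.07]

Here most (but not all) of the weight would go to the time-nonstationary model.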

Model Averaging vs. Selection

Model averaging can sometimes be preferable to model selection.

Model selection can introduce bias from the selection process (this is particularly acute for stepwise selection due to path-dependence).

Key Takeaways and Upcoming Schedule

Key Takeaways

  • LOO-CV is ideal for navigating the bias-variance tradeoff but can be computationally prohibitive.
  • Information criteria are an approximation to LOO-CV based on “correcting” for model complexity.
  • The complexity correction penalizes the potential to overfit, yielding an approximation to out-of-sample predictive error.

Next Classes

Wednesday: Other Information Criteria

References

References

Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644