Other Information Criteria


Lecture 20

April 24, 2024

Review of Last Class

Expected Out-Of-Sample Predictive Accuracy

The out-of-sample predictive fit of a new data point \(\tilde{y}_i\) is

\[ \begin{align} \log p_\text{post}(\tilde{y}_i) &= \log \mathbb{E}_\text{post}\left[p(\tilde{y}_i | \theta)\right] \\ &= \log \int p(\tilde{y}_i | \theta) p_\text{post}(\theta)\,d\theta. \end{align} \]
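In practice, this integral is approximated by averaging the likelihood over posterior draws. A minimal Julia sketch, assuming a hypothetical normal likelihood with known \(\sigma\) and a stand-in vector θ_draws of MCMC samples for the mean:

using Distributions, Statistics

# Monte Carlo estimate of log p_post(ỹ) = log E_post[p(ỹ | θ)]
# θ_draws stands in for posterior samples; Normal(θ, σ) is an assumed likelihood
function log_post_pred(ỹ, θ_draws; σ=1.0)
    ll = [logpdf(Normal(θ, σ), ỹ) for θ in θ_draws]
    m = maximum(ll)
    m + log(mean(exp.(ll .- m)))  # log-mean-exp for numerical stability
end

θ_draws = 1.0 .+ randn(4_000)  # stand-in for real MCMC draws
log_post_pred(0.5, θ_draws)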

Expected Out-Of-Sample Predictive Accuracy

However, the out-of-sample data \(\tilde{y}_i\) is itself unknown, so we need to compute the expected out-of-sample log-predictive density

\[ \begin{align} \text{elpd} &= \text{expected log-predictive density for } \tilde{y}_i \\ &= \mathbb{E}_P \left[\log p_\text{post}(\tilde{y}_i)\right] \\ &= \int \log\left(p_\text{post}(\tilde{y}_i)\right) P(\tilde{y}_i)\,d\tilde{y}_i. \end{align} \]

Expected Out-Of-Sample Predictive Accuracy

What is the challenge?

We don’t know \(P\) (the distribution of new data)!

We need some measure of the error induced by using an approximating distribution \(Q\) from some model.

Information Criteria Overview

There is a common framework for all of these:

If we compute the log-predictive density for the existing data, \(\log p(y | \theta)\), this within-sample fit will be overly optimistic and will overestimate the predictive skill for new data.

Information Criteria Corrections

We can adjust for that bias by correcting for the effective number of parameters, which can be thought of as the expected degrees of freedom in a model contributing to overfitting.

Akaike Information Criterion (AIC)

The “first” information criterion that most people see.

Uses a point estimate (the maximum-likelihood estimate \(\hat{\theta}_\text{MLE}\)) to compute the log-predictive density for the data, corrected by the number of parameters \(k\):

\[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]

AIC Formula

The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).

Due to this convention, lower AICs are better (they correspond to a higher predictive skill).
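As a minimal sketch (a toy normal model, not the lecture’s storm-surge example):

using Distributions, Statistics

y = randn(100)                             # toy data
μ̂, σ̂ = mean(y), std(y; corrected=false)   # normal MLEs
k = 2                                      # number of fitted parameters
elpd_aic = loglikelihood(Normal(μ̂, σ̂), y) - k
aic = -2 * elpd_aic                        # lower is better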

AIC Correction Term

In the case of a normal model with independent and identically-distributed data and uniform priors, \(k\) is the asymptotically “correct” bias term (there are modified corrections for small sample sizes).
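For example, one widely used small-sample correction is \[\text{AIC}_c = \text{AIC} + \frac{2k(k+1)}{n - k - 1},\] which converges to the AIC as \(n \to \infty\).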

However, with more informative priors and/or hierarchical models, the bias correction \(k\) is no longer appropriate, as there is less “freedom” associated with each parameter.

AIC Interpretation

Absolute AIC values have no meaning; only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\) matter.

Some basic rules of thumb (from Burnham & Anderson (2004)):

  • \(\Delta_i < 2\) means the model has “substantial” support across \(\mathcal{M}\);
  • \(4 < \Delta_i < 7\) suggests “considerably less” support;
  • \(\Delta_i > 10\) suggests “essentially no” support.

AIC and Model Evidence

\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).

The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).

AIC and Model Averaging

This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]

Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
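As a short sketch of the computation, using the AIC values from the storm-surge example later in this lecture as input:

aics = [1421.0, 1413.0, 1418.0]   # Stationary, Time, PDO (from the example below)
Δ = aics .- minimum(aics)         # AIC differences
exp(-Δ[1] / 2) / exp(-Δ[3] / 2)   # relative evidence for M_1 vs. M_3
w = exp.(-Δ ./ 2)
w ./= sum(w)                      # Akaike weights (sum to 1)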

Model Averaging vs. Selection

Model averaging can sometimes be preferable to model selection.

Model selection can introduce bias from the selection process (this is particularly acute for stepwise selection due to path-dependence).

Other Information Criteria

Deviance Information Criterion (DIC)

The Deviance Information Criterion (DIC) is a more Bayesian generalization of AIC which uses the posterior mean \[\hat{\theta}_\text{Bayes} = \mathbb{E}\left[\theta | y\right]\] and a bias correction derived from the data.

DIC

\[\widehat{\text{elpd}}_\text{DIC} = \log p(y | \hat{\theta}_\text{Bayes}) - p_{\text{DIC}},\] where \[p_\text{DIC} = 2\left(\log p(y | \hat{\theta}_\text{Bayes}) - \mathbb{E}_\text{post}\left[\log p(y | \theta)\right]\right).\]

Then, as with AIC, \[\text{DIC} = -2\widehat{\text{elpd}}_\text{DIC}.\]
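Given the log-likelihood at the posterior mean and the log-likelihood evaluated across posterior draws, both quantities are one-liners; a generic sketch (ll_bayes and ll_draws are placeholder names), with a full worked example later in this lecture:

using Statistics

# ll_draws: log p(y | θ) for each posterior draw; ll_bayes: log p(y | θ̂_Bayes)
p_dic(ll_bayes, ll_draws) = 2 * (ll_bayes - mean(ll_draws))
dic(ll_bayes, ll_draws) = -2 * (ll_bayes - p_dic(ll_bayes, ll_draws))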

DIC: Effective Number of Parameters

What is the meaning of \(p_\text{DIC}\)?

  • The difference between the log-likelihood at a parameter average (the posterior mean) and the average log-likelihood (across the posterior) measures “degrees of freedom”.
  • The DIC adjustment assumes independence of residuals for fixed \(\theta\).

AIC vs. DIC

AIC and DIC often give similar results, but don’t have to.

The key difference is the impact of priors on parameter estimation and model degrees of freedom.

AIC vs. DIC Storm Surge Example

Models:

  1. Stationary (“null”) model, \(y_t \sim \text{GEV}(\mu, \sigma, \xi)\);
  2. Time nonstationary (“null-ish”) model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 t, \sigma, \xi)\);
  3. PDO nonstationary model, \(y_t \sim \text{GEV}(\mu_0 + \mu_1 \text{PDO}_t, \sigma, \xi)\).

AIC vs. DIC

# fit models with NUTS: 4 chains of 10,000 iterations each
stat_chain = sample(stat_mod, NUTS(), MCMCThreads(), 10_000, 4)
nonstat_chain = sample(nonstat_mod, NUTS(), MCMCThreads(), 10_000, 4)
pdo_chain = sample(pdo_mod, NUTS(), MCMCThreads(), 10_000, 4)

# maximum-likelihood fits for the AIC point estimates
# (a sketch assuming Turing's mode-estimation interface via Optim;
# the result's `lp` field holds the log-likelihood at the mode)
using Optim
stat_mle = optimize(stat_mod, MLE())
nonstat_mle = optimize(nonstat_mod, MLE())
pdo_mle = optimize(pdo_mod, MLE())
stat_aic = stat_mle.lp - 3        # elpd_AIC = log p(y | θ̂_MLE) - k, with k = 3
nonstat_aic = nonstat_mle.lp - 4  # k = 4
pdo_aic = pdo_mle.lp - 4          # k = 4

# posterior means for the DIC point estimates
stat_est = mean(stat_chain)[:, 2]
nonstat_est = mean(nonstat_chain)[:, 2]
pdo_est = mean(pdo_chain)[:, 2]

# log-likelihood at the posterior mean, log p(y | θ̂_Bayes)
stat_bayes = loglikelihood(stat_mod, (μ = stat_est[1], σ = stat_est[2], ξ = stat_est[3]))
nonstat_bayes = loglikelihood(nonstat_mod, (a = nonstat_est[1], b = nonstat_est[2], σ = nonstat_est[3], ξ = nonstat_est[4]))
pdo_bayes = loglikelihood(pdo_mod, (a = pdo_est[1], b = pdo_est[2], σ = pdo_est[3], ξ = pdo_est[4]))

# posterior average log-likelihood, E_post[log p(y | θ)]
stat_ll = mean([loglikelihood(stat_mod, row) for row in eachrow(DataFrame(stat_chain))])
nonstat_ll = mean([loglikelihood(nonstat_mod, row) for row in eachrow(DataFrame(nonstat_chain))])
pdo_ll = mean([loglikelihood(pdo_mod, row) for row in eachrow(DataFrame(pdo_chain))])

# elpd_DIC = log p(y | θ̂_Bayes) - p_DIC = 2 E_post[log p(y | θ)] - log p(y | θ̂_Bayes)
stat_dic = 2 * stat_ll - stat_bayes
nonstat_dic = 2 * nonstat_ll - nonstat_bayes
pdo_dic = 2 * pdo_ll - pdo_bayes

# tabulate AIC and DIC (each is -2 times the corresponding elpd estimate)
model_dic = DataFrame(Model=["Stationary", "Time", "PDO"],
    AIC=trunc.(Int64, round.(-2 * [stat_aic, nonstat_aic, pdo_aic]; digits=0)),
    DIC=trunc.(Int64, round.(-2 * [stat_dic, nonstat_dic, pdo_dic]; digits=0)))
3×3 DataFrame
 Row │ Model       AIC    DIC
     │ String      Int64  Int64
─────┼──────────────────────────
   1 │ Stationary   1421   1421
   2 │ Time         1413   1413
   3 │ PDO          1418   1419

Convergence of AIC and DIC

Both AIC and DIC converge (as \(n \to \infty\)) to expected leave-one-out CV.

The question is how well they do under limited sample sizes!

Watanabe-Akaike Information Criterion (WAIC)

\[\widehat{\text{elpd}}_\text{WAIC} = \sum_{i=1}^n \log \int p(y_i | \theta) p_\text{post}(\theta)\,d\theta - p_{\text{WAIC}},\]

where \[p_\text{WAIC} = \sum_{i=1}^n \text{Var}_\text{post}\left(\log p(y_i | \theta)\right).\]
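Both terms can be computed from an \(S \times n\) matrix of pointwise log-likelihoods (one row per posterior draw, one column per observation); a minimal sketch, assuming such a matrix L has been assembled:

using Statistics

# L[s, i] = log p(y_i | θ_s) for posterior draw θ_s
function waic(L)
    m = maximum(L, dims=1)                             # for numerical stability
    lppd = vec(m .+ log.(mean(exp.(L .- m), dims=1)))  # log pointwise predictive density
    p_waic = vec(var(L, dims=1))                       # per-observation posterior variance
    elpd = sum(lppd) - sum(p_waic)
    (elpd_waic=elpd, p_waic=sum(p_waic), waic=-2 * elpd)
end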

WAIC Correction Factor

\(p_\text{WAIC}\) is an estimate of the number of “unconstrained” parameters in the model.

  • A parameter counts as 1 if its estimate is “independent” of the prior;
  • A parameter counts as 0 if it is fully constrained by the prior;
  • A parameter counts partially if both the data and the prior are informative.

WAIC vs. AIC and DIC

  • WAIC can be viewed as an approximation to leave-one-out CV, and it averages over the entire posterior, whereas AIC and DIC use point estimates.
  • But it doesn’t work well with highly structured data; in those settings there is no real alternative to more clever uses of Bayesian cross-validation.

“Bayesian” “Information” Criterion (BIC)

\[\text{BIC} = -2\log p(y | \hat{\theta}_\text{MLE}) + k\log n.\]
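As a quick sketch, continuing the toy normal example from the AIC code above:

n = length(y)
bic = -2 * loglikelihood(Normal(μ̂, σ̂), y) + k * log(n)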

BIC:

  • Is not Bayesian (it relies on the MLE);
  • Has no relationship to information theory (unlike AIC/DIC);
  • Assumes \(\mathcal{M}\)-closed (i.e., that the true model is among those under consideration).

BIC

BIC approximates the log prior predictive density (the marginal likelihood) and leave-\(k\)-out cross-validation (hence the extra penalization for additional parameters).

This is why it’s odd when model selection consists of examining both AIC and BIC: these are different quantities with different purposes!

Key Takeaways and Upcoming Schedule

Key Takeaways

  • Information Criteria are an approximation to LOO-CV based on “correcting” for model complexity.
  • AIC and DIC can be used to approximate LOO-CV.
  • BIC is an entirely different measure, approximating the prior predictive distribution (leave-\(k\)-out CV).

An Important Caveat

Model selection can result in significant overfitting when separated from hypothesis-driven model development (Freedman, 1983; Smith, 2018).

An Important Caveat

  • You are better off thinking about the scientific or engineering problem you want to solve and using domain knowledge/checks, rather than throwing a large number of possible models into the selection machinery.
  • Regularizing priors reduce potential for overfitting.
  • Model averaging (Hoeting et al., 1999) and stacking (Yao et al., 2018) can combine multiple models as an alternative to selection.

Next Classes

Next Week: Emulating Complex Models

References

References

Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644
Freedman, D. A. (1983). A note on screening regression equations. Am. Stat., 37, 152. https://doi.org/10.2307/2685877
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Stat. Sci., 14, 382–401. https://doi.org/10.1214/ss/1009212519
Smith, G. (2018). Step away from stepwise. J. Big Data, 5, 1–12. https://doi.org/10.1186/s40537-018-0143-6
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Anal., 13, 917–1007. https://doi.org/10.1214/17-BA1091