Key Idea: Model selection consists of navigating the bias-variance tradeoff.
Model error (e.g. mean squared error) is a combination of irreducible error, bias, and variance.
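For squared-error loss, this is the standard decomposition (written here with \(y = f(x) + \varepsilon\), noise variance \(\sigma^2\), and fitted model \(\hat{f}\); this notation is added for reference): \[\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\text{Var}\left(\hat{f}(x)\right)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}.\]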
Bias
Bias is systematic error from a mismatch between the model predictions and the data (\(\text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}] - y\)).
Bias comes from under-fitting meaningful relationships between inputs and outputs.
too few degrees of freedom (“too simple”)
neglected processes.
Variance
Variance is error from over-sensitivity to small fluctuations in inputs (\(\text{Variance} = \text{Var}(\hat{f})\)).
Variance can come from over-fitting noise in the data.
too many degrees of freedom (“too complex”)
poor identifiability
Bias-Variance Tradeoff
Upshot: To achieve a fixed error level, you can reduce bias (a more “complex” model) or you can reduce variance (a more “simple” model), but there’s a tradeoff.
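A minimal numerical sketch of this tradeoff (an illustrative simulation; the true function, polynomial degrees, and noise level are assumptions made here, not part of the lecture): repeatedly simulate noisy data from a fixed function, fit a “simple” and a “complex” polynomial, and compare the bias and variance of their predictions at a single test point.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" process (illustration only)
x = np.linspace(0, 1, 30)
x_test, sigma, n_sims = 0.25, 0.3, 500

preds = {1: [], 7: []}                   # degree-1 ("simple") vs degree-7 ("complex")
for _ in range(n_sims):
    y = f(x) + rng.normal(0, sigma, x.size)
    for deg in preds:
        coef = np.polyfit(x, y, deg)     # refit each model to the new noisy sample
        preds[deg].append(np.polyval(coef, x_test))

for deg, p in preds.items():
    p = np.asarray(p)
    print(f"degree {deg}: bias^2 = {(p.mean() - f(x_test))**2:.4f}, variance = {p.var():.4f}")
```

The low-degree fit should show a larger bias and a smaller variance than the high-degree fit.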
Bias-Variance Tradeoff More Generally
This decomposition is for MSE, but the principle holds more generally.
Models which perform better “on average” over the training data (low bias) are more likely to overfit (high variance);
Models which are less sensitive to the training data (low variance) will do worse “on average” (higher bias).
Cross-Validation
The “gold standard” way to test for predictive performance is cross-validation:
Split data into training/testing sets;
Calibrate model to training set;
Check for predictive ability on testing set.
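A minimal sketch of this split-calibrate-evaluate loop (synthetic data and an ordinary least-squares fit evaluated by RMSE; the 80/20 split is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data for illustration: linear trend plus noise
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)

# 1. Split data into training/testing sets
idx = rng.permutation(x.size)
train, test = idx[:80], idx[80:]

# 2. Calibrate model (here, a degree-1 polynomial fit by least squares) to the training set
coef = np.polyfit(x[train], y[train], 1)

# 3. Check predictive ability on the testing set
y_pred = np.polyval(coef, x[test])
print(f"test RMSE: {np.sqrt(np.mean((y[test] - y_pred) ** 2)):.3f}")
```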
Leave-One-Out Cross-Validation
Drop one value \(y_i\).
Refit model on rest of data \(y_{-i}\).
Evaluate \(\log p(y_i | y_{-i})\).
Repeat for each data point in the data set.
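A minimal LOO-CV sketch, assuming a normal model whose mean and standard deviation are re-estimated from each \(y_{-i}\) (a fully Bayesian version would average over the posterior instead):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, 50)                     # synthetic data for illustration

loo_lpd = []
for i in range(y.size):
    y_rest = np.delete(y, i)                     # drop one value y_i
    mu, sd = y_rest.mean(), y_rest.std(ddof=1)   # refit the model on y_{-i}
    loo_lpd.append(norm.logpdf(y[i], mu, sd))    # evaluate log p(y_i | y_{-i})

print(f"LOO estimate of the expected log predictive density: {np.sum(loo_lpd):.2f}")
```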
Leave-\(k\)-Out Cross-Validation
Drop \(k\) values, refit model on rest of data, check for predictive skill.
As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]
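In that limit the predictive distribution can be sampled directly from the prior, as in this sketch (a hypothetical normal model with a normal prior on its mean; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical model: y ~ Normal(theta, 1) with prior theta ~ Normal(0, 2)
n_draws = 10_000
theta = rng.normal(0.0, 2.0, n_draws)   # theta ~ p(theta)
y_rep = rng.normal(theta, 1.0)          # y_rep ~ p(y_rep | theta)

# y_rep holds Monte Carlo draws from the prior predictive p(y_rep)
print(f"prior predictive mean ~ {y_rep.mean():.2f}, sd ~ {y_rep.std():.2f}")
```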
Expected Out-Of-Sample Predictive Accuracy
The out-of-sample predictive fit of a new data point \(\tilde{y}_i\) is \[\log p(\tilde{y}_i | y) = \log \int_{\theta} p(\tilde{y}_i | \theta) p(\theta | y) d\theta,\] and the expected fit averages this over the distribution \(P\) of new data: \[\text{elpd} = \mathbb{E}_P\left[\log p(\tilde{y}_i | y)\right] = \int \log p(\tilde{y}_i | y) P(\tilde{y}_i) d\tilde{y}_i.\]
We don’t know \(P\) (the distribution of new data)!
We need some measure of the error induced by using an approximating distribution \(Q\) from some model.
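One standard such measure, and the one underlying the AIC (Burnham & Anderson, 2004), is the Kullback-Leibler divergence \[D_{\text{KL}}(P \| Q) = \int P(\tilde{y}) \log \frac{P(\tilde{y})}{Q(\tilde{y})}\, d\tilde{y} = \mathbb{E}_P\left[\log P(\tilde{y})\right] - \mathbb{E}_P\left[\log Q(\tilde{y})\right].\] The first term does not depend on the model, so ranking models by expected log-predictive density is equivalent to ranking them by \(D_{\text{KL}}\).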
Information Criteria
“Information criteria” refers to a category of estimators of prediction error.
The idea: estimate predictive error using the fitted model.
Information Criteria Overview
There is a common framework for all of these:
If we compute the expected log-predictive density for the existing data \(p(y | \theta)\), it is evaluated on the same data used to fit the model, so it will overestimate the predictive skill for new data.
Information Criteria Corrections
We can adjust for that bias by correcting for the effective number of parameters, which can be thought of as the expected degrees of freedom in a model contributing to overfitting.
Akaike Information Criterion (AIC)
The “first” information criterion that most people see.
Uses a point estimate (the maximum-likelihood estimate \(\hat{\theta}_\text{MLE}\)) to compute the log-predictive density for the data, corrected by the number of parameters \(k\): \[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]
The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).
Due to this convention, lower AICs are better (they correspond to a higher predictive skill).
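A minimal sketch of computing the AIC by hand, assuming a linear regression with normal errors fit by maximum likelihood (the data are synthetic, and \(k = 3\) counts the intercept, slope, and error variance):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data for illustration
x = rng.uniform(0, 10, 60)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, x.size)

# Maximum-likelihood fit: least-squares coefficients and the MLE of the error scale
coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
sigma_mle = np.sqrt(np.mean(resid ** 2))

# Log-predictive density at the MLE, corrected by the number of parameters
log_lik = np.sum(norm.logpdf(y, np.polyval(coef, x), sigma_mle))
k = 3                                   # intercept, slope, error variance
aic = -2 * (log_lik - k)

print(f"log-likelihood = {log_lik:.2f}, AIC = {aic:.2f}")
```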
AIC Correction Term
In the case of a normal model with independent and identically-distributed data and uniform priors, \(k\) is the asymptotically “correct” bias term (there are modified corrections for small sample sizes).
However, with more informative priors and/or hierarchical models, the bias correction \(k\) is no longer appropriate, as there is less “freedom” associated with each parameter.
Absolute AIC values have no meaning; only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\) matter.
Some basic rules of thumb (from Burnham & Anderson (2004)):
\(\Delta_i < 2\) means the model has “strong” support across the candidate set \(\mathcal{M}\);
\(4 < \Delta_i < 7\) suggests “less” support;
\(\Delta_i > 10\) suggests “weak” or “no” support.
AIC and Model Evidence
\(\exp(-\Delta_i/2)\) can be thought of as a measure of the likelihood of the model given the data \(y\).
The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2)\] can approximate the relative evidence for \(M_i\) versus \(M_j\).
AIC and Model Averaging
This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]
Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(M_i\) is the best (in the sense of approximating the “true” predictive distribution) model in \(\mathcal{M}\).
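A sketch of turning AIC values into Akaike weights and a weighted prediction (the AIC values and per-model predictions below are made up for illustration):

```python
import numpy as np

# Hypothetical AIC values and point predictions from three candidate models
aic = np.array([102.3, 104.1, 110.8])
preds = np.array([4.2, 3.9, 5.1])

delta = aic - aic.min()          # Delta_i = AIC_i - AIC_min
w = np.exp(-delta / 2)
w /= w.sum()                     # Akaike weights sum to 1

print("weights:", np.round(w, 3))
print("model-averaged prediction:", round(float(np.dot(w, preds)), 3))
```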
Model Averaging vs. Selection
Model averaging can sometimes be preferable to model selection.
Model selection can introduce bias from the selection process (this is particularly acute for stepwise selection due to path-dependence).
Key Takeaways and Upcoming Schedule
Key Takeaways
LOO-CV is ideal for navigating the bias-variance tradeoff but can be computationally prohibitive.
Information Criteria are an approximation to LOO-CV based on “correcting” for model complexity.
The complexity correction acts as a penalty for the potential to overfit, yielding an approximation to out-of-sample predictive error.
Next Classes
Wednesday: Other Information Criteria
References
Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644