Define models for hypotheses (including baseline/null).
Fit models to data.
Simulate alternative datasets.
Compare output/test statistics to data.
Maybe: Convert metrics to probabilities (condition on \(\mathcal{M}\)).
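A minimal sketch of this workflow in code (a hypothetical example with synthetic data, using only numpy): the null model is “no trend,” the alternative adds a linear trend, and the test statistic is the fitted slope.

```python
# Sketch of the workflow above: null model = constant mean + noise,
# alternative model = linear trend; test statistic = fitted slope.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "observed" data (stand-in for a real dataset).
t = np.arange(50)
y = 0.05 * t + rng.normal(0.0, 1.0, size=t.size)

# 1-2. Define and fit the models (least squares / method of moments here).
null_mean, null_sd = y.mean(), y.std(ddof=1)
obs_slope = np.polyfit(t, y, deg=1)[0]

# 3. Simulate alternative datasets from the fitted null model.
y_rep = rng.normal(null_mean, null_sd, size=(1000, t.size))

# 4. Compare the test statistic for each simulated dataset to the data.
rep_slopes = np.array([np.polyfit(t, yr, deg=1)[0] for yr in y_rep])

# 5. (Maybe) convert to a probability: how often does the null model
#    produce a slope at least as extreme as the observed one?
p = np.mean(np.abs(rep_slopes) >= np.abs(obs_slope))
print(f"observed slope = {obs_slope:.3f}, tail probability ≈ {p:.3f}")
```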
Metrics for Model Adequacy/Performance
Explanatory Metrics
Based on same training/calibration data
RMSE, \(R^2\), \(\log p(y | M)\)
Predictive Metrics
Held-out data/cross-validation
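A small illustration of the distinction (synthetic data and an arbitrarily chosen polynomial model, just for this sketch): the same metrics are computed on the calibration data (explanatory) and on a held-out subset (predictive).

```python
# Sketch: RMSE and R^2 computed on calibration data ("explanatory")
# versus held-out data ("predictive") for the same fitted model.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Randomly hold out 20 points; calibrate on the other 60.
idx = rng.permutation(x.size)
cal, new = idx[:60], idx[60:]

# Fit a (deliberately flexible) polynomial to the calibration data.
coefs = np.polyfit(x[cal], y[cal], deg=9)

def metrics(xs, ys):
    yhat = np.polyval(coefs, xs)
    rmse = np.sqrt(np.mean((ys - yhat) ** 2))
    r2 = 1 - np.sum((ys - yhat) ** 2) / np.sum((ys - ys.mean()) ** 2)
    return rmse, r2

print("explanatory (calibration data):", metrics(x[cal], y[cal]))
print("predictive (held-out data):    ", metrics(x[new], y[new]))
```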
Bias-Variance Tradeoff
What Is The Goal of Model Selection?
Key Idea: Model selection consists of navigating the bias-variance tradeoff.
Model error (e.g. RMSE) is a combination of irreducible error, bias, and variance.
Setting for Model Selection
Suppose we have a data-generating process \[y = f(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma).\] We want to fit a model \(\hat{f}(x) \approx f(x)\) and use it to make predictions \(\hat{y} = \hat{f}(x)\).
Bias
Bias is error from systematic mismatch between the model predictions and the underlying data-generating process (\(\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)\)).
Bias comes from under-fitting meaningful relationships between inputs and outputs.
too few degrees of freedom (“too simple”)
neglected processes.
Variance
Variance is error from over-sensitivity to small fluctuations in the training data (\(\text{Variance} = \text{Var}[\hat{f}(x)]\)).
Variance can come from over-fitting noise in the data.
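Putting these together: for squared-error loss in the setting above, the expected prediction error at a point \(x\) decomposes (the standard result referenced below) as \[\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}.\]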
Upshot: To achieve a given error level, you can reduce bias (with a more “complex” model) or reduce variance (with a “simpler” model), but there’s a tradeoff.
Bias-Variance Tradeoff More Generally
This decomposition is for MSE, but the principle holds more generally.
Models which fit the training data better “on average” (low bias) are more likely to overfit (high variance);
models which are less sensitive to the particular training data (low variance) will fit it worse “on average” (higher bias).
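A rough numerical illustration (synthetic data, with polynomial degree as a stand-in for model complexity): training error keeps shrinking as complexity grows, while held-out error typically bottoms out at an intermediate complexity and then worsens.

```python
# Sketch: the bias-variance tradeoff as a sweep over polynomial degree.
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Draw n noisy observations from a smooth "true" function."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)

def rmse(coefs, xs, ys):
    return np.sqrt(np.mean((ys - np.polyval(coefs, xs)) ** 2))

x_train, y_train = simulate(30)   # small training set
x_test, y_test = simulate(200)    # large held-out set for a stable estimate

for deg in (1, 3, 6, 12):
    coefs = np.polyfit(x_train, y_train, deg=deg)
    print(f"degree {deg:2d}: train RMSE = {rmse(coefs, x_train, y_train):.3f}, "
          f"held-out RMSE = {rmse(coefs, x_test, y_test):.3f}")
```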
Model Complexity
Model complexity is not necessarily the same as the number of parameters.
Sometimes processes in the model can compensate for each other (reducing the effective degrees of freedom).
This can help improve the representation of the dynamics and reduce error/uncertainty even when additional parameters are included.
Leave-One-Out Cross-Validation
Drop each value \(y_i\), refit the model on the rest of the data, and check predictive skill for \(y_i\), which requires refitting the model without \(y_i\) for every data point.
Leave-\(k\)-Out Cross-Validation
Drop \(k\) values, refit model on rest of data, check for predictive skill.
As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]
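A sketch of the \(k = 1\) (leave-one-out) case for a simple synthetic regression, which makes the computational cost explicit: the model is refit once per data point.

```python
# Sketch: leave-one-out cross-validation for a linear model (synthetic data).
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

loo_errors = []
for i in range(x.size):
    keep = np.arange(x.size) != i                  # drop the i-th point
    coefs = np.polyfit(x[keep], y[keep], deg=1)    # refit without y_i
    loo_errors.append(y[i] - np.polyval(coefs, x[i]))  # predictive error at y_i

print(f"LOO-CV RMSE ≈ {np.sqrt(np.mean(np.square(loo_errors))):.3f}")
```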
Challenges with Cross-Validation
This can be very computationally expensive!
We often don’t have a lot of data for calibration, so holding some back can be a problem.
How do we divide data with spatial or temporal structure? This can be addressed by partitioning the data more cleverly, e.g. holding out the end of a time series, \[y = \{y_{1:t},\, y_{(t+1):T}\},\] but this makes the data-scarcity problem worse.
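One common option for temporal structure (sketched below; the minimum window length is an arbitrary choice) is an expanding-window split, so that each fit is validated only on data that come after its training window.

```python
# Sketch: expanding-window splits for a time series y_1, ..., y_T.
import numpy as np

T = 20
y = np.arange(T, dtype=float)   # stand-in for a real time series
min_train = 10                  # arbitrary minimum training-window length

for t_split in range(min_train, T):
    y_fit, y_holdout = y[:t_split], y[t_split:]   # {y_{1:t}} vs. {y_{(t+1):T}}
    # ...fit the model on y_fit and score its predictions of y_holdout here...
    print(f"fit on first {y_fit.size} points, validate on the remaining {y_holdout.size}")
```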
Expected Out-Of-Sample Predictive Accuracy
The out-of-sample predictive fit of a new data point \(\tilde{y}_i\) is \[\log p(\tilde{y}_i \mid y) = \log \int_{\theta} p(\tilde{y}_i \mid \theta) p(\theta \mid y) d\theta,\] and the expected out-of-sample accuracy averages this over the distribution \(P\) of new data.
We don’t know \(P\) (the distribution of new data)!
We need some measure of the error induced by using an approximating distribution \(Q\) from some model.
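A standard choice for that measure is the Kullback-Leibler divergence from \(P\) to the approximation \(Q\), \[D_{\text{KL}}(P \| Q) = \int \log \frac{P(\tilde{y})}{Q(\tilde{y})}\, P(\tilde{y})\, d\tilde{y} = \mathbb{E}_P\left[\log P(\tilde{y})\right] - \mathbb{E}_P\left[\log Q(\tilde{y})\right].\] The first term does not depend on the model, so ranking models by \(D_{\text{KL}}\) is equivalent to ranking them by their expected log predictive density \(\mathbb{E}_P\left[\log Q(\tilde{y})\right]\).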
Key Takeaways and Upcoming Schedule
Key Takeaways
Model selection is a balance between bias (underfitting) and variance (overfitting).
For predictive assessment, leave-one-out cross-validation is an ideal, but it is hard to implement in practice (particularly for time series).
An Important Caveat
Model selection can result in significant overfitting when separated from hypothesis-driven model development (Freedman, 1983; Smith, 2018).
You are better off thinking about the scientific or engineering problem you want to solve and using domain knowledge/checks, rather than throwing a large number of possible models into the selection machinery.
Regularizing priors reduce potential for overfitting.