Probability Models and Model Residuals

Lecture 03

January 29, 2024

Last Class(es)

Modes of Data Analysis

Probability Review

Distributions
Central Limit Theorem
Confidence Intervals

Questions?

Text: VSRIKRISH to 22333

URL: https://pollev.com/vsrikrish

See Results

Modeling Example: Radiative Forcing and Warming

Historical Temperature Anomalies

Code

temps = CSV.read("data/climate/HadCRUT.5.0.1.0.analysis.summary_series.global.annual.csv", DataFrame, delim=",")

time_obs = temps[:, 1]
temp_obs = temps[:, 2]
temp_lo = temps[:, 3]
temp_hi = temps[:, 4]

temp_lo = temp_lo .- mean(temp_obs[1:20])
temp_hi = temp_hi .- mean(temp_obs[1:20])
temp_obs = temp_obs .- mean(temp_obs[1:20]) # compute anomalies relative to first 20 years of data

plot(time_obs, temp_obs, ribbon=(temp_obs-temp_lo,temp_hi-temp_obs), color="blue", linewidth=2, fillalpha=0.2, legend=false, xlabel="Year", ylabel="Temperature anomaly (°C)", labelfontsize=18, tickfontsize=16, bottom_margin=10mm, left_margin=10mm)
plot!(size=(950, 450))

Figure 1: Global temperature anomalies

Data Source: HadCRUT 5.0.1.0

Planetary Energy Balance

Representation of Planetary Energy Balance

Source: Reprinted from A Climate Modeling Primer, A. Henderson-Sellers and K. McGuffie, Wiley, pg. 58, (1987) via https://www.e-education.psu.edu/meteo469/node/137.

Radiative Forcing

Climate changes result from changes to the energy balance of the planet (or radiative forcings), due to e.g.:

greenhouse gas emissions (which trap radiation, warming the planet);
aerosol emissions from air pollution or volcanic eruptions (which block incoming radiation, cooling the planet);
changes to the solar cycle (which can increase or decrease the incoming solar radiation).

Historical Radiative Forcing

Code

# Dataset from https://zenodo.org/record/3973015
# The CSV is read into a DataFrame object, and we specify that it is comma delimited
forcings_all_85 = CSV.read("data/climate/ERF_ssp585_1750-2500.csv", DataFrame, delim=",")

# Separate out the individual components
forcing_co2_85 = forcings_all_85[!,"co2"]
# Get total aerosol forcings
forcing_aerosol_rad_85 = forcings_all_85[!,"aerosol-radiation_interactions"]
forcing_aerosol_cloud_85 = forcings_all_85[!,"aerosol-cloud_interactions"]
forcing_aerosol_85 = forcing_aerosol_rad_85 + forcing_aerosol_cloud_85
forcing_total_85 = forcings_all_85[!,"total"]
forcing_non_aerosol_85 = forcing_total_85 - forcing_aerosol_85
forcing_other_85 = forcing_total_85 - (forcing_co2_85 + forcing_aerosol_85)

t = time_forcing = Int64.(forcings_all_85[!,"year"]) # Ensure that years are interpreted as integers

plot(xlabel="Year", ylabel="Radiative Forcing (W/m²)", tickfontsize=16, guidefontsize=18, legendfontsize=16, leftmargin=10mm, bottommargin=5mm, right_margin=5mm)
plot!(time_forcing, forcing_total_85, label="Total", color=:black, linewidth=3)
plot!(time_forcing, forcing_co2_85, label="CO₂", color=:orange, linewidth=2)
plot!(time_forcing, forcing_aerosol_85, label="Aerosol", color=:blue, linewidth=2)
plot!(time_forcing, forcing_other_85, label="Other", color=:purple, linewidth=2)
plot!(size=(800, 450))
xlims!((1750, 2020))
ylims!(-4.5, 5)

Figure 2: Historical and projected radiative forcings.

Data Source: https://zenodo.org/records/3973015

What Are Some Sources of Relevant Uncertainty in Understanding Past and Future Climate Changes and Impacts?

One key question: what is the sensitivity of warming to continued CO₂ emissions?

The Energy Balance Model (EBM)

\[\begin{align*} \overbrace{\frac{dH}{dt}}^{\text{change in heat}} &= \overbrace{F}^{\text{RF}} - \overbrace{\lambda T}^{\substack{\text{change in} \\ \text{temperature}}} \\ \underbrace{C}_{\substack{\text{ocean heat} \\ \text{capacity}}} \frac{dT}{dt} &= F - \lambda T \\ cd \frac{dT}{dt} &= F - \lambda T, \end{align*}\]

The EBM (cont’d)

\(c = 4.184\times 10^6 \\ \text{J/K/m}^2\) is the specific heat of water per area.
Total RF: \[F = F_\text{non-aerosol} + \alpha F_\text{aerosol}.\]
The climate feedback factor \(\lambda\) controls how much the Earth warms in response to radiative forcing.

EBM Solution

Use Euler discretization:

\[\begin{gather*} C \frac{dT}{dt} = F - \lambda T \\\\ \Rightarrow C \frac{T_{i+1}-T_i}{\Delta t} = F_i - \lambda T_i \\\\ \Rightarrow \bbox[yellow, 10px, border:5px solid red]{T_{i+1} = T_i + \frac{F_i - \lambda T_i}{C} \Delta t} \end{gather*}\]

Equilibrium Climate Sensitivity (ECS)

Under steady-state conditions (constant \(F\) and \(dT/dt = 0\)), \[T = \frac{F}{\lambda}.\]

When we double atmospheric CO₂, we refer to the equilibrium temperature \(S\) as the equilibrium climate sensitivity:

\[S = \underbrace{F_{2\times \text{CO}_2}}_{\approx 4 \text{W/m}^2}/\lambda\]

Model Fitting

Degree of Freedom / Free Parameters

There are a few uncertain parameters:

\(\lambda\) or \(S\)
\(d\) (ocean mixing depth)
\(\alpha\) (aerosol scaling factor)

Model Fitting By Minimizing Loss

Idea: Find best estimates \(\theta^*\) of model parameters \(\theta\) by minimizing the mismatch between data and simulations (denote by \(\mathbf{y} = F(\theta)\)).

Key choice: “loss function” \(L(\mathbf{y}, \mathbf{x})\), then: \[\theta^* = \underset{\theta}{\operatorname{argmin}} L(F(\theta), \mathbf{x}).\]

Can you think of some common loss functions? What do they imply about how to penalize model error?

Programming Implementation

# use default values of S=3.2°C, d=100m, α=1.3
function ebm(rf_nonaerosol, rf_aerosol; p=(3.2, 100, 1.3))
    # set up model parameters
    S, d, α = p # this unpacks the parameter tuple into variables
    F2xCO₂ = 4.0 # radiative forcing [W/m²] for a doubling of CO₂
    λ = F2xCO₂ / S

    c = 4.184e6 # heat capacity/area [J/K/m²]
    C = c*d # heat capacity of mixed layer (per area)
    F = rf_nonaerosol + α*rf_aerosol # radiative forcing
    Δt = 31558152. # annual timestep [s]

    T = zero(F)
    for i in 1:length(F)-1
        T[i+1] = T[i] + (F[i] - λ*T[i])/C * Δt
    end
    
    # return after normalizing to reference period
    return T .- mean(T[1:20])
end

Model Evaluation (Default Parameters)

Code

# generate simulations
sim_years = 1850:2020 # model years to simulate
idx = indexin(sim_years, t) # find indices in t vector of simulation years
# since we specified default values for p, those are used for the parameters
temp_default = ebm(forcing_non_aerosol_85[idx], forcing_aerosol_85[idx]) 

temp_obs = temp_obs[indexin(sim_years, time_obs)] # filter to simulated years for plotting
temp_obs = temp_obs .- mean(temp_obs[1:20]) # re-normalize to be consistent with the model
# plot simulated output and data
plot(sim_years, temp_default, xlabel="Year", ylabel="Temperature Anomaly (°C)", color=:blue, label="Simulation", linewidth=3, tickfontsize=16, guidefontsize=18, legendfontsize=16, leftmargin=10mm, bottommargin=5mm, right_margin=5mm)
scatter!(sim_years, temp_obs, color=:black, linewidth=2, fillalpha=0.2, label="Data")

Figure 3: Comparison of model simualtion with default parameters and data.

Fitting the EBM By Minimizing RMSE

# define RMSE function
rmse(y, x) = sqrt(mean((y .- x).^2))

# define wrapper function to map parameters to model evaluations
ebm_wrap(params) = ebm(forcing_non_aerosol_85[idx], forcing_aerosol_85[idx], p = params)
# minimize RMSE within some range for each parameter
# important to make everything a Float instead of an Int 
lower = [1.0, 50.0, 0.0]
upper = [4.0, 150.0, 2.0]
p0 = [2.0, 100.0, 1.0]
result = Optim.optimize(params -> rmse(ebm_wrap(params), temp_obs), lower, upper, p0)
θ = result.minimizer
θ

Fitting the EBM By Minimizing RMSE

3-element Vector{Float64}:
  1.9437477350724541
 86.39106695508048
  0.789333170364518

Fitted Results

Code

temp_fitted = ebm(forcing_non_aerosol_85[idx], forcing_aerosol_85[idx]; p = θ) 

# plot simulated output and data
plot(sim_years, temp_default, xlabel="Year", ylabel="Temperature Anomaly (°C)", color=:blue, label="Default Simulation", linewidth=3, tickfontsize=16, guidefontsize=18, legendfontsize=16, leftmargin=10mm, bottommargin=5mm, right_margin=5mm)
plot!(sim_years, temp_fitted, color=:red, label="Fitted Simulation", linewidth=3)
scatter!(sim_years, temp_obs, color=:black, linewidth=2, fillalpha=0.2, label="Data")

Figure 4: Comparison of model simulation with fitted parameters, default parameters, and data.

What the EBM Neglects

What are some things the EBM neglects or simplifies?

Important

Since models can be so stylized, optimal model parameters which ought to correspond to physical values may not match their true values.

Statistical Interpretation of Model Fitting

Likelihood of Parameters

Probability of data given probability model: \(p(\mathbf{x} | \theta)\)

Likelihood of parameters given probability model and data: \(\mathcal{L}(\theta | \mathbf{x}) = p(\mathbf{x} | \theta)\)

Example (normal distribution):

\[\mathcal{L}(\mu, \sigma | \mathbf{x}) = p(\mathbf{x} | \mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}^2\right)\right)\]

Probability Model Fitting

Fitting a probability model ⇔ Maximizing Likelihood

We often work with log-likelihoods since \[\operatorname{argmax} \left[\log f(x)\right] = \operatorname{argmax} f(x)\]

and sums and small numbers are more stable.

Code

x = 0:0.1:3
f(x) = abs.(3 - (x-2)^2)
plot(x, f.(x), color=:blue, label=L"$f(x)$", linewidth=3, legend=:bottomright, tickfontsize=16, legendfontsize=16, guidefontsize=18)
plot!(x, log.(f.(x)), color=:red, label=L"$\log f(x))$", linewidth=3)
plot!(size=(500, 400))

Figure 5: Comparing logarithm of a function with the function.

Model Residuals

Model Residuals are the “error” between the model simulations and the data.

\[\underbrace{\mathbf{r}}_{\text{residuals}} = F(\mathbf{x}; \theta) - \underbrace{\mathbf{y}}_{\text{data}}\]

The connection between statistical modeling and simulation modeling is developing a probability model for the residuals.

Statistical Interpretation of RMSE Minimization

Claim: Minimizing the (R)MSE is the same as maximizing likelihood assuming independent and identically-normally-distributed residuals (with known variance).

\[ \begin{gather*} \mathbf{y} = F(\mathbf{x}) - \mathbf{r} \\ r_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma) \end{gather*} \]

Likelihood of Normal Residuals

Data Probability Model:

\[y_i \sim \mathcal{N}(F(x_i), \sigma)\]

\[\mathcal{L}(\theta | \mathbf{y}; F) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp(-\frac{y_i - F(x_i)^2}{2\sigma^2})\]

\[\log \mathcal{L}(\theta | \mathbf{y}; F) = \sum_{i=1}^n \left[\log \frac{1}{\sqrt{2\pi}} - \frac{1}{2\sigma^2}(y_i - F(x_i))^2 \right]\]

\[ \begin{align} \log \mathcal{L}(\theta | \mathbf{y}, F) &= \sum_{i=1}^n \left[\log \frac{1}{\sqrt{2\pi}} - \frac{1}{2\sigma^2}(y_i - F(x_i)) ^2 \right] \\ &= n \log \frac{1}{\sqrt{2\pi}} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - F(x_i))^2 \end{align} \]

Ignoring constants (including \(\sigma\)):

\[\log \mathcal{L}(\theta | \mathbf{y}, F) \propto -\sum_{i=1}^n (y_i - F(x_i))^2.\]

Maximizing \(f(x)\) is equivalent to minimizing \(-f(x)\):

\[ - \log \mathcal{L}(\theta | \mathbf{y}, F) \propto \sum_{i=1}^n (y_i - F(x_i))^2 = \text{MSE} \]

Key Points

Goal of model-based data analysis: Gain insights into data or underlying system through model simulations.
Requires developing probability model for data or residuals.
Fitting a model as maximum likelihood estimation.
Can develop more complex models (autocorrelated residuals, non-normal errors) but these may not result in “standard” error metrics.

Upcoming Schedule

Next Class(es)

Wednesday: Bayesian statistics and probability models

Next Week: Exploratory data analysis

Assessments

Friday: HW1 and Exercise 1 due by 9pm.