Figure 5: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.
Block Maxima
Given independent and identically-distributed random variables \(X_1, X_2, \ldots, X_{mk}\), what is the distribution of maxima of “blocks” of size \(m\):
\[\tilde{X}_i = \max_{(i-1)m < j \leq im} X_j,\]
for \(i = 1, 2, \ldots, k\)?
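As a minimal sketch of this computation (hypothetical i.i.d. data; block_maxima is an illustrative helper, not from the course code):

using Random

# split x into k complete blocks of size m and take each block's maximum
function block_maxima(x, m)
    k = length(x) ÷ m # number of complete blocks; any remainder is dropped
    return [maximum(x[(i-1)*m+1:i*m]) for i in 1:k]
end

Random.seed!(1)
x = randn(10_000) # hypothetical i.i.d. series
x̃ = block_maxima(x, 100) # k = 100 block maxima from blocks of size m = 100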
Analogy: Central Limit Theorem
Recall that the Central Limit Theorem tells us:
If we have independent and identically-distributed variables \(X_1, X_2, \ldots, X_n\) from some population with mean \(\mu\) and standard deviation \(\sigma\), the sample mean \(\bar{X}\) has the approximate distribution \[\bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).\]
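A quick simulation sketch of this result (the Exponential population is purely illustrative):

using Distributions, Random, Statistics

Random.seed!(1)
n = 1_000 # sample size
# 5,000 replicated sample means from an Exponential(1) population (μ = σ = 1)
means = [mean(rand(Exponential(1), n)) for _ in 1:5_000]
# CLT: the means should be approximately Normal(1, 1/√1000)
(mean(means), std(means), 1 / sqrt(n))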
The Extremal Types Theorem (often called the Extreme Value Theorem) is the equivalent for block maxima.
If the limiting distribution exists, it can only be a Generalized Extreme Value (GEV) distribution:
\[H(y) = \exp\left\{-\left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},\] defined for \(y\) such that \(1 + \xi(y-\mu)/\sigma > 0\).
Figure 6: Shape of the GEV distribution with different choices of \(\xi\).
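A sketch along the lines of Figure 6, assuming StatsPlots.jl for the distribution plotting recipes used elsewhere in these notes:

using Distributions, StatsPlots

# compare GEV densities with μ = 0, σ = 1 across shape parameters ξ
# note the argument order in Distributions.jl: GeneralizedExtremeValue(μ, σ, ξ)
p = plot(; xlabel="y", ylabel="Density")
for ξ in (-0.5, 0.0, 0.5)
    plot!(p, GeneralizedExtremeValue(0.0, 1.0, ξ); linewidth=2, label="ξ = $ξ")
end
p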
GEV Types
\(\xi < 0\): extremes are bounded (this Weibull type comes up in the context of temperature and wind speed extremes).
\(\xi > 0\): tails are heavy, and the expectation does not exist when \(\xi \geq 1\). Common for streamflow, storm surge, and precipitation.
The Gumbel case (\(\xi = 0\)) arises for extremes of normal distributions but doesn't occur often in real-world data; the numerical sketch below illustrates these tail behaviors.
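A sketch with illustrative parameter values:

using Distributions

# ξ < 0: support is bounded above by μ - σ/ξ
maximum(GeneralizedExtremeValue(0.0, 1.0, -0.5)) # 2.0
# ξ > 0: unbounded, with a heavy upper tail
quantile(GeneralizedExtremeValue(0.0, 1.0, 0.5), 0.999) # ≈ 61
# ξ = 0 (Gumbel): unbounded but light-tailed
quantile(GeneralizedExtremeValue(0.0, 1.0, 0.0), 0.999) # ≈ 6.9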
San Francisco Tide Gauge Data
Code
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end
dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(
    datetime=dat.datetime[ma_offset+1:end-ma_offset],
    residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset)
)

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :],
    groupby(transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year, dat_annmax.residual;
    xlabel="Year", ylabel="Annual Max Tide Level (m)",
    label=false, marker=:circle, markersize=5,
    tickfontsize=16, guidefontsize=18
)
p2 = histogram(
    dat_annmax.residual;
    normalize=:pdf, orientation=:horizontal,
    label=false, xlabel="PDF", ylabel="", yticks=[],
    tickfontsize=16, guidefontsize=18
)
l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))
Figure 7: Annual maxima surge data from the San Francisco, CA tide gauge.
Block Maxima Fit
Code
# find GEV fit
# for most distributions we could use Distributions.fit(), but this isn't implemented in Distributions.jl for GEV
init_θ = [1.0, 1.0, 1.0]
gev_lik(θ) = -sum(logpdf(GeneralizedExtremeValue(θ[1], θ[2], θ[3]), dat_annmax.residual))
θ_mle = Optim.optimize(gev_lik, init_θ).minimizer
p = histogram(
    dat_annmax.residual;
    normalize=:pdf, label="Data",
    xlabel="Annual Maximum (m)", ylabel="PDF", yticks=[],
    tickfontsize=16, guidefontsize=18, legendfontsize=16,
    left_margin=10mm, right_margin=10mm, bottom_margin=5mm
)
plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=3, label="GEV Fit")
plot!(fit(LogNormal, dat_annmax.residual), linewidth=3, label="LogNormal Fit", color=:black)
xlims!((1, 1.75))
Figure 8: GEV fit to annual maxima of San Francisco Tide Gauge Data
GEV Q-Q Plot
Code
p1 = qqplot(
    GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]),
    dat_annmax.residual;
    linewidth=3, markersize=5,
    xlabel="Theoretical Quantile", ylabel="Empirical Quantile",
    tickfontsize=16, guidefontsize=18, legendfontsize=16
)
plot!(p1, size=(600, 450))

return_periods = 2:500
# get GEV return levels
return_levels = quantile.(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 1 .- (1 ./ return_periods))
# fit lognormal to get return levels for comparison
lognormal_fit = fit(LogNormal, dat_annmax.residual)
return_levels_lognormal = quantile.(lognormal_fit, 1 .- (1 ./ return_periods))

# function to calculate exceedance probabilities and plotting positions based on the data quantiles
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y) # sorted values of y
    nxp = [r / (N + 1) for r in 1:N] # non-exceedance plotting positions
    xp = 1 .- nxp # exceedance probabilities
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(
    return_periods, return_levels;
    linewidth=3, color=:blue, label="GEV Model Fit",
    tickfontsize=16, legendfontsize=18, guidefontsize=18,
    bottom_margin=5mm, left_margin=5mm, right_margin=10mm,
    legend=:bottomright
)
plot!(p2, return_periods, return_levels_lognormal, linewidth=3, color=:orange, label="LogNormal Model Fit")
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=5)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))
display(p1)
display(p2)
(a) GEV fit to annual maxima of San Francisco Tide Gauge Data
Figure 11: Peaks Over Thresholds for the SF Tide Gauge Data
Declustering
Arns et al. (2013) note that there is no clear declustering time period to use; we need to rely on physical understanding of events and "typical" durations.
If we have prior knowledge about the duration of physical processes leading to clustered extremes (e.g. storm durations), we can use it. Otherwise, we need some way to estimate cluster duration from the data.
Extremal Index
The most common measure is the extremal index \(\theta(u)\), which characterizes the clustering of exceedances of a given threshold \(u\) based on the inter-exceedance times.
\[0 \leq \theta(u) \leq 1,\]
where \(\theta(u) = 1\) means independence and \(\theta(u) = 0\) means the entire dataset is one cluster.
Extremal Index
\(\theta(u)\) has two meanings:
The “propensity to cluster”: \(\theta\) is the probability that the process has left one exceedance cluster;
The "reciprocal of the clustering duration": \(1/\theta\) is the mean cluster duration (the expected number of exceedances per cluster).
Computing the Extremal Index
This estimator is taken from Ferro & Segers (2003).
Let \(N = \sum_{i=1}^n \mathbb{I}(X_i > u)\) be the total number of exceedances.
Denote by \(1 \leq S_1 < \ldots < S_N \leq n\) the exceedance times.
Then the inter-exceedance times are \[T_i = S_{i+1} - S_i, \quad 1 \leq i \leq N-1.\]
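Ferro & Segers' moment-based estimator, which the code below implements, is \[\hat{\theta}(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N-1)\sum_{i=1}^{N-1} T_i^2}.\]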
# find total number of exceedances and exceedance times
dat_ma.residual = dat_ma.residual ./ 1000 # convert to m
S = findall(dat_ma.residual .> thresh)
N = length(S)
T = diff(S) # get difference between adjacent exceedances
θ = 2 * sum(T)^2 / ((N - 1) * sum(T .^ 2)) # extremal index
0.23700514680023474
For the SF tide gauge data and \(u = 1.0\ \text{m}\), this gives \(1/\theta \approx 4.2\), so we use a declustering time of 4 hours.
Mapping Data To Clusters
# cluster data points which occur within period
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# cluster exceedances that occur within a four-hour window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
dat_exceed = dat_ma[dat_ma.residual .> thresh, :]
dat_exceed = @transform dat_exceed :cluster = assign_cluster(:datetime, Dates.Hour(4))
# find maximum value within cluster
dat_decluster = combine(dat_exceed -> dat_exceed[argmax(dat_exceed.residual), :], groupby(dat_exceed, :cluster))
dat_decluster
Figure 13: GPD fit to tide gauge readings over 1 m of San Francisco Tide Gauge Data
But What About Exceedance Frequency?
The GPD fit gives a distribution for how extreme threshold exceedances are when they occur.
But how often do they occur?
Code
# add column with years of occurrence
dat_decluster = @transform dat_decluster :year = Dates.year.(dat_decluster.datetime)
# group by year and add up occurrences
exceed_counts = combine(groupby(dat_decluster, :year), nrow => :count)
delete!(exceed_counts, nrow(exceed_counts)) # including 2023 will bias the count estimate
p = histogram(
    exceed_counts.count;
    legend=false,
    xlabel="Yearly Exceedances", ylabel="Count",
    guidefontsize=18, tickfontsize=16,
    left_margin=5mm, bottom_margin=10mm
)
plot!(size=(600, 400))
Figure 14: Histogram of the number of exceedances in each year
Poisson - Generalized Pareto Process
Model the number of exceedances each year with a Poisson distribution,
\[n \sim \text{Poisson}(\lambda_u).\]
The MLE for \(\lambda_u\) is the mean of the count data, in this case 49.5.
Then, for each \(i = 1, \ldots, n\), sample \[X_i \sim \text{GeneralizedPareto}(u, \sigma, \xi)\] to get the level of each exceedance.
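A sketch of simulating one year from this process (σ̂ and ξ̂ are hypothetical stand-ins for the fitted GPD scale and shape, which keeps the sketch self-contained):

using Distributions, Random, Statistics

Random.seed!(1)
λ_u = mean(exceed_counts.count) # MLE: mean yearly count of declustered exceedances
u = 1.0 # threshold (m)
σ̂, ξ̂ = 0.05, 0.1 # hypothetical stand-ins for the fitted GPD parameters
n = rand(Poisson(λ_u)) # number of exceedances in the simulated year
X = rand(GeneralizedPareto(u, σ̂, ξ̂), n) # level of each exceedance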
Poisson - Generalized Pareto Process Return Levels
Then the return level for return period \(m\) years can be obtained by solving the quantile equation (see Coles (2001) for details); for \(\xi \neq 0\), \[y_m = u + \frac{\sigma}{\xi}\left[(m \lambda_u)^\xi - 1\right].\]
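As a short sketch (reusing λ_u and the hypothetical σ̂, ξ̂ from above):

# m-year return levels from the quantile equation, for ξ ≠ 0
return_level(m, u, λ, σ, ξ) = u + (σ / ξ) * ((m * λ)^ξ - 1)
return_levels_pp = return_level.(2:500, 1.0, λ_u, σ̂, ξ̂)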
Statistical models of extremes are highly sensitive to details: the shape parameter \(\xi\), the threshold \(u\), etc.
These models assume independent observations (hence the need for declustering).
What We Haven’t Discussed
Nonstationary models
Multivariate extremes are difficult: what does this even mean?
Upcoming Schedule
Wednesday: Clusters and Mixture models
Monday: February Break!
Next Wednesday: In-Class Figure Discussion
Assessments
Friday:
Submit figures for discussion (Exercise 5)
HW2 Due
Project proposal
References
Arns, A., Wahl, T., Haigh, I. D., Jensen, J., & Pattiaratchi, C. (2013). Estimating extreme water level probabilities: A comparison of the direct methods and recommendations for best practise. Coast. Eng., 81, 51–66. https://doi.org/10.1016/j.coastaleng.2013.07.003
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag London.
Ferro, C. A. T., & Segers, J. (2003). Inference for clusters of extreme values. J. R. Stat. Soc. Series B Stat. Methodol., 65, 545–556. https://doi.org/10.1111/1467-9868.00401
Zarekarizi, M., Srikrishnan, V., & Keller, K. (2020). Neglecting uncertainties biases house-elevation decisions to manage riverine flood risks. Nat. Commun., 11, 5361. https://doi.org/10.1038/s41467-020-19188-9