Figure 5: 2015 detrended tide gauge data from the Norfolk, VA tide gauge.
Block Maxima
Given independent and identically-distributed random variables \(X_1, X_2, \ldots, X_{mk}\), what is the distribution of maxima of “blocks” of size \(m\):
\[\tilde{X}_i = \max_{(i-1)m < j \leq im} X_j,\]
for \(i = 1, 2, \ldots, k\)?
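As a minimal sketch of this computation (hypothetical i.i.d. data; block_maxima is an illustrative helper, not from the course code):

using Random

# split x into k complete blocks of size m and take each block's maximum
function block_maxima(x, m)
    k = length(x) ÷ m # number of complete blocks; any remainder is dropped
    return [maximum(x[(i-1)*m+1:i*m]) for i in 1:k]
end

Random.seed!(1)
x = randn(10_000) # hypothetical i.i.d. series
x̃ = block_maxima(x, 100) # k = 100 block maxima from blocks of size m = 100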
Analogy: Central Limit Theorem
Recall that the Central Limit Theorem tells us:
If we have independent and identically-distributed variables \(X_1, X_2, \ldots, X_n\) from some population with mean \(\mu\) and standard deviation \(\sigma\), the sample mean \(\bar{X}\) has the approximate distribution \[\bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).\]
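A quick simulation sketch of this result (the Exponential population is purely illustrative):

using Distributions, Random, Statistics

Random.seed!(1)
n = 1_000 # sample size
# 5,000 replicated sample means from an Exponential(1) population (μ = σ = 1)
means = [mean(rand(Exponential(1), n)) for _ in 1:5_000]
# CLT: the means should be approximately Normal(1, 1/√1000)
(mean(means), std(means), 1 / sqrt(n))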
The Extremal Types Theorem (often called the Extreme Value Theorem) is the equivalent for block maxima.
If the limiting distribution exists, it can only be a Generalized Extreme Value (GEV) distribution:
\[H(y) = \exp\left\{-\left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},\] defined for \(y\) such that \(1 + \xi(y-\mu)/\sigma > 0\).
Figure 6: Shape of the GEV distribution with different choices of \(\xi\).
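A sketch along the lines of Figure 6, assuming StatsPlots.jl for the distribution plotting recipes used elsewhere in these notes:

using Distributions, StatsPlots

# compare GEV densities with μ = 0, σ = 1 across shape parameters ξ
# note the argument order in Distributions.jl: GeneralizedExtremeValue(μ, σ, ξ)
p = plot(; xlabel="y", ylabel="Density")
for ξ in (-0.5, 0.0, 0.5)
    plot!(p, GeneralizedExtremeValue(0.0, 1.0, ξ); linewidth=2, label="ξ = $ξ")
end
p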
GEV Types
\(\xi < 0\): extremes are bounded (this Weibull type comes up in the context of temperature and wind speed extremes).
\(\xi > 0\): tails are heavy, and the expectation does not exist when \(\xi \geq 1\). Common for streamflow, storm surge, and precipitation.
The Gumbel case (\(\xi = 0\)) arises for extremes of normal distributions but doesn't occur often in real-world data; the numerical sketch below illustrates these tail behaviors.
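A sketch with illustrative parameter values:

using Distributions

# ξ < 0: support is bounded above by μ - σ/ξ
maximum(GeneralizedExtremeValue(0.0, 1.0, -0.5)) # 2.0
# ξ > 0: unbounded, with a heavy upper tail
quantile(GeneralizedExtremeValue(0.0, 1.0, 0.5), 0.999) # ≈ 61
# ξ = 0 (Gumbel): unbounded but light-tailed
quantile(GeneralizedExtremeValue(0.0, 1.0, 0.0), 0.999) # ≈ 6.9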
San Francisco Tide Gauge Data
Code
# load SF tide gauge data
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end
dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(
    datetime=dat.datetime[ma_offset+1:end-ma_offset],
    residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset)
)

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :],
    groupby(transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year, dat_annmax.residual;
    xlabel="Year", ylabel="Annual Max Tide Level (m)",
    label=false, marker=:circle, markersize=5,
    tickfontsize=16, guidefontsize=18
)
p2 = histogram(
    dat_annmax.residual;
    normalize=:pdf, orientation=:horizontal,
    label=false, xlabel="PDF", ylabel="", yticks=[],
    tickfontsize=16, guidefontsize=18
)
l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))
Figure 7: Annual maxima surge data from the San Francisco, CA tide gauge.
Block Maxima Fit
Code
# find GEV fit
# for most distributions we could use Distributions.fit(), but this isn't implemented in Distributions.jl for GEV
init_θ = [1.0, 1.0, 1.0]
gev_lik(θ) = -sum(logpdf(GeneralizedExtremeValue(θ[1], θ[2], θ[3]), dat_annmax.residual))
θ_mle = Optim.optimize(gev_lik, init_θ).minimizer
p = histogram(
    dat_annmax.residual;
    normalize=:pdf, label="Data",
    xlabel="Annual Maximum (m)", ylabel="PDF", yticks=[],
    tickfontsize=16, guidefontsize=18, legendfontsize=16,
    left_margin=10mm, right_margin=10mm, bottom_margin=5mm
)
plot!(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), linewidth=3, label="GEV Fit")
plot!(fit(LogNormal, dat_annmax.residual), linewidth=3, label="LogNormal Fit", color=:black)
xlims!((1, 1.75))
Figure 8: GEV fit to annual maxima of San Francisco Tide Gauge Data
GEV Q-Q Plot
Code
p1 = qqplot(
    GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]),
    dat_annmax.residual;
    linewidth=3, markersize=5,
    xlabel="Theoretical Quantile", ylabel="Empirical Quantile",
    tickfontsize=16, guidefontsize=18, legendfontsize=16
)
plot!(p1, size=(600, 450))

return_periods = 2:500
# get GEV return levels
return_levels = quantile.(GeneralizedExtremeValue(θ_mle[1], θ_mle[2], θ_mle[3]), 1 .- (1 ./ return_periods))
# fit lognormal to get return levels for comparison
lognormal_fit = fit(LogNormal, dat_annmax.residual)
return_levels_lognormal = quantile.(lognormal_fit, 1 .- (1 ./ return_periods))

# function to calculate exceedance probabilities and plotting positions based on the data quantiles
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y) # sorted values of y
    nxp = [r / (N + 1) for r in 1:N] # non-exceedance plotting positions
    xp = 1 .- nxp # exceedance probabilities
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(
    return_periods, return_levels;
    linewidth=3, color=:blue, label="GEV Model Fit",
    tickfontsize=16, legendfontsize=18, guidefontsize=18,
    bottom_margin=5mm, left_margin=5mm, right_margin=10mm,
    legend=:bottomright
)
plot!(p2, return_periods, return_levels_lognormal, linewidth=3, color=:orange, label="LogNormal Model Fit")
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=5)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))
display(p1)
display(p2)
(a) GEV fit to annual maxima of San Francisco Tide Gauge Data
Figure 11: Peaks Over Thresholds for the SF Tide Gauge Data
Declustering
Arns et al. (2013) note that there is no clear declustering time period to use; we need to rely on physical understanding of events and "typical" durations.
If we have prior knowledge about the duration of physical processes leading to clustered extremes (e.g. storm durations), we can use it. Otherwise, we need some way to estimate cluster duration from the data.
Extremal Index
The most common measure is the extremal index \(\theta(u)\), which characterizes the clustering of exceedances of a given threshold \(u\) based on the inter-exceedance times.
\[0 \leq \theta(u) \leq 1,\]
where \(\theta(u) = 1\) means independence and \(\theta(u) = 0\) means the entire dataset is one cluster.
Extremal Index
\(\theta(u)\) has two meanings:
The “propensity to cluster”: \(\theta\) is the probability that the process has left one exceedance cluster;
The "reciprocal of the clustering duration": \(1/\theta\) is the mean cluster duration (the expected number of exceedances per cluster).
Computing the Extremal Index
This estimator is taken from Ferro & Segers (2003).
Let \(N = \sum_{i=1}^n \mathbb{I}(X_i > u)\) be the total number of exceedances.
Denote by \(1 \leq S_1 < \ldots < S_N \leq n\) the exceedance times.
Then the inter-exceedance times are \[T_i = S_{i+1} - S_i, \quad 1 \leq i \leq N-1.\]
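Ferro & Segers' moment-based estimator, which the code below implements, is \[\hat{\theta}(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N-1)\sum_{i=1}^{N-1} T_i^2}.\]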
# find total number of exceedances and exceedance times
dat_ma.residual = dat_ma.residual ./ 1000 # convert to m
S = findall(dat_ma.residual .> thresh)
N = length(S)
T = diff(S) # get difference between adjacent exceedances
θ = 2 * sum(T)^2 / ((N - 1) * sum(T .^ 2)) # extremal index
0.23700514680023474
For the SF tide gauge data and \(u = 1.0\ \text{m}\), this gives \(1/\theta \approx 4.2\), so we use a declustering time of 4 hours.
Mapping Data To Clusters
# cluster data points which occur within period
function assign_cluster(dat, period)
    cluster_index = 1
    clusters = zeros(Int, length(dat))
    for i in 1:length(dat)
        if clusters[i] == 0
            clusters[findall(abs.(dat .- dat[i]) .<= period)] .= cluster_index
            cluster_index += 1
        end
    end
    return clusters
end

# cluster exceedances that occur within a four-hour window
# @transform is a macro from DataFramesMeta.jl which adds a new column based on a data transformation
dat_exceed = dat_ma[dat_ma.residual .> thresh, :]
dat_exceed = @transform dat_exceed :cluster = assign_cluster(:datetime, Dates.Hour(4))
# find maximum value within cluster
dat_decluster = combine(dat_exceed -> dat_exceed[argmax(dat_exceed.residual), :], groupby(dat_exceed, :cluster))
dat_decluster
Figure 13: GPD fit to tide gauge readings over 1 m of San Francisco Tide Gauge Data
But What About Exceedance Frequency?
The GPD fit gives a distribution for how extreme threshold exceedances are when they occur.
But how often do they occur?
Code
# add column with years of occurrence
dat_decluster = @transform dat_decluster :year = Dates.year.(dat_decluster.datetime)
# group by year and add up occurrences
exceed_counts = combine(groupby(dat_decluster, :year), nrow => :count)
delete!(exceed_counts, nrow(exceed_counts)) # including 2023 will bias the count estimate
p = histogram(
    exceed_counts.count;
    legend=false,
    xlabel="Yearly Exceedances", ylabel="Count",
    guidefontsize=18, tickfontsize=16,
    left_margin=5mm, bottom_margin=10mm
)
plot!(size=(600, 400))
Figure 14: Histogram of the number of exceedances in each year
Poisson - Generalized Pareto Process
Model the number of exceedances each year with a Poisson distribution,
\[n \sim \text{Poisson}(\lambda_u).\]
The MLE for \(\lambda_u\) is the mean of the count data, in this case 49.5.
Then, for each \(i = 1, \ldots, n\), sample \[X_i \sim \text{GeneralizedPareto}(u, \sigma, \xi)\] to get the level of each exceedance.
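A sketch of simulating one year from this process (σ̂ and ξ̂ are hypothetical stand-ins for the fitted GPD scale and shape, which keeps the sketch self-contained):

using Distributions, Random, Statistics

Random.seed!(1)
λ_u = mean(exceed_counts.count) # MLE: mean yearly count of declustered exceedances
u = 1.0 # threshold (m)
σ̂, ξ̂ = 0.05, 0.1 # hypothetical stand-ins for the fitted GPD parameters
n = rand(Poisson(λ_u)) # number of exceedances in the simulated year
X = rand(GeneralizedPareto(u, σ̂, ξ̂), n) # level of each exceedance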
Poisson - Generalized Pareto Process Return Levels
Then the return level for return period \(m\) years can be obtained by solving the quantile equation (see Coles (2001) for details); for \(\xi \neq 0\), \[y_m = u + \frac{\sigma}{\xi}\left[(m \lambda_u)^\xi - 1\right].\]
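As a short sketch (reusing λ_u and the hypothetical σ̂, ξ̂ from above):

# m-year return levels from the quantile equation, for ξ ≠ 0
return_level(m, u, λ, σ, ξ) = u + (σ / ξ) * ((m * λ)^ξ - 1)
return_levels_pp = return_level.(2:500, 1.0, λ_u, σ̂, ξ̂)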
Statistical models of extremes are highly sensitive to details: the shape parameter \(\xi\), the threshold \(u\), etc.
These models assume independent observations (hence the need for declustering).
What We Haven’t Discussed
Nonstationary models
Multivariate extremes are difficult: what does this even mean?
Upcoming Schedule
Wednesday: Clusters and Mixture models
Monday: February Break!
Next Wednesday: In-Class Figure Discussion
Assessments
Friday:
Submit figures for discussion (Exercise 5)
HW2 Due
Project proposal
References
Arns, A., Wahl, T., Haigh, I. D., Jensen, J., & Pattiaratchi, C. (2013). Estimating extreme water level probabilities: A comparison of the direct methods and recommendations for best practise. Coast. Eng., 81, 51–66. https://doi.org/10.1016/j.coastaleng.2013.07.003
Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. London: Springer-Verlag London.
Ferro, C. A. T., & Segers, J. (2003). Inference for clusters of extreme values. J. R. Stat. Soc. Series B Stat. Methodol., 65, 545–556. https://doi.org/10.1111/1467-9868.00401
Zarekarizi, M., Srikrishnan, V., & Keller, K. (2020). Neglecting uncertainties biases house-elevation decisions to manage riverine flood risks. Nat. Commun., 11, 5361. https://doi.org/10.1038/s41467-020-19188-9