Exercise Set 08: The Bootstrap

BEE 4850/5850, Fall 2024

Due Date

Friday, 3/15/24, 9:00pm

You can find a Jupyter notebook, data, and a Julia 1.9.x environment in the exercise’s Github repository. You should feel free to clone the repository and switch the notebook to another language, or to download the relevant data file(s) and solve the problems without using a notebook. In either of these cases, if you using a different environment, you will be responsible for setting up an appropriate package environment.

Regardless of your solution method, make sure to include your name and NetID on your solution PDF for submission to Gradescope.

Overview

Instructions

The goal of this exercise is for you to explore the differences in the confidence intervals produced by a non-parametric and parametric bootstrap.

Load Environment

The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.

import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

The following packages are included in the environment (to help you find other similar packages in other languages). The code below loads these packages for use in the subsequent notebook (the desired functionality for each package is commented next to the package).

using Distributions # API to work with statistical distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using DataFrames # tabular data storage
using DataFramesMeta # API for chaining DataFrames commands
using Dates # datetime API
using CSV # read/write CSV files
using Random

Problems

Problem 1

Let’s revisit the 2015 Sewell’s Point tide gauge data, which consists of hourly observations and predicted sea-level based on NOAA’s harmonic model, as shown in Figure 1.

function load_data(fname)
    date_format = "yyyy-mm-dd HH:MM"
    # this uses the DataFramesMeta package -- it's pretty cool
    return @chain fname begin
        CSV.File(; dateformat=date_format)
        DataFrame
        rename(
            "Time (GMT)" => "time", "Predicted (m)" => "harmonic", "Verified (m)" => "gauge"
        )
        @transform :datetime = (Date.(:Date, "yyyy/mm/dd") + Time.(:time))
        select(:datetime, :gauge, :harmonic)
        @transform :weather = :gauge - :harmonic
        @transform :month = (month.(:datetime))
    end
end

dat = load_data("data/norfolk-hourly-surge-2015.csv")

plot(dat.datetime, dat.gauge; ylabel="Gauge Measurement (m)", label="Observed", legend=:topleft, xlabel="Date/Time", color=:blue)
plot!(dat.datetime, dat.harmonic, label="Prediction", color=:orange)

Figure 1: 2015 data from the Sewell’s Point tide gauge, including NOAA’s predictions for sea-level harmonics.

We detrend the data to isolate the weather-induced variability by subtracting the predictions from the observations; the results (following the Julia code) are in dat[:, :weather], visualized in Figure 2.

plot(dat.datetime, dat.weather; ylabel="Gauge Weather Variability (m)", label="Detrended Data", linewidth=1, legend=:topleft, xlabel="Date/Time")

Figure 2: Detrended 2015 data from the Sewell’s Point tide gauge.

We would like to understand the uncertainty in an estimate of the median level of weather-induced variability.

In this problem:

Use 1,000 non-parametric bootstrap samples to compute a 95% confidence interval for the median.
Assuming the weather-induced variability is independently and identically distributed according to a normal distribution, compute a 95% confidence interval using 1,000 parametric bootstrap samples.
Compare the two confidence intervals.
What can you say about the bias of the median as an estimator?