library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
This lab is intended to serve as an introduction to the construction of confidence intervals in R. In particular, we will be considering two methods by which we can construct confidence intervals: the point estimate method, using the formula \(\overline{x} \pm c \times SE\), and the bootstrap quantile method.
Before we get into any exercises, let’s take a little bit of time to introduce some new functions and techniques that we will be using to answer the questions.
Helpful Note: This lab is written to be self-contained, meaning that everything that I will ask of you will be introduced in this lab. There is absolutely no expectation that you memorize everything, and it is intended that you will reference these practice sections as needed for the lab and homework.
The two exceptions to this are the use of ggplot() and the use of dplyr functions, which were introduced in the previous lab. Extra guidance will be provided for questions involving dplyr.
Often we will try and use real data for our problems, but it can also be instructive to generate random data. Not only does this permit us to generate as much data as we wish, it has the added bonus that we are able to know exactly the true values of our population.
For example, we can generate random data that is normally distributed with the function rnorm() (“RandomNORMal”). The rnorm() function takes three arguments: n, indicating the sample size; mean, indicating the mean; and sd, indicating the standard deviation. If I wanted to generate 5 random values from a normal distribution with mean 100 and standard deviation 15, I would do:
rnorm(n = 5, mean = 100, sd = 15)
## [1] 114.070 84.846 88.888 80.279 77.416
As before, we don’t have to name the arguments and this will work just as well:
rnorm(5, 100, 15)
## [1] 92.109 100.911 107.750 102.459 70.320
You’ll note that each time we generate random data, it gives us something, well, random. We may find ourselves in situations where we want to be able to recreate exactly a random process. We can do this with the function set.seed():
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.977 101.993 99.489 99.325 150.068
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.977 101.993 99.489 99.325 150.068
As with everything else in R, we can assign these random values to a variable and do things with them. For example, here we collect a sample of 50 observations and compute the sample mean and standard deviation:
x <- rnorm(50, 100, 15)
# Sample mean
mean(x)
## [1] 103.95
# Standard deviation
sd(x)
## [1] 14.083
# standard error is sd/sqrt(n)
sd(x) / sqrt(length(x))
## [1] 1.9917
By default, the rnorm() function returns a vector instead of a data.frame. If we wish to plot the random data we create, we first have to put it in a data.frame:
## Sample data and put into data.frame
x <- rnorm(50, 100, 15)
df <- data.frame(x = x)
## Using gh() (assigned above) to make histograms that aren't dumb looking
ggplot(df, aes(x)) +
gh(bins = 10)
In this section, we briefly demonstrated how to generate random data in R. This involved using:

- the rnorm(n, mean, sd) function to specify the parameters
- mean(), sd(), and length() to find the mean, standard deviation, and standard error
- ggplot() to plot the generated data
Question 1: Generate 100 observations from the normal distribution with a mean value of 50 and a standard deviation of 20, then create a histogram showing the distribution of your sample.
Generating data using rnorm() is equivalent to collecting a sample from a population of infinite size. Further, each time we generate data, the observations are all completely different. What we need next is a way to sample data from an existing, finite population, and we do this with the function sample().
Here, for example, I take a sample of size 10 from the numbers 1 to 10
# This is my "population"
X <- 1:10
# Here, I take a sample of size 10
sample(X, size = 10)
## [1] 8 2 9 10 7 4 3 6 1 5
By default, the sample() function performs sampling without replacement, meaning that after a number has been sampled once, it cannot be sampled again. As such, sampling 10 numbers from the set 1:10 will give us the same 10 numbers in a different order. We can sample with replacement by passing an additional argument, replace = TRUE:
# Sample with replacement
sample(X, size = 10, replace = TRUE)
## [1] 6 1 8 8 1 1 8 7 2 9
Now I should see that some of my numbers are repeated in the sample. Sampling with replacement is the type of sampling that we will be using with the bootstrap.
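To make the repeats easy to spot, we can tabulate a resample with table(), which counts how many times each value appears (a small sketch; since no seed is set, the exact counts will vary from run to run):

```r
# This is my "population"
X <- 1:10
# Resample with replacement and count how often each value appears
s <- sample(X, size = 10, replace = TRUE)
table(s)
```

Any value with a count of 2 or more was sampled more than once, and any value missing from the table was not sampled at all.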
We can perform this kind of sampling on variables in a data.frame as well. Consider the trees dataset built into R, which contains measurements of felled cherry trees. The distribution of tree heights can be visualized simply enough:
ggplot(trees, aes(Height)) +
gh()
Here, I will sample tree height with replacement and again visualize it
## Sample tree heights
# (if I do not pass in `size = `, by default it will sample the same size)
th <- sample(trees$Height, replace = TRUE)
## Put sampled data in data.frame
df <- data.frame(Height = th)
## Create plot
ggplot(df, aes(Height)) +
gh() +
labs(title = "Distribution of Resampled Tree Height")
This is the equivalent of collecting one bootstrapped sample and plotting its distribution.
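Each such resample yields one value of whatever statistic we compute from it. For instance, a minimal sketch computing the mean of one resampled set of heights (using the built-in trees dataset, as above):

```r
## One bootstrapped sample of tree heights (same size as the original by default)
th <- sample(trees$Height, replace = TRUE)
## One bootstrapped statistic: the mean of that resample
mean(th)
```

Repeating this many times and collecting the means is exactly what the bootstrap does.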
In this section, we introduced the sample() function, as well as the argument replace = TRUE that allows us to sample with replacement. This will be important in the bootstrapping process.
Question 2: Using the faithful dataset built into R, create a histogram with 15 bins of the variable waiting, which indicates the amount of waiting time between eruptions. What do you notice about this graph and its distribution?
Collect one sample with replacement of the same size from this variable and create a second histogram. Do you see any of the same characteristics in the bootstrapped distribution as you do in the original? What do you notice?
The process of finding a confidence interval with point estimates involves two steps: finding the mean and standard error and putting them into our formula.
If we have a vector of data (i.e., not a data.frame), we can compute these quantities directly. Here, we use nhtemp, a dataset built into R that contains the mean annual temperature in Fahrenheit in New Haven, Connecticut, from 1912 to 1971:
## First compute the mean
(m <- mean(nhtemp))
## [1] 51.16
# The standard error is standard deviation / square root of sample size
(se <- sd(nhtemp) / sqrt(length(nhtemp)))
## [1] 0.16339
Note how we use the length() function to find the number of observations here. As our formula is
\[ \overline{x} \pm c \times SE \]
where \(c = 1, 2, 3\) for a 68%, 95%, or 99% CI, respectively, we can find our 95% confidence interval as such:
# Lower bound
m - 2*se
## [1] 50.833
# Upper bound
m + 2*se
## [1] 51.487
We find a 95% confidence interval of (50.83, 51.49). For those averse to typing, a shorthand method of finding this is:
m + c(-2, 2) * se
## [1] 50.833 51.487
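The same shorthand extends to the other confidence levels by changing the multiplier, with \(c = 1\) for 68% and \(c = 3\) for 99% (a sketch using the same nhtemp values computed above):

```r
# Mean and standard error for the built-in nhtemp data
m  <- mean(nhtemp)
se <- sd(nhtemp) / sqrt(length(nhtemp))
# Wider multipliers give wider (more confident) intervals
m + c(-1, 1) * se  # 68% CI
m + c(-2, 2) * se  # 95% CI
m + c(-3, 3) * se  # 99% CI
```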
For data that is already in a data.frame, we can find our summary
statistics using dplyr
. Here, for example, we estimate the
average number of murders per 100,000 residents in the United States
using state aggregated data compiled in 1973:
## Compute the mean and standard error using summarize()
summarize(USArrests,
meanMurder = mean(Murder),
seMurder = sd(Murder) / sqrt(n())) # Using n() for samp size
## meanMurder seMurder
## 1 7.788 0.61596
We can use these numbers again directly, here to compute a 95% confidence interval for the number of murders per 100,000 residents
7.78 + c(-2, 2)*0.615
## [1] 6.55 9.01
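Rather than retyping the rounded numbers, we can store the result of summarize() and compute the interval from it directly, which avoids rounding error (a small sketch reusing the same summarize() call from above):

```r
library(dplyr)
## Store the summary as a one-row data.frame
s <- summarize(USArrests,
               meanMurder = mean(Murder),
               seMurder = sd(Murder) / sqrt(n()))
## Point estimate +/- 2 standard errors, pulled straight from the summary
s$meanMurder + c(-2, 2) * s$seMurder
```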
We can think of the process of bootstrapping as following a simple algorithm:

1. Resample the original data with replacement, using a sample of the same size as the original.
2. Compute the statistic of interest (for example, the mean) on the resampled data.
3. Repeat steps 1 and 2 many times (say, 1,000), collecting the statistic each time.
4. Use the quantiles of the collected statistics to form a confidence interval.
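As a sketch, this algorithm corresponds to an explicit loop like the following, which is a manual version of what the bootstrap() function defined at the top of this lab does internally:

```r
x <- USArrests$Murder   # the data to bootstrap
B <- 1000               # number of bootstrapped samples
bs <- numeric(B)
for (i in 1:B) {
  resample <- sample(x, replace = TRUE)  # resample with replacement
  bs[i] <- mean(resample)                # compute the statistic
}
quantile(bs, probs = c(0.025, 0.975))    # middle 95% of the statistics
```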
To help us with this procedure, we will need a new function called bootstrap(), which I provided at the top of this lab. The bootstrap function takes three arguments: the data I want to bootstrap, the statistic I want to bootstrap, and the number of bootstrapped samples I wish to collect (by default, it will collect 1,000). We will illustrate its use again using the USArrests data to find a confidence interval for the number of per capita murders in the United States.
## Want to bootstrap the sample of murders, USArrests$Murder
mean_boot <- bootstrap(USArrests$Murder, mean)
head(mean_boot)
## Sample Statistic
## 1 1 7.736
## 2 2 7.544
## 3 3 8.546
## 4 4 8.572
## 5 5 8.308
## 6 6 8.492
head(mean_boot) prints out the first 6 rows of the data.frame returned by bootstrap(): we see that the first column indicates the sample number, while the second column indicates the sample statistic, in our case the mean. Because it is a data.frame, we can easily plot it:
ggplot(mean_boot, aes(Statistic)) +
gh(bins = 15)
Remember: this is a sampling distribution of the mean value. We resampled our data 1,000 times, computed the sample mean each time, and this is the distribution of values that were returned. From this point, we can use the quantile() function to return the quantiles that we want. Because we want the middle 95%, we need to take our quantiles at 2.5% and 97.5%:
quantile(mean_boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 6.5680 8.9603
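If we want a different confidence level, we only need to change the probs argument; for example, the 5% and 95% quantiles give a 90% interval. A sketch, with the resampling written inline via replicate() (equivalent to what bootstrap() does) and an arbitrary seed for reproducibility:

```r
set.seed(7)  # seed choice is arbitrary, just makes the result repeatable
bs <- replicate(1000, mean(sample(USArrests$Murder, replace = TRUE)))
quantile(bs, probs = c(0.05, 0.95))  # middle 90% of the bootstrapped means
```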
Question 3: Compare the confidence intervals created with the point estimate method and the bootstrapped quantile method.
Here, briefly, are worked examples from above to have a concise place to reference:
# Simulate a Population with random data
Pop <- rnorm(1000, mean = 100, sd = 15)
# Collect sample with replacement from data (here, of size 30)
samp <- sample(Pop, size = 30, replace = TRUE)
# Find mean and standard error using mean, sd, sqrt, and length
xbar <- mean(samp)
se <- sd(samp) / sqrt(length(samp))
# Compute PE +/- Margin of Error
xbar + c(-2, 2) * se
## [1] 94.981 106.072
# Do with bootstrap instead
bs <- bootstrap(samp, mean)
quantile(bs$Statistic, prob = c(0.025, 0.975))
## 2.5% 97.5%
## 95.413 106.202
The data below contains the total monthly precipitation in Grinnell, IA from February 2014 to February 2024 (121 total months). We will treat all of these 121 months as the population we are interested in.
## rain data
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
ggplot(rain, aes(precip)) +
gh(bins = 8) +
labs(x = "Precipitation (Inches)",
title = "Rain Precipitation 2014-2024\nGrinnell, IA")
## True mean
mean(rain$precip)
## [1] 2.0698
For this problem, we will assume that we have only collected data on 40 random months in this period, and our goal will be to estimate the true value (\(\mu = 2.069\)) given the sample that we have. In other words, we will be using rs (Rain Sample) as our observed sample dataset.
## Create a random index of months for our sample
# (copy these lines into your lab)
set.seed(123)
idx <- sample(1:nrow(rain), size = 40, replace = FALSE)
rs <- rain[idx, ]
- Using the sample() function, generate one sample with replacement from rs and create a histogram of the sample. How does this compare with the histogram in Part A?
- Using the bootstrap() function provided at the top of the lab, generate 1,000 bootstraps of the sample mean. Then, using the quantile() function, find an interval that contains 95% of the bootstrapped means. How does this interval compare with what you found in Part D?

The data below contains the results of a randomized experiment conducted by surgeons at Harvard Medical School to compare “low-flow bypass” and “circulatory arrest” surgical procedures in the treatment of infants born with congenital heart defects. The study’s recorded outcomes are the Psychomotor Development Index (PDI), a composite score measuring physiological development, and the Mental Development Index (MDI), a composite score measuring mental development; for both, higher scores indicate greater development.
heart <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
Our goal in this question is to try and determine if it is plausible that there are differences in PDI scores between the two treatment groups.
Using group_by() and summarize(), compute the mean value and the standard error of PDI for both treatment groups. (Hint: Use n() to determine the number of observations in each group.)