Include the code chunk below in your Rmd document; it contains several functions that we will be using for this lab, including the gh() function from the CLT worksheet to help us create better histograms in ggplot. Also included are the se() function for computing standard error and the bootstrap() function, which will be used in the second half of this lab.
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
## Standard Error
se <- function(x) sd(x) / sqrt(length(x))
## College dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")
For an illustration of how gh() is used, see:
## Default way
p1 <- ggplot(college, aes(ACT_median)) +
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)
## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)
## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)
There are a few R functions that will be helpful in constructing confidence intervals. In particular, we will begin with a family of functions responsible for finding quantiles. For each of these functions, we will need to provide the quantiles that we are interested in finding. Because we typically want our intervals in plus/minus form, we usually start with the coverage we are looking for and split the excluded probability evenly between the two tails. For example, to find the middle 95%, we know we are excluding 5% at the edges: 2.5% on the left side and 2.5% on the right side. This gives us the quantiles (0.025, 0.975). Similarly, if we wanted an 80% confidence interval, our quantiles should be (0.1, 0.9), and so on.
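This arithmetic is easy to sketch in R. The coverage_to_quantiles() helper below is hypothetical (it is not part of the lab code), included only to show the calculation:

```r
## Hypothetical helper (not part of the lab code): turn a coverage
## level into the pair of quantiles that brackets the middle of it
coverage_to_quantiles <- function(coverage) {
  alpha <- 1 - coverage        # total probability left in the tails
  c(alpha / 2, 1 - alpha / 2)  # split evenly between the two sides
}

coverage_to_quantiles(0.95)  ## 0.025 0.975
coverage_to_quantiles(0.80)  ## 0.10 0.90
```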
The distributional quantile functions are those that construct
distribution-specific quantiles (i.e., normal distribution,
t-distribution) based on the distributional parameters (recall
that distributional parameters are those that tell us everything we need
to know about a distribution). The first of these, qnorm(), finds the quantiles of a normal distribution (quantile of normal). By default, it assumes mean = 0 and sd = 1. Here, we find the quantiles associated with a 95% confidence interval for a standard normal:
## Quantiles for normal distribution
qnorm(p = c(0.025, 0.975))
## [1] -1.96 1.96
If I know the sample mean and standard error from my sample, I can plug these values in directly. Here, consider finding the quantiles based on the four year completion rate for males from our college dataset:
## Mean and se of my sampling distribution
mm <- mean(college$FourYearComp_Males)
ss <- se(college$FourYearComp_Males)
qnorm(p = c(0.025, 0.975), mean = mm, sd = ss)
## [1] 0.46645 0.48599
This is exactly what we would get by doing the formula by hand, where \(C = 1.96\):
\[ \overline{x} \pm C \frac{\hat{\sigma}}{\sqrt{n}} \]
## Point estimate +/- MOE method
mm + c(-1.96, 1.96) * ss
## [1] 0.46645 0.48599
A few notes to make here:

- Avoid using mean and se as variable names when they already exist as functions. It will work, but it may confuse R in some cases, providing unexpected results. Instead, use names like varMean and varSE.
- qnorm() with the default values will give us our critical values for the margin of error method.
- se() was defined in the starting code block, but it does not exist in base R. If you try to run it and it says 'could not find function "se"', you need to run the code containing the function above. This will be true for the homework as well.
- + c(-1.96, 1.96) above is a shortcut to avoid having to write out this expression twice.

In the example above, the confidence interval from qnorm() and that from the margin of error method were identical because they are both based on the assumption that the distribution is normal. In other words, they represent two ways of solving the exact same problem.
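We can verify this equivalence directly; the mm and ss values below are made up for illustration rather than taken from the college data:

```r
## Illustrative values standing in for a sample mean and standard error
mm <- 0.476
ss <- 0.005

## Quantiles using the distributional parameters directly
ci1 <- qnorm(p = c(0.025, 0.975), mean = mm, sd = ss)

## Shifting and scaling the standard-normal quantiles by hand
ci2 <- mm + qnorm(p = c(0.025, 0.975)) * ss

## The two approaches agree
all.equal(ci1, ci2)
## [1] TRUE
```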
Assuming that a distribution is exactly normal is typically of little consequence when the sample size is very large or when the distribution of our sample is nearly normal. When this is not the case, i.e., when \(n < 30\) or so, we need
to use qt()
to find the critical values
(quantile for t). The only
distributional parameter associated with the \(t\) distribution is the degrees of freedom,
\(n - 1\). We can find the critical
values for a t-distribution in the exact same way:
## df = 1094 because we have n = 1095 observations
qt(p = c(0.025, 0.975), df = 1095-1)
## [1] -1.9621 1.9621
Because the \(t\) distribution is based on the standardized values of our observations, there is no way to add a mean or standard deviation to the estimate. As such, confidence intervals with the t-distribution must be done using the \(\overline{x} \pm C \frac{\hat{\sigma}}{\sqrt{n}}\) method, where \(C\) represents the critical values from a t-distribution with \(n - 1\) degrees of freedom. To find the 95% confidence interval for the four year completion rate of males, we need to use the margin of error method:
## Constructing 95% CI using t-dist critical values
mm + c(-1.9621, 1.9621) * ss
## [1] 0.46644 0.48600
In this case, we see that our 95% CI from the t-distribution is very close to what we found with the normal distribution. This is because the size of our sample was \(n = 1095\) (the number of observations in the college dataset); as such, the normal distribution is a very close approximation to the true sampling distribution.
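To see how the sample size drives this, we can compare the t critical values at several degrees of freedom against the normal critical value (a quick sketch, not part of the lab code):

```r
## t critical values shrink toward the normal critical value as df grows
qt(p = 0.975, df = 5)     ## 2.571
qt(p = 0.975, df = 30)    ## 2.042
qt(p = 0.975, df = 1094)  ## 1.962
qnorm(p = 0.975)          ## 1.960
```

With only a handful of observations, the t critical value is noticeably larger, widening the interval to account for the extra uncertainty.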
Question 1 Using the qnorm()
function,
find the critical values associated with an 80% confidence interval. How
do these compare to the critical values of a 95% confidence interval?
Explain.
Question 2 This question will use the
hawks
data. The code below will load the hawks
dataset which we will use for the first parts of this problem. It will
also create hawks2
, a randomly sampled subset of the
data.
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
Subset the hawks
data using dplyr to include only
Red-tailed hawks (Species == "RT"
). How many observations
are in this dataset?
Create a histogram of the variable Weight
. What does
it look like? Based on your experience with the CLT lab, how well do you
think a normal approximation will fit the sampling
distribution?
Find the sample mean and standard error for the
Weight
variable. Use the qt()
function to find
the appropriate quantiles to create a 95% confidence interval for the
sample mean. Use these to construct a 95% confidence interval
Repeat the previous step, this time using the hawks2
subset. How does the sample mean in hawks2
compare with the
sample mean in hawks
? How does the size of the 95%
confidence interval compare? Does changing the size of the sample appear
to have more of an impact on the sample mean or on the confidence
interval?
## First subset Hawks data with RT
# hwk <- filter(hawks, ...)
## Create subset of size n = 20
set.seed(89)
idx <- sample(seq_len(nrow(hwk)), size = 20)
hawks2 <- hwk[idx, ]
Because we know everything we need to know about a distribution from its distributional parameters, there is no need to consider the data itself. However, when we don't know a distribution (or its parameters) but we have data, we can use the data itself to estimate the quantiles.
This is done with the quantile()
function in R.
To illustrate how this works, let's sample some random values with rnorm() (random normal), using the mean and standard error from the college dataset that we found above:
## Find mean and standard error
mm <- mean(college$FourYearComp_Males)
ss <- se(college$FourYearComp_Males)
## Generate 1000 normally distributed random numbers with that mean and se
rand <- rnorm(n = 1000, mean = mm, sd = ss)
## Find the quantiles of the rand data
quantile(x = rand, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.46711 0.48634
qnorm(c(0.025, 0.975), mean = mm, sd = ss)
## [1] 0.46645 0.48599
In this case, we should expect the quantiles to be pretty close: the observed data we have comes from a theoretical distribution that we know, so we should expect that their behavior would be the same. When we don’t know the underlying distribution however, or if we know that it’s not normal, this will be an efficient way to create estimates of what the quantiles should be.
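As a sketch of that last point, here is a simulated right-skewed sample (not the college data) where the empirical quantiles from quantile() and the quantiles implied by a normal assumption clearly disagree:

```r
## Simulated right-skewed data (all values are strictly positive)
set.seed(7)
skewed <- rexp(n = 1000, rate = 1)

## Empirical quantiles come straight from the data
quantile(x = skewed, probs = c(0.025, 0.975))

## Quantiles under a (wrong) normal assumption; note the negative
## lower bound, which is impossible for this strictly positive sample
qnorm(c(0.025, 0.975), mean = mean(skewed), sd = sd(skewed))
```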
Question 3 For this question, we are going to again
use the college
dataset
Create a histogram of the variable
FourYearComp_Males
, the four year graduation rate for
males
Find the mean and standard deviation of the variable
FourYearComp_Males
. Use the qnorm() function to find the 0.025 and 0.975 quantiles.
Using the quantile() function, find the 0.025 and 0.975 quantiles of the variable college$FourYearComp_Males. Would
you expect these to be similar to what you found in part (b)? Why or why
not?
Why are the quantiles you found in this problem so different from the ones we found in the example above using the same data? Be as specific as you can.
In this section, we will use the bootstrap() and quantile() functions together to construct confidence intervals directly from our data.
Bootstrapping, as we have learned in class, is an algorithmic method for constructing a sampling distribution based on a collected sample by repeatedly sampling with replacement from our original sample.
To visualize what is meant by this, consider the following example where we start with a collection of numbers from 1 to 10 and sample with replacement (note: you will not need to use the sample() function in your own work):
## My sample
x <- 1:10
## Sampling with replacement once
sample(x, replace = TRUE)
## [1] 3 3 10 2 6 5 4 6 9 10
## Doing it again
sample(x, replace = TRUE)
## [1] 5 3 9 9 9 3 8 10 7 10
As you can see, each iteration of sampling with replacement
contains values from the original sample (x = 1:10
), but
not necessarily all of them; in fact, some of the values are duplicated.
We can see that our first resample contains duplicates of 3, 6, and 10,
while our second resample contains duplicates of 3, 9, and 10. We call
these resampled samples our “bootstrapped samples.” The way
bootstrapping works is that we perform this resampling process many
times (typically 1,000 or more) and compute our statistic for each
bootstrapped sample. The resulting collection of 1,000 bootstrapped
statistics will serve as an empirical estimate of our sampling
distribution.
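As a sketch, the whole procedure can also be written by hand with replicate(); this mirrors what the lab's bootstrap() function does internally, here using the small sample x = 1:10:

```r
## Original sample
x <- 1:10

## Resample with replacement 1,000 times, recording the mean each time
set.seed(42)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

## The 1,000 bootstrapped means estimate the sampling distribution
length(boot_means)
mean(boot_means)  ## should be close to mean(x) = 5.5
```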
To illustrate how this looks with a dataset, consider the built-in R
data USArrests
containing arrest data per 100,000 residents
in each of the 50 US states in 1973. We can use the
sample()
function to collect a bootstrapped sample of
indices and use this to create a bootstrapped dataset
USA_boot
. Below, we see a plot of the distribution of the
original sample, along with the distribution of a single bootstrapped
sample.
## Resample the indices of USArrests
idx <- sample(1:nrow(USArrests), replace = TRUE)
## Subset with the new indices
USA_boot <- USArrests[idx, ]
p1 <- ggplot(USArrests, aes(Murder)) + gh(bins = 9) +
  ggtitle("Original Sample")
p2 <- ggplot(USA_boot, aes(Murder)) + gh(bins = 9) +
  ggtitle("Bootstrapped Sample #1")
## Compare original with bootstrapped sample
gridExtra::grid.arrange(p1, p2, nrow = 1)
In order to get our bootstrapped sampling distribution, we would
simply repeat this process many times, each time taking and recording
the mean value of Murder
in each of our bootstrapped
samples. Conveniently, we can use the bootstrap()
function,
defined at the beginning of the lab, to perform this process for us. The
bootstrap()
function takes two arguments: the variable we
wish to resample and the statistic we wish to bootstrap. By default,
this will replicate the process 1,000 times:
## Call the bootstrap function
boot <- bootstrap(USArrests$Murder, mean)
## It gives us back a data.frame with a sample mean for each Sample
head(boot)
## Sample Statistic
## 1 1 7.694
## 2 2 7.358
## 3 3 8.528
## 4 4 8.088
## 5 5 8.054
## 6 6 7.856
We can visualize this sampling distribution in ggplot:
## Visualize the sampling distribution
ggplot(boot, aes(Statistic)) + gh(bins = 12)
To be clear, the data.frame boot
contains the value of
1,000 bootstrapped means, effectively making this an estimate of our
sampling distribution. This means that, as our estimate, it can be used
to construct confidence intervals using the quantile()
function. In other words, instead of using any distributional
assumptions (i.e., normal or t-distribution), we can simply use the data
generated to find a 95% confidence interval:
## Estimate of 95% confidence interval
quantile(boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 6.6198 9.0520
Question 4 Based on the histogram of the bootstrapped sampling distribution above, do you think that the 95% confidence interval constructed with bootstrapping should match what we would find using the point estimate \(\pm\) margin of error method? Explain your answer.
Question 5 Verify using the qt()
function what you answered in Question 4.
Question 6 This question uses the Grinnell Rain dataset. Typically, this dataset includes precipitation data for 121 months; here, we will collect a sample of size \(n = 20\) instead:
## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")
## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]
Part A Using your sample rainsub and the qt() function, attempt to create an 80% confidence interval using the point estimate \(\pm\) margin of error method (i.e., median \(\pm C \times \hat{\sigma}/\sqrt{n}\))
Part B Use the bootstrap()
function to
bootstrap 1,000 samples of the median
statistic. With your
resulting data frame, create a histogram of the sampling distribution.
Based on this, does it seem like the confidence interval you found in
Part A is appropriate? Why or why not?
Part C Use the quantile()
function to
create an 80% confidence interval for the median. How does this compare
with what you found in Part A?
Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the interval you constructed in Part A? How about Part C? Why did it work for one and not the other?