library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
This lab is intended to serve as an introduction to the construction of confidence intervals in R. In particular, we will be considering two methods by which we can construct confidence intervals: the point estimate method, using the formula \(\overline{x} \pm c \times SE\), and the bootstrap quantile method.
Before we get into any exercises, let’s take a little bit of time to introduce some new functions and techniques that we will be using to answer the questions.
Helpful Note: This lab is written to be self-contained, meaning that everything that I will ask of you will be introduced in this lab. There is absolutely no expectation that you memorize everything, and it is intended that you will reference these practice sections as needed for the lab and homework.
The two exceptions to this are the use of ggplot() and the use of dplyr functions, which were introduced in the previous lab. Extra guidance will be provided for questions involving dplyr.
Often we will try and use real data for our problems, but it can also be instructive to generate random data. Not only does this permit us to generate as much data as we wish, it has the added bonus that we are able to know exactly the true values of our population.
For example, we can generate random data that is normally distributed with the function rnorm() (“RandomNORMal”). The rnorm() function takes three arguments: n, indicating the sample size; mean, indicating the mean; and sd, indicating the standard deviation. If I wanted to generate 5 random values from a normal distribution with mean 100 and standard deviation 15, I would do:
rnorm(n = 5, mean = 100, sd = 15)
## [1] 114.070 84.846 88.888 80.279 77.416
As before, we don’t have to name the arguments and this will work just as well:
rnorm(5, 100, 15)
## [1] 92.109 100.911 107.750 102.459 70.320
You’ll note that each time we generate random data, it gives us something, well, random. We may find ourselves in situations where we want to be able to recreate exactly a random process. We can do this with the function set.seed():
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.977 101.993 99.489 99.325 150.068
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.977 101.993 99.489 99.325 150.068
As with everything else in R, we can assign these random values to a variable and do things with them. For example, here we collect a sample of 50 observations and compute the sample mean and standard deviation:
x <- rnorm(50, 100, 15)
# Sample mean
mean(x)
## [1] 103.95
# Standard deviation
sd(x)
## [1] 14.083
# standard error is sd/sqrt(n)
sd(x) / sqrt(length(x))
## [1] 1.9917
By default, the rnorm() function returns a vector instead of a data.frame. If we wish to plot the random data we create, we first have to put it in a data.frame:
## Sample data and put into data.frame
x <- rnorm(50, 100, 15)
df <- data.frame(x = x)
## Using gh() (assigned above) to make histograms that aren't dumb looking
ggplot(df, aes(x)) +
gh(bins = 10)
In this section, we briefly demonstrated how to generate random data in R. This involved using:

- the rnorm(n, mean, sd) function to specify the parameters
- mean(), sd(), and length() to find the mean, standard deviation, and standard error
- ggplot() to plot the generated data
Question 1: Generate 100 observations from the normal distribution with a mean value of 50 and a standard deviation of 20, then create a histogram showing the distribution of your sample.
Generating data using rnorm() is equivalent to collecting a sample from a population of infinite size. Further, each time we generate data, the observations are all completely different. What we need next is a way to sample data from an existing, finite population, and we do this with the function sample().
Here, for example, I take a sample of size 10 from the numbers 1 to 10
# This is my "population"
X <- 1:10
# Here, I take a sample of size 10
sample(X, size = 10)
## [1] 8 2 9 10 7 4 3 6 1 5
By default, the sample() function performs sampling without replacement, meaning that after a number has been sampled once, it cannot be sampled again. As such, sampling 10 numbers from the set 1:10 will give us the same 10 numbers in a different order. We can sample with replacement by passing an additional argument, replace = TRUE:
# Sample with replacement
sample(X, size = 10, replace = TRUE)
## [1] 6 1 8 8 1 1 8 7 2 9
Now I should see that some of my numbers are repeated in the sample. Sampling with replacement is the type of sampling that we will be using with the bootstrap.
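To make the repeats easy to spot, we can tabulate a resample with table(), which counts how many times each value appears (a small sketch; since no seed is set, the exact counts will vary from run to run):

```r
# This is my "population"
X <- 1:10
# Resample with replacement and count how often each value appears
s <- sample(X, size = 10, replace = TRUE)
table(s)
```

Any value with a count of 2 or more was sampled more than once, and any value missing from the table was not sampled at all.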
We can perform this kind of sampling on variables in a data.frame as well. Consider the trees dataset built into R, which contains measurements of felled cherry trees. The distribution of tree heights can be visualized simply enough:
ggplot(trees, aes(Height)) +
gh()
Here, I will sample tree height with replacement and again visualize it
## Sample tree heights
# (if I do not pass in `size = `, by default it will sample the same size)
th <- sample(trees$Height, replace = TRUE)
## Put sampled data in data.frame
df <- data.frame(Height = th)
## Create plot
ggplot(df, aes(Height)) +
gh() +
labs(title = "Distribution of Resampled Tree Height")
This is the equivalent of collecting one bootstrapped sample and plotting its distribution.
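Each such resample yields one value of whatever statistic we compute from it. For instance, a minimal sketch computing the mean of one resampled set of heights (using the built-in trees dataset, as above):

```r
## One bootstrapped sample of tree heights (same size as the original by default)
th <- sample(trees$Height, replace = TRUE)
## One bootstrapped statistic: the mean of that resample
mean(th)
```

Repeating this many times and collecting the means is exactly what the bootstrap does.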
In this section, we introduced the sample() function, as well as the argument replace = TRUE that allows us to sample with replacement. This will be important in the bootstrapping process.
Question 2: Using the faithful dataset built into R, create a histogram with 15 bins of the variable waiting, which indicates the amount of waiting time between eruptions. What do you notice about this graph and its distribution?
Collect one sample with replacement of the same size from this variable and create a second histogram. Do you see any of the same characteristics in the bootstrapped distribution as you do in the original? What do you notice?
The process of finding a confidence interval with point estimates involves two steps: finding the mean and standard error and putting them into our formula.
If we have a vector of data (i.e., not a data.frame), we can compute these quantities directly. Here, we use nhtemp, a dataset built into R that contains the mean annual temperature in Fahrenheit in New Haven, Connecticut, from 1912 to 1971:
## First compute the mean
(m <- mean(nhtemp))
## [1] 51.16
# The standard error is standard deviation / square root of sample size
(se <- sd(nhtemp) / sqrt(length(nhtemp)))
## [1] 0.16339
Note how we use the length() function to find the number of observations here. As our formula is
\[ \overline{x} \pm c \times SE \]
where \(c = 1, 2, 3\) for a 68%, 95%, or 99% CI, respectively, we can find our 95% confidence interval as such:
# Lower bound
m - 2*se
## [1] 50.833
# Upper bound
m + 2*se
## [1] 51.487
We find a 95% confidence interval of (50.83, 51.49). For those averse to typing, a shorthand method of finding this is:
m + c(-2, 2) * se
## [1] 50.833 51.487
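The same shorthand extends to the other confidence levels by changing the multiplier, with \(c = 1\) for 68% and \(c = 3\) for 99% (a sketch using the same nhtemp values computed above):

```r
# Mean and standard error for the built-in nhtemp data
m  <- mean(nhtemp)
se <- sd(nhtemp) / sqrt(length(nhtemp))
# Wider multipliers give wider (more confident) intervals
m + c(-1, 1) * se  # 68% CI
m + c(-2, 2) * se  # 95% CI
m + c(-3, 3) * se  # 99% CI
```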
For data that is already in a data.frame, we can find our summary
statistics using dplyr
. Here, for example, we estimate the
average number of murders per 100,000 residents in the United States
using state aggregated data compiled in 1973:
## Compute the mean and standard error using summarize()
summarize(USArrests,
meanMurder = mean(Murder),
seMurder = sd(Murder) / sqrt(n())) # Using n() for samp size
## meanMurder seMurder
## 1 7.788 0.61596
We can use these numbers again directly, here to compute a 95% confidence interval for the number of murders per 100,000 residents
7.78 + c(-2, 2)*0.615
## [1] 6.55 9.01
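Rather than retyping the rounded numbers, we can store the result of summarize() and compute the interval from it directly, which avoids rounding error (a small sketch reusing the same summarize() call from above):

```r
library(dplyr)
## Store the summary as a one-row data.frame
s <- summarize(USArrests,
               meanMurder = mean(Murder),
               seMurder = sd(Murder) / sqrt(n()))
## Point estimate +/- 2 standard errors, pulled straight from the summary
s$meanMurder + c(-2, 2) * s$seMurder
```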
We can think of the process of bootstrapping as following a simple algorithm:

1. Resample the original data with replacement, using a sample of the same size as the original.
2. Compute the statistic of interest (for example, the mean) on the resampled data.
3. Repeat steps 1 and 2 many times (say, 1,000), collecting the statistic each time.
4. Use the quantiles of the collected statistics to form a confidence interval.
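As a sketch, this algorithm corresponds to an explicit loop like the following, which is a manual version of what the bootstrap() function defined at the top of this lab does internally:

```r
x <- USArrests$Murder   # the data to bootstrap
B <- 1000               # number of bootstrapped samples
bs <- numeric(B)
for (i in 1:B) {
  resample <- sample(x, replace = TRUE)  # resample with replacement
  bs[i] <- mean(resample)                # compute the statistic
}
quantile(bs, probs = c(0.025, 0.975))    # middle 95% of the statistics
```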
To help us with this procedure, we will need a new function called bootstrap(), which I provided at the top of this lab. The bootstrap function takes three arguments: the data I want to bootstrap, the statistic I want to bootstrap, and the number of bootstrapped samples I wish to collect (by default, it will collect 1,000). We will illustrate its use again using the USArrests data to find a confidence interval for the number of per capita murders in the United States.
## Want to bootstrap the sample of murders, USArrests$Murder
mean_boot <- bootstrap(USArrests$Murder, mean)
head(mean_boot)
## Sample Statistic
## 1 1 7.736
## 2 2 7.544
## 3 3 8.546
## 4 4 8.572
## 5 5 8.308
## 6 6 8.492
head(mean_boot) prints out the first 6 rows of the data.frame returned by bootstrap(): we see that the first column indicates the sample number, while the second column indicates the sample statistic, in our case the mean. Because it is a data.frame, we can easily plot it:
ggplot(mean_boot, aes(Statistic)) +
gh(bins = 15)
Remember: this is a sampling distribution of the mean value. We resampled our data 1,000 times, computed the sample mean each time, and this is the distribution of values that were returned. From this point, we can use the quantile() function to return the quantiles that we want. Because we want the middle 95%, we need to take our quantiles at 2.5% and 97.5%:
quantile(mean_boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 6.5680 8.9603
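If we want a different confidence level, we only need to change the probs argument; for example, the 5% and 95% quantiles give a 90% interval. A sketch, with the resampling written inline via replicate() (equivalent to what bootstrap() does) and an arbitrary seed for reproducibility:

```r
set.seed(7)  # seed choice is arbitrary, just makes the result repeatable
bs <- replicate(1000, mean(sample(USArrests$Murder, replace = TRUE)))
quantile(bs, probs = c(0.05, 0.95))  # middle 90% of the bootstrapped means
```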
Question 3: Compare the confidence intervals created with the point estimate method and the bootstrapped quantile method.
Here, briefly, are worked examples from above to have a concise place to reference:
# Simulate a Population with random data
Pop <- rnorm(1000, mean = 100, sd = 15)
# Collect sample with replacement from data (here, of size 30)
samp <- sample(Pop, size = 30, replace = TRUE)
# Find mean and standard error using mean, sd, sqrt, and length
xbar <- mean(samp)
se <- sd(samp) / sqrt(length(samp))
# Compute PE +/- Margin of Error
xbar + c(-2, 2) * se
## [1] 94.981 106.072
# Do with bootstrap instead
bs <- bootstrap(samp, mean)
quantile(bs$Statistic, prob = c(0.025, 0.975))
## 2.5% 97.5%
## 95.413 106.202
The data below contains the total monthly precipitation in Grinnell, IA from February 2014 to February 2024 (121 total months). We will treat all of these 121 months as the population we are interested in.
## rain data
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
ggplot(rain, aes(precip)) +
gh(bins = 8) +
labs(x = "Precipitation (Inches)",
title = "Rain Precipitation 2014-2024\nGrinnell, IA")
## True mean
mean(rain$precip)
## [1] 2.0698
For this problem, we will assume that we have only collected data on 40 random months in this period, and our goal will be to estimate the true value (\(\mu = 2.069\)) given the sample that we have. In other words, we will be using rs (Rain Sample) as our observed sample dataset.
## Create a random index of months for our sample
# (copy these lines into your lab)
set.seed(123)
idx <- sample(1:nrow(rain), size = 40, replace = FALSE)
rs <- rain[idx, ]
- Using the sample() function, generate one sample with replacement from rs and create a histogram of the sample. How does this compare with the histogram in Part A?
- Using the bootstrap() function provided at the top of the lab, generate 1,000 bootstraps of the sample mean. Then, using the quantile() function, find an interval that contains 95% of the bootstrapped means. How does this interval compare with what you found in Part D?

The data below contains the results of a randomized experiment conducted by surgeons at Harvard Medical School to compare “low-flow bypass” and “circulatory arrest” surgical procedures in the treatment of infants born with congenital heart defects. The study’s recorded outcomes are the Psychomotor Development Index (PDI), a composite score measuring physiological development, and the Mental Development Index (MDI), a composite score measuring mental development; for both, higher scores indicate greater development.
heart <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
Our goal in this question is to try and determine if it is plausible that there are differences in PDI scores between the two treatment groups.
Using group_by() and summarize(), compute the mean value and the standard error of PDI for both treatment groups. (Hint: Use n() to determine the number of observations in each group.)