Include the code chunk below in your Rmd document; it contains several functions that we will be using for this lab, including the gh() function from the CLT worksheet to help us create better histograms in ggplot. Also included are the se() function for computing standard error and the bootstrap() function, which will be used in the second half of this lab.
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
## Standard Error
se <- function(x) sd(x) / sqrt(length(x))
## College dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")
For an illustration of how gh() is used, see:
## Default way
p1 <- ggplot(college, aes(ACT_median)) +
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)
## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)
## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)
There are a few R functions that will be helpful in constructing confidence intervals. In particular, we will begin with a family of functions responsible for finding quantiles. For each of these functions, we will need to provide the quantiles that we are interested in finding. Because we typically want our intervals in plus/minus form, we usually start with the coverage we are looking for and split the excluded probability evenly between the two tails. For example, to find the middle 95%, we know we are excluding 5% at the edges: 2.5% on the left side and 2.5% on the right side. This gives us the quantiles (0.025, 0.975). Similarly, if we wanted an 80% confidence interval, our quantiles should be (0.1, 0.9), and so on.
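This arithmetic is easy to sketch in R. The coverage_to_quantiles() helper below is hypothetical (it is not part of the lab code), included only to show the calculation:

```r
## Hypothetical helper (not part of the lab code): turn a coverage
## level into the pair of quantiles that brackets the middle of it
coverage_to_quantiles <- function(coverage) {
  alpha <- 1 - coverage        # total probability left in the tails
  c(alpha / 2, 1 - alpha / 2)  # split evenly between the two sides
}

coverage_to_quantiles(0.95)  ## 0.025 0.975
coverage_to_quantiles(0.80)  ## 0.10 0.90
```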
The distributional quantile functions are those that construct
distribution-specific quantiles (i.e., normal distribution,
t-distribution) based on the distributional parameters (recall
that distributional parameters are those that tell us everything we need
to know about a distribution). The first of these, qnorm(), finds the quantiles of a normal distribution (quantile of normal). By default, it assumes mean = 0 and sd = 1. Here, we find the quantiles associated with a 95% confidence interval for a standard normal:
## Quantiles for normal distribution
qnorm(p = c(0.025, 0.975))
## [1] -1.96 1.96
If I know the sample mean and standard error from my sample, I can plug these values in directly. Here, consider finding the quantiles based on the four year completion rate for males from our college dataset:
## Mean and se of my sampling distribution
mm <- mean(college$FourYearComp_Males)
ss <- se(college$FourYearComp_Males)
qnorm(p = c(0.025, 0.975), mean = mm, sd = ss)
## [1] 0.46645 0.48599
This is exactly what we would get by doing the formula by hand, where \(C = 1.96\):
\[ \overline{x} \pm C \frac{\hat{\sigma}}{\sqrt{n}} \]
## Point estimate +/- MOE method
mm + c(-1.96, 1.96) * ss
## [1] 0.46645 0.48599
A few notes to make here:

- Avoid using mean and se as variable names when they already exist as functions. It will work, but it may confuse R in some cases, providing unexpected results. Instead, use names like varMean and varSE.
- qnorm() with the default values will give us our critical values for the margin of error method.
- se() was defined in the starting code block, but it does not exist in base R. If you try to run it and it says 'could not find function "se"', you need to run the code containing the function above. This will be true for the homework as well.
- + c(-1.96, 1.96) above is a shortcut to avoid having to write out this expression twice.

In the example above, the confidence interval from qnorm() and that from the margin of error method were identical because they are both based on the assumption that the distribution is normal. In other words, they represent two ways of solving the exact same problem.
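We can verify this equivalence directly; the mm and ss values below are made up for illustration rather than taken from the college data:

```r
## Illustrative values standing in for a sample mean and standard error
mm <- 0.476
ss <- 0.005

## Quantiles using the distributional parameters directly
ci1 <- qnorm(p = c(0.025, 0.975), mean = mm, sd = ss)

## Shifting and scaling the standard-normal quantiles by hand
ci2 <- mm + qnorm(p = c(0.025, 0.975)) * ss

## The two approaches agree
all.equal(ci1, ci2)
## [1] TRUE
```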
Assuming that a distribution is exactly normal is typically of little consequence when the sample size is very large or when the distribution of our sample is nearly normal. When this is not the case, i.e., when \(n < 30\) or so, we need
to use qt()
to find the critical values
(quantile for t). The only
distributional parameter associated with the \(t\) distribution is the degrees of freedom,
\(n - 1\). We can find the critical
values for a t-distribution in the exact same way:
## df = 1094 because we have n = 1095 observations
qt(p = c(0.025, 0.975), df = 1095-1)
## [1] -1.9621 1.9621
Because the \(t\) distribution is based on the standardized values of our observations, there is no way to add a mean or standard deviation to the estimate. As such, confidence intervals with the t-distribution must be done using the \(\overline{x} \pm C \frac{\hat{\sigma}}{\sqrt{n}}\) method, where \(C\) represents the critical values from a t-distribution with \(n - 1\) degrees of freedom. To find the 95% confidence interval for the four year completion rate of males, we need to use the margin of error method:
## Constructing 95% CI using t-dist critical values
mm + c(-1.9621, 1.9621) * ss
## [1] 0.46644 0.48600
In this case, we see that our 95% CI from the t-distribution is very close to what we found with the normal distribution. This is because the size of our sample was \(n = 1095\) (the number of observations in the college dataset); as such, the normal distribution is a very close approximation to the true sampling distribution.
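To see how the sample size drives this, we can compare the t critical values at several degrees of freedom against the normal critical value (a quick sketch, not part of the lab code):

```r
## t critical values shrink toward the normal critical value as df grows
qt(p = 0.975, df = 5)     ## 2.571
qt(p = 0.975, df = 30)    ## 2.042
qt(p = 0.975, df = 1094)  ## 1.962
qnorm(p = 0.975)          ## 1.960
```

With only a handful of observations, the t critical value is noticeably larger, widening the interval to account for the extra uncertainty.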
Question 1 Using the qnorm()
function,
find the critical values associated with an 80% confidence interval. How
do these compare to the critical values of a 95% confidence interval?
Explain.
Question 2 This question will use the
hawks
data. The code below will load the hawks
dataset which we will use for the first parts of this problem. It will
also create hawks2
, a randomly sampled subset of the
data.
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
Subset the hawks
data using dplyr to include only
Red-tailed hawks (Species == "RT"
). How many observations
are in this dataset?
Create a histogram of the variable Weight
. What does
it look like? Based on your experience with the CLT lab, how well do you
think a normal approximation will fit the sampling
distribution?
Find the sample mean and standard error for the
Weight
variable. Use the qt()
function to find
the appropriate quantiles to create a 95% confidence interval for the
sample mean. Use these to construct a 95% confidence interval
Repeat the previous step, this time using the hawks2
subset. How does the sample mean in hawks2
compare with the
sample mean in hawks
? How does the size of the 95%
confidence interval compare? Does changing the size of the sample appear
to have more of an impact on the sample mean or on the confidence
interval?
## First subset Hawks data with RT
# hwk <- filter(hawks, ...)
## Create subset of size n = 20
set.seed(89)
idx <- sample(seq_len(nrow(hwk)), size = 20)
hawks2 <- hwk[idx, ]
Because we know everything we need to know about a distribution from its distributional parameters, there is no need to consider the data itself. However, when we don't know a distribution (or its parameters) but we have data, we can use the data itself to estimate the quantiles.
This is done with the quantile()
function in R.
To illustrate how this works, let's sample some random values with rnorm() (random normal), using the mean and standard error from the college dataset that we found above:
## Find mean and standard error
mm <- mean(college$FourYearComp_Males)
ss <- se(college$FourYearComp_Males)
## Generate 1000 normally distributed random numbers with that mean and se
rand <- rnorm(n = 1000, mean = mm, sd = ss)
## Find the quantiles of the rand data
quantile(x = rand, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.46711 0.48634
qnorm(c(0.025, 0.975), mean = mm, sd = ss)
## [1] 0.46645 0.48599
In this case, we should expect the quantiles to be pretty close: the observed data we have comes from a theoretical distribution that we know, so we should expect that their behavior would be the same. When we don’t know the underlying distribution however, or if we know that it’s not normal, this will be an efficient way to create estimates of what the quantiles should be.
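As a sketch of that last point, here is a simulated right-skewed sample (not the college data) where the empirical quantiles from quantile() and the quantiles implied by a normal assumption clearly disagree:

```r
## Simulated right-skewed data (all values are strictly positive)
set.seed(7)
skewed <- rexp(n = 1000, rate = 1)

## Empirical quantiles come straight from the data
quantile(x = skewed, probs = c(0.025, 0.975))

## Quantiles under a (wrong) normal assumption; note the negative
## lower bound, which is impossible for this strictly positive sample
qnorm(c(0.025, 0.975), mean = mean(skewed), sd = sd(skewed))
```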
Question 3 For this question, we are going to again
use the college
dataset
Create a histogram of the variable
FourYearComp_Males
, the four year graduation rate for
males
Find the mean and standard deviation of the variable
FourYearComp_Males
. Use the qnorm() function to find the 0.025 and 0.975 quantiles.
Using the quantile() function, find the 0.025 and 0.975 quantiles of the variable college$FourYearComp_Males. Would
you expect these to be similar to what you found in part (b)? Why or why
not?
Why are the quantiles you found in this problem so different from the ones we found in the example above using the same data? Be as specific as you can.
In this section, we will use the bootstrap() and quantile() functions together to construct confidence intervals directly from our data.
Bootstrapping, as we have learned in class, is an algorithmic method for constructing a sampling distribution based on a collected sample by repeatedly sampling with replacement from our original sample.
To visualize what is meant by this, consider the following example where we start with a collection of numbers from 1 to 10 and sample with replacement (note: you will not need to use the sample() function in your own work):
## My sample
x <- 1:10
## Sampling with replacement once
sample(x, replace = TRUE)
## [1] 3 3 10 2 6 5 4 6 9 10
## Doing it again
sample(x, replace = TRUE)
## [1] 5 3 9 9 9 3 8 10 7 10
As you can see, each iteration of sampling with replacement
contains values from the original sample (x = 1:10
), but
not necessarily all of them; in fact, some of the values are duplicated.
We can see that our first resample contains duplicates of 3, 6, and 10,
while our second resample contains duplicates of 3, 9, and 10. We call
these resampled samples our “bootstrapped samples.” The way
bootstrapping works is that we perform this resampling process many
times (typically 1,000 or more) and compute our statistic for each
bootstrapped sample. The resulting collection of 1,000 bootstrapped
statistics will serve as an empirical estimate of our sampling
distribution.
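As a sketch, the whole procedure can also be written by hand with replicate(); this mirrors what the lab's bootstrap() function does internally, here using the small sample x = 1:10:

```r
## Original sample
x <- 1:10

## Resample with replacement 1,000 times, recording the mean each time
set.seed(42)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

## The 1,000 bootstrapped means estimate the sampling distribution
length(boot_means)
mean(boot_means)  ## should be close to mean(x) = 5.5
```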
To illustrate how this looks with a dataset, consider the built-in R
data USArrests
containing arrest data per 100,000 residents
in each of the 50 US states in 1973. We can use the
sample()
function to collect a bootstrapped sample of
indices and use this to create a bootstrapped dataset
USA_boot
. Below, we see a plot of the distribution of the
original sample, along with the distribution of a single bootstrapped
sample.
## Resample the indices of USArrests
idx <- sample(1:nrow(USArrests), replace = TRUE)
## Subset with the new indices
USA_boot <- USArrests[idx, ]
p1 <- ggplot(USArrests, aes(Murder)) + gh(bins = 9) +
  ggtitle("Original Sample")
p2 <- ggplot(USA_boot, aes(Murder)) + gh(bins = 9) +
  ggtitle("Bootstrapped Sample #1")
## Compare original with bootstrapped sample
gridExtra::grid.arrange(p1, p2, nrow = 1)
In order to get our bootstrapped sampling distribution, we would
simply repeat this process many times, each time taking and recording
the mean value of Murder
in each of our bootstrapped
samples. Conveniently, we can use the bootstrap()
function,
defined at the beginning of the lab, to perform this process for us. The
bootstrap()
function takes two arguments: the variable we
wish to resample and the statistic we wish to bootstrap. By default,
this will replicate the process 1,000 times:
## Call the bootstrap function
boot <- bootstrap(USArrests$Murder, mean)
## It gives us back a data.frame with a sample mean for each Sample
head(boot)
## Sample Statistic
## 1 1 7.694
## 2 2 7.358
## 3 3 8.528
## 4 4 8.088
## 5 5 8.054
## 6 6 7.856
We can visualize this sampling distribution in ggplot:
## Visualize the sampling distribution
ggplot(boot, aes(Statistic)) + gh(bins = 12)
To be clear, the data.frame boot
contains the value of
1,000 bootstrapped means, effectively making this an estimate of our
sampling distribution. This means that, as our estimate, it can be used
to construct confidence intervals using the quantile()
function. In other words, instead of using any distributional
assumptions (i.e., normal or t-distribution), we can simply use the data
generated to find a 95% confidence interval:
## Estimate of 95% confidence interval
quantile(boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 6.6198 9.0520
Question 4 Based on the histogram of the bootstrapped sampling distribution above, do you think that the 95% confidence interval constructed with bootstrapping should match what we would find using the point estimate \(\pm\) margin of error method? Explain your answer.
Question 5 Verify using the qt()
function what you answered in Question 4.
Question 6 This question uses the Grinnell Rain dataset. Typically, this dataset includes precipitation data for 121 months; here, we will collect a sample of size \(n = 20\) instead:
## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")
## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]
Part A Using your sample rainsub and the qt() function, attempt to create an 80% confidence interval using the point estimate \(\pm\) margin of error method (i.e., median \(\pm C \times \hat{\sigma}/\sqrt{n}\))
Part B Use the bootstrap()
function to
bootstrap 1,000 samples of the median
statistic. With your
resulting data frame, create a histogram of the sampling distribution.
Based on this, does it seem like the confidence interval you found in
Part A is appropriate? Why or why not?
Part C Use the quantile()
function to
create an 80% confidence interval for the median. How does this compare
with what you found in Part A?
Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the interval you constructed in Part A? How about Part C? Why did it work for one and not the other?