Introduction

Below is a collection of functions that will be useful for this homework. Create a code chunk and paste them at the top of your assignment. In particular, we will be using:

  • gh(), short for geom_histogram(), which creates better-looking graphs by default
  • bootstrap(), which does our bootstrapping. I will illustrate how this function works in the Bootstrapping section below

library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

## College dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")

For an illustration of how gh() is used, see:

## Default way
p1 <- ggplot(college, aes(ACT_median)) + 
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)

## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)

## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)

Quantile Function

Just as we were able to use qt() and qnorm() to get the quantiles from a named distribution, we can use the quantile() function to determine the quantiles from an actual sample. For example, consider finding the 5th and 95th quantiles of a normal distribution with \(\mu = 50\) and \(\sigma = 15\):

## Quantiles from normal dist
qnorm(c(0.05, 0.95), mean = 50, sd = 15)
## [1] 25.327 74.673

Here, we input no data and use information from the distribution itself.

Alternatively, suppose that we had a vector x that represented a sample of 1,000 observations from a normal distribution with \(\mu = 50\) and \(\sigma = 15\) (the function rnorm() stands for Random NORMal). We can use the quantile() function to find the 5th and 95th quantiles of our sample, which should be very similar to what we found above. Note that quantile() takes the data vector as its first argument and the probabilities as its second.

## Note that quantile() takes vector first, then quantiles
x <- rnorm(1000, mean = 50, sd = 15) # Randomly generate 1000 from this dist
quantile(x, probs = c(0.05, 0.95)) # Investigate the quantiles directly
##     5%    95% 
## 23.273 74.122
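
As a rough sketch of why the two sets of numbers are "very similar" rather than identical, the chunk below compares sample quantiles against the theoretical ones for a small and a large sample. The seed and the sample sizes (50 and 100,000) are arbitrary choices made for illustration, not part of the assignment.

```r
## Theoretical quantiles of a normal distribution with mean 50 and sd 15
probs <- c(0.05, 0.95)
theoretical <- qnorm(probs, mean = 50, sd = 15)

## Sample quantiles from a small and a large random sample
set.seed(1)  # arbitrary seed so the results are reproducible
small <- quantile(rnorm(50, mean = 50, sd = 15), probs = probs)
large <- quantile(rnorm(100000, mean = 50, sd = 15), probs = probs)

## Absolute gap between each sample quantile and its theoretical value
abs(small - theoretical)
abs(large - theoretical)
```

Larger samples typically give sample quantiles that sit closer to the theoretical values, although any single small sample can land close by chance.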

Bootstrapping

Recall that bootstrapping involves taking repeated samples with replacement and computing statistics on each bootstrapped sample. This process is performed for us with the bootstrap() function provided above. bootstrap() takes three arguments:

  1. The vector or variable you wish to bootstrap (x)

  2. The statistic you want to bootstrap (statistic), passed as a function (e.g., mean or median)

  3. The number of bootstrap samples to draw (n). The default is 1000
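
To make the three arguments concrete, here is a minimal sketch that first draws a single bootstrap replicate by hand and then calls bootstrap() with a non-default n. The toy vector and the seed are made up for illustration, and the function definition is repeated from the Introduction only so the chunk runs on its own.

```r
## Toy data, made up for illustration (not from the college dataset)
x <- c(2, 4, 4, 7, 9, 12)

## One bootstrap replicate by hand
set.seed(42)                           # arbitrary seed for reproducibility
resample <- sample(x, replace = TRUE)  # same size as x, values may repeat
one_stat <- median(resample)           # statistic for this single resample

## bootstrap() as defined in the Introduction, repeated so this chunk runs alone
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

## The same idea repeated 200 times, overriding the default n
boot_med <- bootstrap(x, statistic = median, n = 200)
dim(boot_med)  # 200 rows, 2 columns
```

Note that statistic is passed as a bare function name (median, not median()), since bootstrap() calls it on each resample internally.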

Suppose I want to bootstrap the mean of the Enrollment variable from the college dataset. To do this, I would do:

## The syntax dataset$variable will extract the variable from a data.frame
boot_stats <- bootstrap(college$Enrollment, statistic = mean)

## The bootstrap() function returns a dataframe with a row for each bootstrap mean
head(boot_stats)
##   Sample Statistic
## 1      1    6109.4
## 2      2    6259.2
## 3      3    6523.7
## 4      4    6141.8
## 5      5    6670.8
## 6      6    5878.8

Using the boot_stats$Statistic notation, I can find the mean and quantiles from my bootstrapped sample:

## The mean
mean(boot_stats$Statistic)
## [1] 6242.2
## 90% confidence interval
quantile(boot_stats$Statistic, probs = c(0.05, 0.95))
##     5%    95% 
## 5850.5 6639.7
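
The probabilities passed to quantile() come from the confidence level: a 90% interval leaves 5% in each tail, hence c(0.05, 0.95). The sketch below wraps that arithmetic in a small helper. The name percentile_ci() is hypothetical (it is not one of the homework functions), and the input is simulated noise standing in for a column such as boot_stats$Statistic.

```r
## Hypothetical helper: percentile interval at a given confidence level
percentile_ci <- function(stats, level = 0.90) {
  alpha <- 1 - level  # total probability left in the two tails
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}

## Simulated stand-in for a vector of bootstrap statistics (not real output)
set.seed(7)  # arbitrary seed
fake_stats <- rnorm(1000)

ci90 <- percentile_ci(fake_stats)                # same as probs = c(0.05, 0.95)
ci95 <- percentile_ci(fake_stats, level = 0.95)  # uses probs = c(0.025, 0.975)
```

A higher confidence level pushes the cutoffs further into the tails, so the 95% interval is at least as wide as the 90% one.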

We will use this construct for solving Question 0.

Problems

Question 0 (Bootstrapping)

(This is the only question that will use the bootstrap and quantile functions)

This question uses the Grinnell Rain dataset. The full dataset includes precipitation data for 121 months; here, we will draw a sample of size \(n = 20\) instead.

## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")

## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]

  • Part A Using your sample rainsub and the qt() function, attempt to create an 80% confidence interval for the median using the point estimate \(\pm\) margin of error method (i.e., median \(\pm \ C \times \hat{\sigma}/\sqrt{n}\))

  • Part B Use the bootstrap() function to bootstrap 1,000 samples of the median statistic. With your resulting data frame, create a histogram of the sampling distribution. Based on this, does it seem like the confidence interval you found in Part A is appropriate? Why or why not?

  • Part C Use the quantile() function to create an 80% confidence interval for the median. How does this compare with what you found in Part A?

  • Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the interval you constructed in Part A? How about the one from Part C? Why did it work for one and not the other?

Question 1

Suppose that an investigator sets out to test 200 null hypotheses where exactly half of them are true and half of them are not. Additionally, suppose the tests have a Type I error rate of 5% and a Type II error rate of 20%.

  1. Out of the 200 hypothesis tests carried out, how many should we expect to be Type I errors?

  2. How many would be Type II errors?

  3. Of the 200 tests, how many times would the investigator correctly fail to reject the null hypothesis?

  4. Out of all of the tests in which the null hypothesis was rejected, for what percentage was the null hypothesis actually true?

Question 2

Determine if the following statements are true or false. If they are false, state how they could be corrected.

  1. If a given test statistic is within a 95% confidence interval, it is also within a 99% confidence interval.

  2. Decreasing the value of \(\alpha\) will increase the probability of a Type I error.

  3. Suppose the null hypothesis for a proportion is \(H_0: p = 0.5\) and we fail to reject. In this case, the true population proportion is equal to 0.5.

  4. With large sample sizes, even small differences between the null and observed values can be identified as statistically significant.

Question 3

A food safety inspector is called upon to investigate a restaurant with a few customer reports of poor sanitation practices. The food safety inspector uses a hypothesis testing framework to evaluate whether regulations are not being met. If he decides the restaurant is in gross violation, its license to serve food will be revoked.

  1. Write the null hypothesis in words.

  2. What is a Type I error in this context?

  3. What is a Type II error in this context?

  4. Which error type is more problematic for the restaurant owner? Why?

  5. Which error is more problematic for diners? Why?

  6. As a diner, would you prefer that the food safety inspector requires strong evidence or very strong evidence of health concerns before revoking a restaurant’s license? Explain your reasoning.