Below is a collection of functions that will be useful for this homework. Create a code chunk and copy and paste them into the top of your assignment. In particular, we will be using:
gh(), short for geom_histogram(), which creates better-looking graphs by default
bootstrap() to do our bootstrapping. I will illustrate how this function works in the next section
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrap Function
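bootstrap <- function(x, statistic, n = 1000L) {
  ## Repeat n times: resample x with replacement, then compute the statistic
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  ## One row per bootstrap sample
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}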
## College dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")
For an illustration of how gh() is used, see:
## Default way
p1 <- ggplot(college, aes(ACT_median)) +
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)
## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)
## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)
Just as we were able to use qt() and
qnorm() to get the quantiles from a named
distribution, we can use the quantile() function to
determine the quantiles from an actual sample. For example, consider
finding the 5th and 95th percentiles of a normal distribution with \(\mu = 50\) and \(\sigma = 15\):
## Quantiles from normal dist
qnorm(c(0.05, 0.95), mean = 50, sd = 15)
## [1] 25.327 74.673
Here, we input no data and use information from the distribution itself.
Alternatively, suppose that we had a vector x that
represented a sample of 1,000 observations from a normal distribution
with \(\mu = 50\) and \(\sigma = 15\) (the function
rnorm() stands for Random NORMal). We can use the quantile()
function to find the 5th and 95th percentiles of our sample, which should
be very similar to what we found above. Note that for
quantile(), the order of the arguments is flipped: the data vector
comes first, followed by the probabilities.
## Note that quantile() takes vector first, then quantiles
x <- rnorm(1000, mean = 50, sd = 15) # Randomly generate 1000 from this dist
quantile(x, probs = c(0.05, 0.95)) # Investigate the quantiles directly
## 5% 95%
## 23.273 74.122
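As a rough illustration of why the sample quantiles above are close to, but not exactly equal to, the theoretical ones, the sketch below repeats the comparison at a few sample sizes. The seed and the sample sizes are arbitrary choices made for this example and are not part of the assignment.

```r
## Compare sample quantiles to the theoretical quantiles as n grows
set.seed(123)  # arbitrary seed, only so the example is reproducible
theory <- qnorm(c(0.05, 0.95), mean = 50, sd = 15)

for (n in c(100, 1000, 100000)) {
  x <- rnorm(n, mean = 50, sd = 15)
  est <- quantile(x, probs = c(0.05, 0.95))
  ## Largest absolute gap between the sample and theoretical quantiles
  cat("n =", n, " max gap =", round(max(abs(est - theory)), 3), "\n")
}
```

The gap will typically shrink as the sample size increases, although any single run can deviate from that pattern.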
Recall that bootstrapping involves taking repeated samples with
replacement and computing statistics on each bootstrapped sample.
This process is performed for us with the bootstrap()
function provided above. bootstrap() takes three
arguments:
The vector or variable you wish to bootstrap
Which statistic you want to bootstrap (e.g., the mean, the median, the odds ratio)
How many bootstrap samples you want to perform. The default is 1000
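To make the resampling step concrete, here is a minimal sketch of a single bootstrap replicate, followed by the same step repeated many times. The vector toy and the seed are made up for illustration; the code uses only base R and mirrors what bootstrap() does internally rather than calling it.

```r
## One bootstrap replicate on a small made-up vector
set.seed(42)                             # arbitrary seed for reproducibility
toy <- c(3, 7, 8, 12, 15)                # hypothetical data, not from the assignment
resample <- sample(toy, replace = TRUE)  # same length as toy, values may repeat
resample
mean(resample)                           # the statistic for this one replicate

## Repeating those two steps many times gives the bootstrap distribution
reps <- replicate(200, mean(sample(toy, replace = TRUE)))
length(reps)                             # one value per replicate
```

Because sampling is done with replacement, some observations usually appear more than once in a resample and others not at all, which is what makes each replicate's statistic differ.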
Suppose I want to bootstrap the mean of the Enrollment
variable from the college dataset. To do this, I would
run:
## The syntax dataset$variable will extract the variable from a data.frame
boot_stats <- bootstrap(college$Enrollment, statistic = mean)
## The bootstrap() function returns a dataframe with a row for each bootstrap mean
head(boot_stats)
## Sample Statistic
## 1 1 6109.4
## 2 2 6259.2
## 3 3 6523.7
## 4 4 6141.8
## 5 5 6670.8
## 6 6 5878.8
Using the boot_stats$Statistic notation, I can find the
mean and quantiles from my bootstrapped sample:
## The mean
mean(boot_stats$Statistic)
## [1] 6242.2
## 90% confidence interval
quantile(boot_stats$Statistic, probs = c(0.05, 0.95))
## 5% 95%
## 5850.5 6639.7
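The probabilities passed to quantile() come from splitting the leftover probability evenly between the two tails. The sketch below shows that arithmetic for the 90% level used above; fake_stats is a made-up stand-in for boot_stats$Statistic so that the example runs on its own.

```r
## Turn a confidence level into the two tail probabilities for quantile()
level <- 0.90
alpha <- 1 - level
probs <- c(alpha / 2, 1 - alpha / 2)  # 0.05 and 0.95 for a 90% interval
probs

## Applied to made-up bootstrap statistics (stand-in for boot_stats$Statistic)
set.seed(7)                           # arbitrary seed for reproducibility
fake_stats <- rnorm(1000, mean = 100, sd = 5)
quantile(fake_stats, probs = probs)
```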
We will use this construct for solving Question 0.
(This is the only question that will use the bootstrap and quantile functions)
This question uses the Grinnell Rain dataset. The full dataset includes precipitation data for 121 months; here, we will draw a sample of size \(n = 20\) instead.
## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")
## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]
Part A Using your sample rainsub
and the qt() function, attempt to create an 80% confidence interval
for the median using the point estimate \(\pm\) margin of error method
(i.e., median \(\pm \ C \times \hat{\sigma}/\sqrt{n}\)).
Part B Use the bootstrap() function
to bootstrap 1,000 samples of the median statistic. With
your resulting data frame, create a histogram of the sampling
distribution. Based on this, does it seem like the confidence interval
you found in Part A is appropriate? Why or why not?
Part C Use the quantile() function
to create an 80% confidence interval for the median. How does this
compare with what you found in Part A?
Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the interval you constructed in Part A? How about the interval from Part C? If it fell within one interval and not the other, why?
Suppose that an investigator sets out to test 200 null hypotheses where exactly half of them are true and half of them are not. Additionally, suppose the tests have a Type I error rate of 5% and a Type II error rate of 20%.
Out of the 200 hypothesis tests carried out, how many should we expect to be Type I errors?
How many would be Type II errors?
Of the 200 tests, how many times would the investigator correctly fail to reject the null hypothesis?
Out of all of the tests in which the null hypothesis was rejected, for what percentage was the null hypothesis actually true?
Determine if the following statements are true or false. If they are false, state how they could be corrected.
If a given test statistic is within a 95% confidence interval, it is also within a 99% confidence interval
Decreasing the value of \(\alpha\) will increase the probability of a Type I error
Suppose the null hypothesis for a proportion is \(H_0: p = 0.5\) and we fail to reject. In this case, the true population proportion is equal to 0.5
With large sample sizes, even small differences between the null and observed values can be identified as statistically significant.
A food safety inspector is called upon to investigate a restaurant with a few customer reports of poor sanitation practices. The food safety inspector uses a hypothesis testing framework to evaluate whether regulations are not being met. If he decides the restaurant is in gross violation, its license to serve food will be revoked.
Write in words the null hypothesis
What is a Type I error in this context?
What is a Type II error in this context?
Which error type is more problematic for the restaurant owner? Why?
Which error is more problematic for diners? Why?
As a diner, would you prefer that the food safety inspector requires strong evidence or very strong evidence of health concerns before revoking a restaurant’s license? Explain your reasoning.