library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

## Standard Error
se <- function(x) sd(x) / sqrt(length(x))

Question 1

Below is the equation used to construct confidence intervals using critical values:

\[ \overline{x} \pm C \times \frac{\hat{\sigma}}{\sqrt{n}} \] Explain what impact each term has on the size and location of confidence intervals and explain how the sample size impacts the critical values for a given confidence interval.

Question 2

As we saw in class, the Central Limit Theorem tells us that, for a population with mean \(\mu\) and standard deviation \(\sigma\), the sample mean will follow an approximately normal distribution with

\[ \overline{x} \sim N \left( \mu, \frac{\sigma}{\sqrt{n}} \right) \] In fact, the CLT also applies to our estimates of proportions which, as can be seen, are a type of mean. For example, consider ten coin flips in which four of them result in heads. We understand this proportion to be

\[ \hat{p} = \frac{4}{10} = 0.4 \] If we were to record these values as 1s and 0s instead and take the mean, we would find the same

## Ten flips, four heads
flips <- c(0, 1, 1, 0, 0, 0, 1, 0, 0, 1)
mean(flips)
## [1] 0.4

In fact, there is a special formula for the distribution of a proportion, \(p\), based on the calculation for its variance. It can be shown that, for “large enough” \(n\), the CLT tells us that the distribution of a proportion is given as

\[ \hat{p} \sim N \left( p, \ \ \sqrt{\frac{p(1-p)}{n}} \right) \] Given a sample proportion, we can use the CLT to construct a confidence interval for the proportion, just as we did for the sample mean. Use this information to answer the following problem:


In a study conducted by Johns Hopkins University researchers investigated the survival of babies born prematurely. They searched their hospital’s medical records and found that of 39 babies born at 25 weeks gestation (15 weeks early), 31 of these babies went on to survive at least 6 months.

Part A Using a normal approximation, construct a 95% confidence interval to estimate the true proportion of babies born at 25 weeks that are expected to survive at least 6 months.

Part B An article on Wikipedia suggests that of babies born at 25 weeks, 72% are expected to survive. Is this estimate consistent with what we found in Part A?

Question 3

The Center for Disease Control conducted a study on 6,168 women in hopes of finding factors related to breast cancer. In particular, they considered at which age a woman gave birth to her first child and computed the following contingency table:

##      Cancer
## Age     No  Yes
##   <25 4475   65
##   25+ 1597   31
###############################
### COPY AND RUN FOR PART C ###
###############################

## Here is data.frame containing data, 
df <- expand.grid(Cancer = c("Yes", "No"), 
                  Age = c("<25", "25+"))
df <- df[rep(1:4, times = c(4475, 65, 1597, 31)), ]


bootstrap_or <- function(x, n = 1000L) {
  bs <- replicate(n, {
    # Randomly select rows of data frame
    idx <- sample(nrow(x), replace = TRUE)
    samp_df <- x[idx, ]
    tt <- with(samp_df, table(Age, Cancer))
    (tt[1] * tt[4]) / (tt[2] * tt[3])
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

bs <- bootstrap_or(df)

Question 4

This question will involve the Grinnell rainfall data from 2014-2024, with a sample size collected from this data of size \(n = 25\)

rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")

# replace = FALSE because this is our sample, not a bootstrap estimate
set.seed(1)
idx <- sample(nrow(rain), size = 40, replace = FALSE)

# rs = rain sample
rs <- rain[idx, ]

\[ CR = \frac{\text{mean}}{\text{median}} \]

If the mean value is greater than the median, we expect to find \(CR > 1\), and if the median is greater than the mean, we should expect \(CR < 1\); \(CR = 1\) when the mean and median are equal. Find the centrality ratio of our sample data. Does there appear to be compelling evidence that our data are skewed? If so, in which direction?

mm_ratio <- function(x) mean(x) / median(x)

mm_boot <- bootstrap(rs$precip, mm_ratio)