library(ggplot2)
library(dplyr)

# Prettier graphs
theme_set(theme_bw())

## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

This homework contains several sections with included code that is intended to be run, either for loading functions or creating samples or subsets from the data. I encourage you to try and understand what it is doing as this may shed light on the underlying processes involved.

Question 1

A network operator has compiled data demonstrating the number of users connected to the internet through a server being accessed each minute, which we can access built-in from R with the command:

data(WWWusage)

Planning upgrades for a variety of possible future scenarios, you have been asked to perform an analysis to determine how many users we should be expecting per minute, on average.

## Capped data
capped <- WWWusage

## Values outside of this range are censored
capped[capped < 100] <- 100
capped[capped > 180] <- 180
cleaned <- WWWusage

## Remove 'incorrect' values
cleaned <- cleaned[cleaned > 100 & cleaned < 180]

Question 2

The Center for Disease Control conducted a study on 6,168 women in hopes of finding factors related to breast cancer. In particular, they considered at which age a woman gave birth to her first child and computed the following contingency table:

##      Cancer
## Age    Yes   No
##   <25 4475   65
##   25+ 1597   31
## Here is data.frame containing data, 
df <- expand.grid(Cancer = c("Yes", "No"), 
                  Age = c("<25", "25+"))
df <- df[rep(1:4, times = c(4475, 65, 1597, 31)), ]


bootstrap_or <- function(x, n = 1000L) {
  bs <- replicate(n, {
    # Randomly select rows of data frame
    idx <- sample(nrow(x), replace = TRUE)
    samp_df <- x[idx, ]
    tt <- with(samp_df, table(Age, Cancer))
    (tt[1] * tt[4]) / (tt[2] * tt[3])
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

bs <- bootstrap_or(df)

Question 3

This question will involve the Grinnell rainfall data from 2014-2024, with a sample size collected from this data of size \(n = 25\)

rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")

# replace = FALSE because this is our sample, not a bootstrap estimate
set.seed(1)
idx <- sample(nrow(rain), size = 40, replace = FALSE)

# rs = rain sample
rs <- rain[idx, ]

\[ CR = \frac{mean}{median} \] If the mean value is greater than the median, we expect to find \(CR > 1\), and if the median is greater than the mean, we should expect \(CR < 1\); \(CR = 1\) when the mean and median are equal. Find the centrality ratio of our sample data. Does there appear to be compelling evidence that our data are skewed? If so, in which direction?

mm_ratio <- function(x) mean(x) / median(x)

mm_boot <- bootstrap(rs$precip, mm_ratio)