library(ggplot2)
library(dplyr)
# Prettier graphs
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    # Resample the data with replacement, then compute the statistic on the resample
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
This homework contains several sections with included code that is intended to be run, either to load functions or to create samples or subsets of the data. I encourage you to try to understand what the code is doing, as this may shed light on the underlying processes involved.
A network operator has compiled data recording, each minute, the number of users connected to the internet through a server. We can access this data built-in from R with the command:
data(WWWusage)
To plan upgrades for a variety of possible future scenarios, you have been asked to perform an analysis to determine how many users we should be expecting per minute, on average.
Part A: This data is stored in R as a vector. Determine how many observations we have in our set, and then convert to a data frame to view the distribution of the data as a histogram using 10 bins. Do you notice any patterns in internet usage?
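One way this might look is sketched below; the data frame and column names (www_df, users) are ours for illustration:
length(WWWusage)  # number of observations in the vector
www_df <- data.frame(users = as.numeric(WWWusage))
ggplot(www_df, aes(x = users)) + gh(bins = 10)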
Part B: Using the bootstrap function provided above, find and report a 95% confidence interval for the mean.
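As a minimal sketch, one way to form the interval is to take the 2.5% and 97.5% quantiles of the bootstrapped means (the name bs_mean is ours):
bs_mean <- bootstrap(WWWusage, mean)
quantile(bs_mean$Statistic, c(0.025, 0.975))  # 95% percentile interval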
Part C: Upon inspection of the server equipment, it appears that the network has not been able to adequately detect traffic in ranges below 100 or above 180, with numbers in these ranges being recorded at their caps (i.e., traffic at level 82 would be recorded at 100 and traffic at 192 would be recorded at 180). This is reflected in the updated capped data below. Using this capped data, recreate a 95% confidence interval and compare it with the interval you found in Part B. How has it changed, and what explanation for this do you have? (Hint: Consider the distribution you visualized in Part A. What data is getting censored, and how might this impact our intervals?)
## Capped data
capped <- WWWusage
## Values outside of this range are censored
capped[capped < 100] <- 100
capped[capped > 180] <- 180
cleaned <- WWWusage
## Remove 'incorrect' values
cleaned <- cleaned[cleaned > 100 & cleaned < 180]
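For Part C, the same bootstrap approach from Part B should carry over to the capped data; a minimal sketch (the name bs_capped is ours):
bs_capped <- bootstrap(capped, mean)
quantile(bs_capped$Statistic, c(0.025, 0.975))  # compare with the interval from Part B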
Part D: Are capped and cleaned representative samples of the population in question? If you had to choose between capped and cleaned for your analysis, which would you prefer?
The Centers for Disease Control conducted a study on 6,168 women in hopes of finding factors related to breast cancer. In particular, they considered at which age a woman gave birth to her first child and computed the following contingency table:
##      Cancer
## Age    Yes   No
##   <25  4475  65
##   25+  1597  31
Part A: Compute the odds for both groups of women (those <25 years old at the delivery of their first child and those 25+ years old) for developing breast cancer.
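As a sketch, the odds for each group are the count of Yes cases divided by the count of No cases (the variable names here are ours for illustration):
odds_young <- 4475 / 65   # odds for women <25 at first birth
odds_old   <- 1597 / 31   # odds for women 25+ at first birth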
Part B: Compute the odds ratio of getting breast cancer for those younger than 25 compared to those older than 25. Based on this point estimate, would you say that women <25 have increased odds of getting breast cancer?
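Continuing the sketch from Part A, the odds ratio is simply the ratio of the two odds:
odds_young / odds_old   # point estimate of the odds ratio, roughly 1.34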
Part C: Unlike the mean, a statistic which has a measure of dispersion such as the standard deviation, there is no immediate deviation metric available for the odds ratio. Use the function defined below, bootstrap_or(), to compute a 95% confidence interval for the odds ratio. Based on this, would you say that there is evidence to suggest there is increased risk associated with early childbirth?
Part D (not graded for credit): For a simple numeric vector, we simply resampled the vector itself to create our bootstrapped sample. How would you describe the bootstrapping process for this problem? You might consider the bootstrap_or() function below when summarizing your answer.
## Here is a data.frame containing the data
df <- expand.grid(Cancer = c("Yes", "No"),
                  Age = c("<25", "25+"))
## Repeat each of the four combinations according to its count in the table
df <- df[rep(1:4, times = c(4475, 65, 1597, 31)), ]
bootstrap_or <- function(x, n = 1000L) {
  bs <- replicate(n, {
    # Randomly select rows of the data frame, with replacement
    idx <- sample(nrow(x), replace = TRUE)
    samp_df <- x[idx, ]
    tt <- with(samp_df, table(Age, Cancer))
    # Odds ratio via the cross-product of the 2x2 table
    (tt[1] * tt[4]) / (tt[2] * tt[3])
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
bs <- bootstrap_or(df)
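With bs computed, a 95% percentile interval can be pulled from the bootstrapped statistics, as in this sketch:
quantile(bs$Statistic, c(0.025, 0.975))  # 95% interval for the odds ratio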
This question will involve the Grinnell rainfall data from 2014-2024, from which we collect a sample of size \(n = 40\):
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
# replace = FALSE because this is our sample, not a bootstrap estimate
set.seed(1)
idx <- sample(nrow(rain), size = 40, replace = FALSE)
# rs = rain sample
rs <- rain[idx, ]
Part A: Plot the distribution of the population precipitation data in rain and compare it with that of our random sample in rs. Do they appear to be reasonably close? If we were not able to verify the population distribution, what can we do to ensure the quality of our sample?
Part B: One way to assess skew is with the centrality ratio, the mean divided by the median: \[ CR = \frac{\text{mean}}{\text{median}} \] If the mean value is greater than the median, we expect to find \(CR > 1\), and if the median is greater than the mean, we should expect \(CR < 1\); \(CR = 1\) when the mean and median are equal. Find the centrality ratio of our sample data. Does there appear to be compelling evidence that our data are skewed? If so, in which direction?
Part C: Using the bootstrap() function from the top of this homework along with the mm_ratio() statistic defined below, compute a 95% confidence interval for the centrality ratio of our sample.
mm_ratio <- function(x) mean(x) / median(x)  # centrality ratio statistic
mm_boot <- bootstrap(rs$precip, mm_ratio)
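A sketch of extracting the interval for Part C, again using the 2.5% and 97.5% quantiles:
quantile(mm_boot$Statistic, c(0.025, 0.975))  # 95% interval for the centrality ratio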
Part D: Based on your bootstrapped interval, would you say there is compelling evidence from our sample that our data are skewed? What do you think in light of this, having seen the distribution of the entire population?
Part E: Using the mm_boot that you computed in the previous part, create a histogram showing the sampling distribution of the centrality ratio statistic. What do you notice about this distribution? Would the 68-95-99 rule be applicable here?
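A minimal sketch of what Part E might look like, using the gh() helper from the top of the homework (the bin count is our choice):
ggplot(mm_boot, aes(x = Statistic)) + gh(bins = 20)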