library(ggplot2)
library(dplyr)
# Prettier graphs
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    # Resample the data with replacement, then compute the statistic on the resample
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
This homework contains several sections with included code that is intended to be run, either to load functions or to create samples or subsets of the data. I encourage you to try to understand what the code is doing, as this may shed light on the underlying processes involved.
A network operator has compiled data recording, each minute, the number of users connected to the internet through a server. We can access this data built-in from R with the command:
data(WWWusage)
To plan upgrades for a variety of possible future scenarios, you have been asked to perform an analysis to determine how many users we should be expecting per minute, on average.
Part A: This data is stored in R as a vector. Determine how many observations we have in our set, and then convert to a data frame to view the distribution of the data as a histogram using 10 bins. Do you notice any patterns in internet usage?
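One way this might look is sketched below; the data frame and column names (www_df, users) are ours for illustration:
length(WWWusage)  # number of observations in the vector
www_df <- data.frame(users = as.numeric(WWWusage))
ggplot(www_df, aes(x = users)) + gh(bins = 10)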
Part B: Using the bootstrap function provided above, find and report a 95% confidence interval for the mean.
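As a minimal sketch, one way to form the interval is to take the 2.5% and 97.5% quantiles of the bootstrapped means (the name bs_mean is ours):
bs_mean <- bootstrap(WWWusage, mean)
quantile(bs_mean$Statistic, c(0.025, 0.975))  # 95% percentile interval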
Part C: Upon inspection of the server equipment, it appears that the network has not been able to adequately detect traffic in ranges below 100 or above 180, with numbers in these ranges being recorded at their caps (i.e., traffic at level 82 would be recorded at 100 and traffic at 192 would be recorded at 180). This is reflected in the updated capped data below. Using this capped data, recreate a 95% confidence interval and compare it with the interval you found in Part B. How has it changed, and what explanation for this do you have? (Hint: Consider the distribution you visualized in Part A. What data is getting censored, and how might this impact our intervals?)
## Capped data
capped <- WWWusage
## Values outside of this range are censored
capped[capped < 100] <- 100
capped[capped > 180] <- 180
cleaned <- WWWusage
## Remove 'incorrect' values
cleaned <- cleaned[cleaned > 100 & cleaned < 180]
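For Part C, the same bootstrap approach from Part B should carry over to the capped data; a minimal sketch (the name bs_capped is ours):
bs_capped <- bootstrap(capped, mean)
quantile(bs_capped$Statistic, c(0.025, 0.975))  # compare with the interval from Part B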
Part D: Are capped and cleaned representative samples of the population in question? If you had to choose between capped and cleaned for your analysis, which would you prefer?
The Centers for Disease Control conducted a study on 6,168 women in hopes of finding factors related to breast cancer. In particular, they considered at which age a woman gave birth to her first child and computed the following contingency table:
##      Cancer
## Age    Yes   No
##   <25  4475  65
##   25+  1597  31
Part A: Compute the odds for both groups of women (those <25 years old at the delivery of their first child and those 25+ years old) for developing breast cancer.
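As a sketch, the odds for each group are the count of Yes cases divided by the count of No cases (the variable names here are ours for illustration):
odds_young <- 4475 / 65   # odds for women <25 at first birth
odds_old   <- 1597 / 31   # odds for women 25+ at first birth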
Part B: Compute the odds ratio of getting breast cancer for those younger than 25 compared to those older than 25. Based on this point estimate, would you say that women <25 have increased odds of getting breast cancer?
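Continuing the sketch from Part A, the odds ratio is simply the ratio of the two odds:
odds_young / odds_old   # point estimate of the odds ratio, roughly 1.34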
Part C: Unlike the mean, a statistic which has a measure of dispersion such as the standard deviation, there is no immediate deviation metric available for the odds ratio. Use the function defined below, bootstrap_or(), to compute a 95% confidence interval for the odds ratio. Based on this, would you say that there is evidence to suggest there is increased risk associated with early childbirth?
Part D (not graded for credit): For a simple numeric vector, we simply resampled the vector itself to create our bootstrapped sample. How would you describe the bootstrapping process for this problem? You might consider the bootstrap_or() function below when summarizing your answer.
## Here is a data.frame containing the data
df <- expand.grid(Cancer = c("Yes", "No"),
                  Age = c("<25", "25+"))
## Repeat each of the four combinations according to its count in the table
df <- df[rep(1:4, times = c(4475, 65, 1597, 31)), ]
bootstrap_or <- function(x, n = 1000L) {
  bs <- replicate(n, {
    # Randomly select rows of the data frame, with replacement
    idx <- sample(nrow(x), replace = TRUE)
    samp_df <- x[idx, ]
    tt <- with(samp_df, table(Age, Cancer))
    # Odds ratio via the cross-product of the 2x2 table
    (tt[1] * tt[4]) / (tt[2] * tt[3])
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
bs <- bootstrap_or(df)
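With bs computed, a 95% percentile interval can be pulled from the bootstrapped statistics, as in this sketch:
quantile(bs$Statistic, c(0.025, 0.975))  # 95% interval for the odds ratio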
This question will involve the Grinnell rainfall data from 2014-2024, from which we collect a sample of size \(n = 40\):
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
# replace = FALSE because this is our sample, not a bootstrap estimate
set.seed(1)
idx <- sample(nrow(rain), size = 40, replace = FALSE)
# rs = rain sample
rs <- rain[idx, ]
Part A: Plot the distribution of the population precipitation data in rain and compare it with that of our random sample in rs. Do they appear to be reasonably close? If we were not able to verify the population distribution, what can we do to ensure the quality of our sample?
Part B: One way to assess skew is with the centrality ratio, the mean divided by the median: \[ CR = \frac{\text{mean}}{\text{median}} \] If the mean value is greater than the median, we expect to find \(CR > 1\), and if the median is greater than the mean, we should expect \(CR < 1\); \(CR = 1\) when the mean and median are equal. Find the centrality ratio of our sample data. Does there appear to be compelling evidence that our data are skewed? If so, in which direction?
Part C: Using the bootstrap() function from the top of this homework along with the mm_ratio() statistic defined below, compute a 95% confidence interval for the centrality ratio of our sample.
mm_ratio <- function(x) mean(x) / median(x)  # centrality ratio statistic
mm_boot <- bootstrap(rs$precip, mm_ratio)
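A sketch of extracting the interval for Part C, again using the 2.5% and 97.5% quantiles:
quantile(mm_boot$Statistic, c(0.025, 0.975))  # 95% interval for the centrality ratio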
Part D: Based on your bootstrapped interval, would you say there is compelling evidence from our sample that our data are skewed? What do you think in light of this, having seen the distribution of the entire population?
Part E: Using the mm_boot that you computed in the previous part, create a histogram showing the sampling distribution of the centrality ratio statistic. What do you notice about this distribution? Would the 68-95-99 rule be applicable here?
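A minimal sketch of what Part E might look like, using the gh() helper from the top of the homework (the bin count is our choice):
ggplot(mm_boot, aes(x = Statistic)) + gh(bins = 20)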