Homework #4

Question 1

This problem involves constructing a dataset to organize observations for future study.

Begin by downloading here a compressed folder containing images of fungi growth started in petri dishes. While the images themselves will be processed at a later date, for now we need to create a record of current observations. This will involve constructing a dataset that has the following variables:

Plate: A record of the plate number, an integer value 1-50
Strain: The strain of fungi being recorded, taking the values T2, T4, or T7
Competitor: A value of "C" or "NC" according to whether or not the fungi sample is grown with a competing species
Time_Plated: The date and time the fungi were plated. For each observation, this is the same value: February 15, 2026 at 8AM
Time_Observed: The date and time the image was generated. This should be reported in 24H time (e.g., 16:00 instead of 4PM)
Duration: Amount of time between plating and observation

All of this information is recorded in the file or path name of the images (the image files are empty). Your submission should include the code to generate this database, along with the use of head() to show the first 10 rows.

Question 2

Bootstrapping is a statistical re-sampling technique that is used to construct confidence intervals for a given statistic. The general idea of the bootstrap is that from a given sample, we can construct “new” samples by simpling resampling with replacement from the original. The image below demonstrates how a single sample can be used to produce several other bootstrap samples

The general process works like this:

Begin with a vector x. This could be the column of a data.frame (e.g., df$x)
Decide on a statistic statistic that we wish to bootstrap. This should be a function of x
Decide on a number of bootstraps samples, B. A typical value is B = 1000, but this should be able to change
For each bootstrap, we should collect a sample() with replacement from the vector x the same length as x. See ?sample()
From each sample, we should find the value of stat(x) and record it. We will end up with a length B vector of statistics
Return this vector as a data.frame

From this vector of statistics, we should then:

Find the mean. This is our estimate of the statistic
Find the quantiles of this vector using quantile(). This should give us a confidence interval

Here is an example of using a bootstrap function to find the trimmed mean of a vector

## Here is my data
set.seed(123)
x <- rnorm(n = 25, mean = 10, sd = 2)

## Here is trimmed mean function
trim_mean <- function(x) {
  mean(x, trim = 0.1)
}

## Calling my bootstrap function with 3 arguments, returns a data.frame
boot <- bootstrap(x = x, statistic = trim_mean, B = 500)

## Find mean and 90% confidence interval
mean(boot$Statistic)

## [1] 9.8558

quantile(boot$Statistic, probs = c(0.05, 0.95))

##      5%     95% 
##  9.2312 10.5353

By returning a data.frame, we are making it easy to quickly plot the sampling distribution of a statistic.

library(ggplot2)
ggplot(boot, aes(Statistic)) + 
  geom_histogram(color = "black", fill = "gray80", bins = 12) +
  ggtitle("Sampling distribution of trimmed mean")

Part 1: Create your own function bootstrap() that behaves like the one above. You can verify it works by passing in the same statistic function and randomly generated x from above.
Part 2: We are going to use our bootstrapping function to create a new statistic to investigate the Grinnell rainfall data rainsub given below:

## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")

## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]

First, create a function that generates a statistic consisting of a ratio of the mean to the median (i.e., mean(x) / median(x)). If this ratio is close to one, the mean and the median are similar and the data are not skewed. If it is larger or smaller than one, there is evidence of skew.
Using this statistic, bootstrap the sample rainsub with B = 1000 to generate an estimate of the sampling distribution. Find a 90% confdience interval for this statistic. Does this interval contain 1?
Write your conclusion to the null hypothesis: There is no skew present in the distribution of rainfall in Grinnell, IA

Homework #4

Question 1

Question 2

Question 3