This problem involves constructing a dataset to organize observations for future study.
Begin by downloading the compressed folder containing images of fungi growth started in petri dishes. While the images themselves will be processed at a later date, for now we need to create a record of current observations. This will involve constructing a dataset that has the following variables:
- "C" or "NC" according to whether or not the fungi sample is grown with a competing species
All of this information is recorded in the file or path name of the
images (the image files are empty). Your submission should include the
code to generate this dataset, along with a call to
head() with n = 10 to show the first 10 rows.
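As a starting point, here is a minimal sketch of turning path names into columns. The folder layout, the folder name fungi, and the species column are hypothetical and exist only for illustration; the real paths will need their own parsing rules.

```r
# Hypothetical paths; the real folder layout may differ, so adjust the parsing.
# In practice these might come from something like:
#   paths <- list.files("fungi", recursive = TRUE, full.names = TRUE)
paths <- c(
  "fungi/speciesA/C/plate01.jpg",
  "fungi/speciesA/NC/plate02.jpg",
  "fungi/speciesB/C/plate01.jpg"
)

# Split each path on "/" into its components
parts <- strsplit(paths, "/", fixed = TRUE)

# Pull the same position out of every split path to build each column
fungi <- data.frame(
  species     = sapply(parts, `[`, 2),  # hypothetical variable
  competition = sapply(parts, `[`, 3),  # "C" or "NC"
  file        = basename(paths),
  stringsAsFactors = FALSE
)

# head() shows 6 rows by default, so ask for 10 explicitly
head(fungi, n = 10)
```

If some of the information sits in the file name rather than the folder names, the same strsplit() idea can be applied to basename(paths) with a different separator.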
Bootstrapping is a statistical re-sampling technique that is used to construct confidence intervals for a given statistic. The general idea of the bootstrap is that from a given sample, we can construct “new” samples by simply resampling with replacement from the original. The image below demonstrates how a single sample can be used to produce several other bootstrap samples.
The general process works like this:
1. Start with a vector x. This could be the column of a data.frame (e.g., df$x)
2. Choose a statistic that we wish to bootstrap. This should be a function of x
3. Choose a number of bootstrap samples B. A typical value is B = 1000, but this should be able to change
4. Use sample() to draw with replacement from the vector x a sample of the same length as x. See ?sample()
5. Compute the statistic on this resample, stat(x), and record it. Repeating steps 4 and 5 B times, we will end up with a length B vector of statistics
From this vector of statistics, we should then:
- Find its mean
- Find its quantiles with quantile(). This should give us a confidence interval
Here is an example of using a bootstrap function to find the trimmed mean of a vector:
## Here is my data
set.seed(123)
x <- rnorm(n = 25, mean = 10, sd = 2)
## Here is trimmed mean function
trim_mean <- function(x) {
mean(x, trim = 0.1)
}
## Calling my bootstrap function with 3 arguments, returns a data.frame
boot <- bootstrap(x = x, statistic = trim_mean, B = 500)
## Find mean and 90% confidence interval
mean(boot$Statistic)
## [1] 9.8558
quantile(boot$Statistic, probs = c(0.05, 0.95))
## 5% 95%
## 9.2312 10.5353
By returning a data.frame, we are making it easy to quickly plot the sampling distribution of a statistic.
library(ggplot2)
ggplot(boot, aes(Statistic)) +
geom_histogram(color = "black", fill = "gray80", bins = 12) +
ggtitle("Sampling distribution of trimmed mean")
Part 1: Create your own function
bootstrap() that behaves like the one above. You can verify
it works by passing in the same statistic function and
randomly generated x from above.
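One possible sketch of such a function follows. It assumes the return value is a data.frame with a single column named Statistic, as implied by boot$Statistic in the example above. It is not necessarily the implementation that produced the output shown there, so the numbers it gives may differ slightly.

```r
# A sketch of one possible bootstrap(); not necessarily the version used above
bootstrap <- function(x, statistic, B = 1000) {
  # Pre-allocate a numeric vector to hold the B bootstrap statistics
  stats <- numeric(B)
  for (i in seq_len(B)) {
    # Resample x with replacement, keeping the original length
    resample <- sample(x, size = length(x), replace = TRUE)
    # Evaluate the statistic on the resample and record it
    stats[i] <- statistic(resample)
  }
  # Return a data.frame so the result is easy to summarize and plot
  data.frame(Statistic = stats)
}

# Same data and statistic as in the example above
set.seed(123)
x <- rnorm(n = 25, mean = 10, sd = 2)
trim_mean <- function(x) mean(x, trim = 0.1)

boot <- bootstrap(x = x, statistic = trim_mean, B = 500)
mean(boot$Statistic)
quantile(boot$Statistic, probs = c(0.05, 0.95))
```

One caveat with this sketch: when x is a single number n, sample() samples from 1:n rather than from x itself, so the function is only meaningful for vectors with more than one observation.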
Part 2: We are going to use our bootstrapping
function with a new statistic to investigate the Grinnell rainfall
data rainsub given below:
## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")
## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]
First, create a function that generates a statistic consisting of
a ratio of the mean to the median (i.e.,
mean(x) / median(x)). If this ratio is close to one, the
mean and the median are similar and the data are not skewed. If it is
larger or smaller than one, there is evidence of skew.
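A minimal sketch of this statistic is shown here on two small made-up vectors. The function name mean_median_ratio is only illustrative, and the sketch assumes the median is not zero.

```r
# Ratio of mean to median; the name mean_median_ratio is illustrative
mean_median_ratio <- function(x) {
  mean(x) / median(x)
}

symmetric    <- c(1, 2, 3, 4, 5)   # mean 3, median 3
right_skewed <- c(1, 2, 3, 4, 50)  # mean 12, median 3

mean_median_ratio(symmetric)     # 1
mean_median_ratio(right_skewed)  # 4
```

The single large value in the second vector pulls the mean up while leaving the median unchanged, which is why the ratio moves away from one.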
Using this statistic, bootstrap the sample rainsub
with B = 1000 to generate an estimate of the sampling
distribution. Find a 90% confidence interval for this statistic. Does
this interval contain 1?
Write your conclusion about the null hypothesis: there is no skew present in the distribution of rainfall in Grinnell, IA.
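To illustrate how the interval check might look in code, here is a sketch on simulated data. The vector y is drawn from an exponential distribution purely as a stand-in (an assumption, not the rainfall data), and replicate() is used as a compact alternative to a dedicated bootstrap function.

```r
set.seed(1)
# Simulated right-skewed data standing in for a real sample (an assumption)
y <- rexp(n = 20, rate = 1)

# Compact bootstrap of the mean/median ratio using replicate()
ratios <- replicate(1000, {
  resample <- sample(y, size = length(y), replace = TRUE)
  mean(resample) / median(resample)
})

# 90% percentile interval: the 5th and 95th percentiles
ci <- quantile(ratios, probs = c(0.05, 0.95))
ci

# TRUE if the interval covers 1, FALSE otherwise
contains_one <- ci[[1]] <= 1 && 1 <= ci[[2]]
contains_one
```

An interval that excludes 1 would count as evidence against the no-skew hypothesis, while an interval that includes 1 only means the sample is consistent with no skew, not that skew is absent.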