Type the following:
I will not copy and paste anything from this homework assignment page into my own solutions because if I do there is a high probability that my R Markdown file will not render correctly
Specifically, copying in \(\alpha\), \(\mu\) or \(H_0\) will cause it to break
Determine if the following statements are true or false. If they are false, state how they could be corrected.
If a given test statistic is within a 95% confidence interval, it is also within a 99% confidence interval
Decreasing the value of \(\alpha\) will increase the probability of a Type I error
Suppose the null hypothesis for a proportion is \(H_0: p = 0.5\) and we fail to reject. In this case, the true population proportion is equal to 0.5
With large sample sizes, even small differences between the null and observed values can be identified as statistically significant.
Suppose we collect a sample of elephant tusks and, regarding their length, find that in our sample of \(n = 20\) we have
\[ \overline{x} = 128\text{cm}, \qquad \hat{\sigma} = 12\text{cm} \]
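As a sketch of how such a statistic can be computed in R, here is the calculation for a single hypothetical mean \(\mu_0 = 120\) (a made-up value, not one of the assigned means; substitute each assigned mean in its place):

```r
## Sketch: t-statistic for a hypothetical mean mu0 = 120
## (mu0 is made up for illustration -- plug in each assigned mean instead)
xbar <- 128; s <- 12; n <- 20
mu0 <- 120
t_stat <- (xbar - mu0) / (s / sqrt(n))
t_stat
## [1] 2.981424
```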
Part A Several means have been proposed to be the true mean. For each of the means provided below, compute the \(t\)-statistic associated with the sample mean:
Part B Suppose we want to test our hypothesis with 90% confidence. Find the critical values of the appropriate \(t\)-distribution that mark the bounds within which 90% of our test statistics should lie
Part C Which of the hypothesized means in Part A fell within the bounds of the 90% critical values? Which were outside the bounds? Justify your answer with the appropriate test statistics.
Part D Reflecting on Part C, what does it mean if our t-statistic falls within the bounds of our critical values? Does this mean that \(\mu_0\) is likely true? Why or why not?
Part E Reflecting on Part D, what conclusions can you come to if our data is consistent with multiple hypotheses? Why do we say that we “fail to reject the null hypothesis” instead of “accepting the null hypothesis”?
The Happy Planet dataset includes a number of indices of well-being across 143 countries. One of these indices, Life Expectancy, is visualized in the histogram below.
We are interested in determining the proportion of countries in which the average life expectancy is greater than 70 years. To this end, we collected two different samples:
Use this information to answer the following questions
Part A Find the estimated proportion and standard error for each of the samples provided.
Part B Which sample provides more evidence against the null hypothesis \(H_0: p_0 = 0.725\)?
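As a reminder of the mechanics (using made-up numbers, not the samples from this assignment), the estimated proportion and its standard error can be computed along these lines:

```r
## Sketch: estimated proportion and standard error for a hypothetical
## sample of 40 countries, 29 of which exceed 70 years (made-up counts)
x <- 29; n <- 40
p_hat <- x / n                       # estimated proportion
se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
c(estimate = p_hat, se = se)         # se is roughly 0.0706 here
```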
This question uses the penguin dataset
penguin <- read.csv("https://collinn.github.io/data/penguins.csv")
Part A Use the appropriate dplyr functions to create a new variable flipper_size that takes the value "Large" if the flipper is larger than the median flipper length and "Small" if the flipper is less than or equal to the median. Then, create a table of the data tabulating sex and size
Part B What proportion of female penguins are considered small? What proportion are considered large?
Part C Construct an 80% confidence interval on the difference in proportion of large flippers between male and female penguins. Based on this, does there appear to be evidence that the proportion of large flippers is the same in each sex?
Part D Compute a t-statistic for this difference in proportions. If \(H_0: p_1 = p_2\), would you reject or fail to reject \(H_0\) at a Type I error rate of 5% (i.e., using critical values from 95% confidence)?
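For reference, the difference-in-proportions statistic can be set up as sketched below. The counts here are entirely made up and are not taken from the penguin data; the point is only the shape of the calculation:

```r
## Sketch with hypothetical counts (NOT from the penguin data):
## suppose 30 of 60 penguins in one group and 40 of 60 in the other
## are classified as "Large"
p1 <- 30/60; n1 <- 60
p2 <- 40/60; n2 <- 60
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
t_stat <- (p1 - p2) / se   # statistic for H0: p1 = p2
t_stat                     # roughly -1.88 for these made-up counts
## Compare against critical values, e.g. qnorm(c(0.025, 0.975))
```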
The following code will be needed for this short section on bootstrapping
## Better histograms (requires the ggplot2 package)
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
Just as we were able to use qt() and
qnorm() to get the quantiles from a named
distribution, we can use the quantile() function to
determine the quantiles from an actual sample. For example, consider
finding the 5th and 95th quantiles of a normal distribution with \(\mu = 50\) and \(\sigma = 15\):
## Quantiles from normal dist
qnorm(c(0.05, 0.95), mean = 50, sd = 15)
## [1] 25.327 74.673
Here, we input no data and use information from the distribution itself.
Alternatively, suppose that we had a vector x that
represented a sample of 1,000 observations from a normal distribution
with \(\mu = 50\) and \(\sigma = 15\) (the function
rnorm() stands for Random
NORMal). We can use the quantile()
function to find the 5th and 95th quantiles of our sample, which should
be very similar to what we found above. Note that for
quantile(), the order of the arguments is flipped.
## Note that quantile() takes vector first, then quantiles
x <- rnorm(1000, mean = 50, sd = 15) # Randomly generate 1000 from this dist
quantile(x, probs = c(0.05, 0.95)) # Investigate the quantiles directly
## 5% 95%
## 26.031 76.123
Recall that bootstrapping involves taking repeated samples with
replacement and computing statistics on each bootstrapped sample.
This process is performed for us with the bootstrap()
function provided above. bootstrap() takes three
arguments:
The vector or variable you wish to bootstrap
Which statistic you want to bootstrap (e.g., the mean, the median, the odds ratio)
How many bootstrap samples you want to perform. The default is 1000
Suppose I want to bootstrap the mean of the
bill_length_mm variable from the penguin
dataset. To do this, I would run:
## The syntax dataset$variable will extract the variable from a data.frame
boot_stats <- bootstrap(penguin$bill_length_mm, statistic = mean)
## The bootstrap() function returns a dataframe with a row for each bootstrap mean
head(boot_stats)
## Sample Statistic
## 1 1 43.832
## 2 2 44.285
## 3 3 44.115
## 4 4 44.239
## 5 5 44.304
## 6 6 43.815
Using the boot_stats$Statistic notation, I can find the
mean and quantiles from my bootstrapped sample:
## The mean
mean(boot_stats$Statistic)
## [1] 44.006
## 90% confidence interval
quantile(boot_stats$Statistic, probs = c(0.05, 0.95))
## 5% 95%
## 43.523 44.490
(This is the only question that will use the bootstrap and quantile functions)
This question uses the Grinnell Rain dataset. The full dataset includes precipitation data for 121 months; here, we will collect a sample of size \(n = 20\) instead
## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")
## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]
Part A Using your sample rainsub
and the qt() function, attempt to create an 80% confidence
interval using the point estimate \(\pm\) method (median \(\pm \ C \times
\hat{\sigma}/\sqrt{n}\))
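The mechanics of the point estimate \(\pm\) method look like the sketch below, run here on a hypothetical vector y rather than on rainsub (the same steps apply to your sample):

```r
## Sketch of the point-estimate +/- method on a hypothetical vector y
## (NOT the rainsub data -- repeat these steps on your own sample)
set.seed(1)
y <- rexp(20)                      # hypothetical skewed data
C <- qt(0.90, df = length(y) - 1)  # an 80% interval leaves 10% in each tail
median(y) + c(-1, 1) * C * sd(y) / sqrt(length(y))
```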
Part B Use the bootstrap() function
to bootstrap 1,000 samples of the median statistic. With
your resulting data frame, create a histogram of the sampling
distribution. Based on this, does it seem like the confidence interval
you found in Part A is appropriate? Why or why not?
Part C Use the quantile() function
to create an 80% confidence interval for the median. How does this
compare with what you found in Part A?
Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the intervals you constructed in Part A? How about Part B? Why did it work for one and not the other?