Question 0

Type the following:

I will not copy and paste anything from this homework assignment page into my own solutions because if I do there is a high probability that my R Markdown file will not render correctly

Specifically, copying in $\alpha$, $\mu$ or $H_0$ will cause it to break

Question 1

Determine if the following statements are true or false. If they are false, state how they could be corrected.

If a given test statistic is within a 95% confidence interval, it is also within a 99% confidence interval
Decreasing the value of $\alpha$ will increase the probability of a Type I error
Suppose the null hypothesis for a proportion is $H_0: p = 0.5$ and we fail to reject. In this case, the true population proportion is equal to 0.5
With large sample sizes, even small differences between the null and observed values can be identified as statistically significant.

Question 2

Suppose we collect a sample of elephant tusks and, regarding their length find that in our sample of $n = 20$ we have

\[ \overline{x} = 128\text{cm}, \qquad \hat{\sigma} = 12\text{cm} \]

Part A Several means have been proposed to be the true mean. For each of the means provided below, compute the $t$-statistic associated with the sample mean:

$\mu_0 = 125$
$\mu_0 = 132$
$\mu_0 = 115$

Part B Suppose we are wanting to test our hypothesis with 90% confidence. Find the critical values of the appropriate $t$-distribution that mark the bounds of where 90% of our data should lie

Part C Which of the hypothesized means in Part A fell within the bounds of the 90% critical values? Which were outside the bounds? Justify your answer with the appropriate test statistics.

Part D Reflecting on Part C, what does it mean if our t-statistic falls within the bounds of our critical values? Does this mean that $\mu_0$ is likely true? Why or why not?

Part E Reflecting on Part D, what conclusions can you come to if our data is consistent with multiple hypotheses? Why do we say that we “fail to reject the null hypothesis” instead of “accepting the null hypothesis”?

Question 3

The Happy Planet dataset includes a number of indices of well-being across 143 countries. One of these indices, Life Expectancy, is visualized in the histogram below.

We are interested in determining the proportion of countries in which the average life expectancy is greater than 70 years. To this end, we collected two different samples:

Sample 1: 13 countries had life expectancy greater than 70, 12 did not
Sample 2: 16 countries had life expectancy greater than 70, 14 did not

Use this information to answer the following questions

Part A Find the estimated proportion and standard error for each of the samples provided.

Part B Which sample provides more evidence against the null hypothesis $H_0: p_0 = 0.725$?

Question 4

This question uses the penguin dataset

penguin <- read.csv("https://collinn.github.io/data/penguins.csv")

Part A Use the appropriate dplyr functions to:

Subset the dataset to include only Chinstrap penguins
Create a new variable called flipper_size that takes the value "Large" if the flipper is larger than the median flipper length and "Small" if it flipper is smaller or less than the median

Then, create a table of the data tabulating sex and size

Part B What proportion of female penguins are considered small? What proportion are considered large?

Part C Construct an 80% confidence interval on the difference in proportion of large flippers between male and female penguins. Based on this, does there appear to be evidence that the proportion of large flippers is the same in each sex?

Part D Create a t-statistic for this difference in proportion. If $H_0: p_1 = p_2$, would you reject or fail to reject $H_0$ for a Type I error rate of 5% (i.e., critical values from 95% confidence)

Bootstrapping

The following code will be needed for this short section on bootstrapping

## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrap Function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

Quantile Function

Just as we were able to use qt() and qnorm() to get the quantiles from a named distribution, we can use the quantile() function to determine the quantiles from an actual sample. For example, consider finding the 5th and 95th quantiles of a normal distribution with $\mu = 50$ and $\sigma = 15$:

## Quantiles from normal dist
qnorm(c(0.05, 0.95), mean = 50, sd = 15)

## [1] 25.327 74.673

Here, we input no data and use information from the distribution itself.

Alternatively, suppose that we had a vector x that represented a sample of 1,000 observations from a normal distribution with $\mu = 50$ and $\sigma = 15$ (the function rnorm() stands for Random NORMal). We can use the quantile() function to find the 5th and 9th quantiles of our sample, which should be very similar to what we found above. Note that for quantile(), the order of the arguments is flipped.

## Note that quantile() takes vector first, then quantiles
x <- rnorm(1000, mean = 50, sd = 15) # Randomly generate 1000 from this dist
quantile(x, probs = c(0.05, 0.95)) # Investigate the quantiles directly

##     5%    95% 
## 26.031 76.123

Bootstrapping

Recall that bootstrapping involves taking repeated samples with replacement and computing statistics on each bootstrapped sample. This process is performed for us with the bootstrap() function provided above. bootstrap() takes three arguments:

The vector or variable you wish to bootstrap
Which statistic you want to bootstrap (e.g., the mean, the median, the odds ratio)
How many bootstrap samples you want to perform. The default is 1000

Suppose I want to bootstrap the mean of the bill_length_mm variable from the penguin dataset. To do this, I would do:

## The syntax dataset$variable will extract the variable from a data.frame
boot_stats <- bootstrap(penguin$bill_length_mm, statistic = mean)

## The bootstrap() function returns a dataframe with a row for each bootstrap mean
head(boot_stats)

##   Sample Statistic
## 1      1    43.832
## 2      2    44.285
## 3      3    44.115
## 4      4    44.239
## 5      5    44.304
## 6      6    43.815

Using the boot_stats$Statistic notation, I can find the mean and quantiles from my bootstrapped sample:

## The mean
mean(boot_stats$Statistic)

## [1] 44.006

## 90% confidence interval
quantile(boot_stats$Statistic, probs = c(0.05, 0.95))

##     5%    95% 
## 43.523 44.490

Question 5

(This is the only question that will use the bootstrap and quantile functions)

This question uses the Grinnell Rain dataset. Typically, this dataset includes precipitation data on 121 months; here, we will collect a sample of size $n = 20$ instead

## Load data
rain <- read.csv("https://collinn.github.io/data/grinnell_rain.csv")

## Subset
set.seed(10)
idx <- sample(1:nrow(rain), size = 20)
rainsub <- rain[idx, ]

Part A Using your sample rainsub and the qt(), attempt to create an 80% confidence interval using the point estimate $\pm$ method (median $\pm \ C \times \hat{\sigma}/\sqrt{n}$)
Part B Use the bootstrap() function to bootstrap 1,000 samples of the median statistic. With your resulting data frame, create a histogram of the sampling distribution. Based on this, does it seem like the confidence interval you found in Part A is appropriate? Why or why not?
Part C Use the quantile() function to create an 80% confidence interval for the median. How does this compare with what you found in Part A?
Part D Now using the full rain dataset, find the true median value of the population. Does it fall within the intervals you constructed in Part A? How about Part B? Why did it work for one and not the other?

Homework 5

Question 0

Question 1

Question 2

Question 3

Question 4

Bootstrapping

Quantile Function

Bootstrapping

Question 5