## Copy and paste the code chunk below into the top of your R Markdown document
library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")

Introduction

This lab is intended to serve as an opportunity to practice some old skills and learn some new ones. In particular, our goals are to practice using:

  • qnorm() and qt()
  • dplyr to create variables for analysis
  • R as a calculator (order of operations)

Quantile functions

Briefly, recall that the qnorm() and qt() functions return the quantiles of a distribution, which we use to find the critical values associated with a confidence interval.

## Quantiles for 80% CI for a standard normal
qnorm(c(0.1, 0.9))
## [1] -1.2816  1.2816
## For t-distribution, need to include df
qt(c(0.1, 0.9), df = 15)
## [1] -1.3406  1.3406

Because the critical values will always be negatives of each other, we can shorten this and simply ask for the positive (or negative) one:

qnorm(0.95)
## [1] 1.6449

This tells me that my critical values for a 90% confidence interval are -1.645 and 1.645.
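In general, for a confidence level \(C\) we set \(\alpha = 1 - C\) and ask for the \(1 - \alpha/2\) quantile. A small sketch of that pattern:

```r
## Critical value for a 90% confidence interval, standard normal
conf <- 0.90
alpha <- 1 - conf      # alpha = 0.10

qnorm(1 - alpha / 2)   # the positive critical value
## [1] 1.6449
```

The same pattern works with qt(), provided we also supply the degrees of freedom.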

dplyr

The reference lab can be found here
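As a quick refresher (a minimal sketch using the built-in mtcars data rather than this lab's datasets), the two dplyr verbs we will lean on are filter(), which keeps rows matching a condition, and mutate(), which creates new variables; ifelse() inside mutate() builds a categorical variable, a pattern that returns in Problem 3:

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%                                     # keep only 4-cylinder cars
  mutate(Size = ifelse(wt > 2.5, "Heavy", "Light")) %>%    # new categorical variable
  count(Size)                                              # tally each category
```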

Hypothesis Testing and Confidence Intervals in R

There are many ways to go about testing hypotheses and constructing confidence intervals in R. Here we will explore two. The first involves a manual construction: we find the critical values, determine the standard error, and compute the associated values. While this is a bit more laborious, it is computationally very quick and lets us easily save or validate our pieces as we go. The second method introduces t.test(), a function built into R that allows us to quickly perform all kinds of t tests.

CalculatoR

This may be obvious, but it is worthwhile understanding convenient ways to use R as a calculator. For example, we saw in the lecture slides (and homework) that the Johns Hopkins preterm study had 39 total babies, of which 31 survived at 6 months. If I wanted a confidence interval for this, I may be tempted to type:

(31/39) + qt(c(0.05, 0.95), df = 38) * sqrt((31/39) * (1 - 31/39) / 39)
## [1] 0.68586 0.90388

This expression is messy, however, and can easily hide errors.

More simply, we can take advantage of the fact that R allows us to assign values to variables. This lets us compute the pieces of our expression in advance:

## Assign p and n
p <- 31/39
n <- 39

## Standard error formula
se <- sqrt( p * (1-p) / n )

## Critical values (df = n - 1 = 38)
C <- qt(c(0.05, 0.95), df = 38)

## Confidence interval
p + C * se
## [1] 0.68586 0.90388

Note that R follows the standard order of operations (PEMDAS).
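To see why the parentheses matter, compare the correct standard error with a version where one parenthesis is misplaced:

```r
p <- 31/39
n <- 39

## Correct: the entire expression sits inside sqrt()
sqrt(p * (1 - p) / n)
## [1] 0.064659

## Misplaced parenthesis: only p is under the square root
sqrt(p) * (1 - p) / n
## [1] 0.0046893
```

Both lines run without error, which is exactly why this kind of mistake is easy to miss.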

Using t.test()

In R, we can perform hypothesis testing and create standard confidence intervals based on the \(t\)-distribution using the function t.test(). While the t.test() function takes a number of useful arguments, for now we only concern ourselves with three:

  • x, a numeric vector we wish to construct a confidence interval for
  • mu, a single number giving the value of the mean under the null hypothesis
  • conf.level, the confidence level of the interval, equal to \(1 - \alpha\)

The output of the t.test() function will include a collection of useful information, including the point estimate and confidence interval for the variable. For example, here we use the built-in mtcars dataset along with t.test() to find a 90% confidence interval for average miles per gallon:

t.test(mtcars$mpg, conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

From the output, you should note a few things:

  1. At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)

  2. Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.28, 21.90)\)

  3. All of the information above this is related to hypothesis testing. This includes:

    • A \(t\)-statistic, equal to \(t = 18.9\)
    • Degrees of freedom equal to \(df = 31\)
    • A \(p\)-value of \(p < 2 \times 10^{-16}\)

However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.

By default, the t.test() function assumes a null hypothesis of \(H_0: \mu_0 = 0\). As our t-statistic is computed as

\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
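We can check this by computing the statistic by hand; every piece below comes straight from the sample:

```r
xbar <- mean(mtcars$mpg)   # sample mean
s <- sd(mtcars$mpg)        # sample standard deviation
n <- nrow(mtcars)          # sample size, n = 32

## t-statistic under the default null H0: mu0 = 0
(xbar - 0) / (s / sqrt(n))   # about 18.86, i.e., the t = 18.9 reported above
```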

To see a more realistic example, suppose we wanted to investigate the hypothesis that the average miles per gallon was equal to 18 mpg. That is, suppose we have \(H_0: \mu_0 = 18\). We could evaluate this hypothesis using the t.test function with the mu argument:

t.test(x = mtcars$mpg, 
       mu = 18, 
       conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

Note that neither our point estimate nor our 90% confidence interval changed. What did change, however, were the calculated t-statistic and the p-value.

Problem 1

Below is code to recreate the two subsets of the hawk data containing only Red-tailed hawks:

## RT Hawk data
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
hawks <- filter(hawks, Species == "RT")

## Subset of RT Hawk data
set.seed(89)
idx <- sample(seq_len(nrow(hawks)), size = 20)
hawks2 <- hawks[idx, ]

  • Part A: Perform a t-test on both hawks and hawks2 to test the hypothesis that the variable Weight has a mean value of \(\mu_0 = 1050\). According to the output, what is the sample mean for weight for each sample? If you were testing at the \(\alpha = 0.05\) level, which would you conclude for each test?

  • Part B: Explain why the conclusions you came to were different in Part A.

Problem 2

The following code takes a sample of size \(n = 20\) from the college dataset. The function set.seed() takes an integer value that, when run, ensures that any random operations we perform (such as sampling) will take the same value each time. Here, we set our seed to be 123. Copy all of the code below into your own document.

## Controls the randomness in the sampling step
set.seed(123)

## Sample 20 row numbers
idx <- sample(1:nrow(college), size = 20)

## Only keep those rows that were sampled
c_samp <- college[idx, ]

Done this way, we can all be sure that we are working with an identical sample.
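To convince yourself of this, note that resetting the seed reproduces exactly the same draw (a small illustration, separate from the problem itself):

```r
set.seed(123)
a <- sample(1:10, size = 3)

set.seed(123)
b <- sample(1:10, size = 3)

identical(a, b)
## [1] TRUE
```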

  • Part A Using statistics found in your sample c_samp, provide an estimate of the sampling distribution for \(\overline{X}\) for the variable ACT_median

  • Part B Construct a 90% confidence interval for the true average value of the median ACT

  • Part C Treating the full college dataset as our entire population, find the true mean value for the average median ACT. Use t.test() to see if our sample is consistent with this mean testing at \(\alpha = 0.05\). What conclusion do you come to?

Problem 3

This problem involves finding a difference in proportions between two groups. The statistical test for a difference in proportions is nearly identical to that for a univariate mean. For example, suppose I want to compare incidence of sickness between drug and placebo. We might first start with a table:

             Sick   Not sick
  Drug          4         21
  Placebo      18          7

In group 1, the drug group, I find the proportion of individuals who were sick to be \(p_1 = \frac{4}{25}\), whereas for group 2, the placebo group, I find the proportion to be \(p_2 = \frac{18}{25}\). According to the central limit theorem, the difference in these proportions will be approximately normal with

\[ \hat{p}_1 - \hat{p}_2 \sim N \left(p_1 - p_2, \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \right) \] When constructing a \(t\)-statistic from this difference, the statistic \(t = \frac{\text{Mean}}{\text{Standard Error}}\) will have \(df = n_1 + n_2 - 2\) degrees of freedom.

We have not yet learned to do this using t.test(), but it will be worthwhile to understand how to do this on paper and with a calculator.
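As an illustration using the drug/placebo table above (where each group has 25 individuals), the calculation might look like:

```r
## Proportions sick in each group
p1 <- 4/25    # drug group
p2 <- 18/25   # placebo group
n1 <- 25
n2 <- 25

## Standard error of the difference in proportions
se <- sqrt( p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2 )

## t-statistic, with df = n1 + n2 - 2 = 48 degrees of freedom
(p1 - p2) / se
## [1] -4.8305
```

Comparing this statistic against critical values from qt() with df = 48 would complete the test.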


Here, we will again start by taking a random sample of our college dataset, this time of size 50

## Controls the randomness in the sampling step
set.seed(1)

## Sample 50 row numbers
idx <- sample(1:nrow(college), size = 50)

## Only keep those rows that were sampled
c_samp <- college[idx, ]

Use the sample to answer the following questions.

  • Part A First, use the appropriate dplyr function, along with ifelse(), to construct a new variable called Size that takes the value “Small” for colleges with enrollment less than 2500 and “Large” for colleges with enrollment greater than or equal to 2500

  • Part B Create a table using the table() function to show the number of public and private schools that are considered large and small

  • Part C From your table, how many colleges in your sample are Private? How many are Public? What proportion of each is considered a large school?

  • Part D Create a confidence interval of a size of your choosing with the appropriate critical values that provides a range of reasonable values for the true difference in proportion between public and private schools. Based on this interval, does it seem like a true difference in proportions is equal to zero?

  • Part E Find the \(t\)-statistic associated with the observed difference in proportions.