## Copy and paste the code chunk below into the top of your R Markdown document
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
This lab is intended to serve as an opportunity to practice some old skills and learn some new ones. In particular, our goals are to practice using:

- qnorm() and qt()
- dplyr to create variables for analysis
- R as a calculator (order of operations)

Briefly, recall that the qnorm() and qt() functions are used to find the critical values associated with particular quantiles of a distribution.
## Quantiles for 80% CI for a standard normal
qnorm(c(0.1, 0.9))
## [1] -1.2816 1.2816
## For t-distribution, need to include df
qt(c(0.1, 0.9), df = 15)
## [1] -1.3406 1.3406
Because the critical values will always be negatives of each other, we can shorten this and simply ask for the positive (or negative) one:
qnorm(0.95)
## [1] 1.6449
This tells me that my critical values for a 90% confidence interval are -1.645 and 1.645.
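As a brief aside (not part of the lab itself), it is worth seeing that the \(t\) critical values shrink toward the normal ones as the degrees of freedom grow, which is why qt() needs a df argument while qnorm() does not:

```r
## t critical values shrink toward the normal value as df grows
qt(0.95, df = 5)     # 2.015
qt(0.95, df = 30)    # 1.697
qt(0.95, df = 1000)  # 1.646
qnorm(0.95)          # 1.645
```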
dplyr

The reference lab can be found here.
There are many ways to go about testing hypotheses and constructing
confidence intervals in R. Here we will explore two. The first involves
a manual construction; we find the critical values, determine the
standard error, and compute the associated values. While this is a bit
more laborious, it is computationally very quick and lets us easily
save or validate our pieces as we go. The second method introduces
t.test(), a function built into R that allows us to quickly
perform all kinds of t-tests.
This may be obvious, but it is worthwhile understanding convenient ways to use R as a calculator. For example, we saw in the lecture slides (and homework) that the Johns Hopkins preterm study had 39 total babies, of which 31 survived at 6 months. If I wanted a 90% confidence interval for this proportion, I may be tempted to type:
(31/39) + qt(c(0.05, 0.95), df = 38) * sqrt((31/39) * (1 - 31/39) / 39)
## [1] 0.68586 0.90388
This expression is messy, however, and can easily hide errors.
More simply, we can take advantage of the fact that R allows us to assign values to variables. This allows us to compute the pieces of our expression in advance:
## Assign p and n
p <- 31/39
n <- 39
## Standard error formula
se <- sqrt( p * (1-p) / n )
## Critical values
C <- qt(c(0.05, 0.95), df = 38)
## Confidence interval
p + C * se
## [1] 0.68586 0.90388
Note that the usual order of operations (PEMDAS) is preserved.
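To make the precedence explicit, here is a small sketch (toy numbers, not from the lab) showing that multiplication and division happen before addition and subtraction unless parentheses say otherwise:

```r
2 + 3 * 4     # 14: multiplication happens first
(2 + 3) * 4   # 20: parentheses override the default order
1 - 31/39     # division first, giving 1 - p
(1 - 31)/39   # a very different (negative) number!
```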
In R, we can perform hypothesis tests and construct standard
confidence intervals based on the \(t\)-distribution using the function
t.test(). While the t.test() function will
take a number of useful arguments, for now we only concern ourselves
with three:

- x, a numeric vector for which we wish to construct a confidence interval
- mu, a single number indicating the null hypothesis
- conf.level, which gives the confidence level for a specified \(\alpha\) (our confidence interval is \(1 - \alpha\))

The output of the t.test() function will include a
collection of useful information, including the point estimate and
confidence interval for the variable. For example, here we use the
built-in mtcars dataset along with t.test() to
find a 90% confidence interval for average miles per gallon:
t.test(mtcars$mpg, conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
From the output, you should note a few things:
- At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)
- Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.28, 21.90)\)
- All of the information above this is related to hypothesis testing, including the t-statistic, the degrees of freedom, the p-value, and the alternative hypothesis. However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.
By default, the t.test() function assumes a null
hypothesis of \(H_0: \mu_0 = 0\). As
our t-statistic is computed as
\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
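As a check on this formula, we can compute the statistic by hand with the same built-in mtcars data and compare it with the value reported by t.test():

```r
xbar <- mean(mtcars$mpg)    # sample mean
s    <- sd(mtcars$mpg)     # sample standard deviation
n    <- nrow(mtcars)       # sample size (32)
## t-statistic for H0: mu0 = 0
(xbar - 0) / (s / sqrt(n)) # about 18.86, matching the output above
```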
To see a more realistic example, suppose we wanted to investigate the
hypothesis that the average miles per gallon was equal to 18 mpg. That
is, suppose we have \(H_0: \mu_0 =
18\). We could evaluate this hypothesis using the
t.test function with the mu argument:
t.test(x = mtcars$mpg,
mu = 18,
conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
Note that neither our point estimate nor our 90% confidence interval changed. What did change, however, was the calculated t-statistic, as well as the p-value.
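We can verify this directly by extracting the conf.int element from the object each call returns and checking that the two intervals match:

```r
## The confidence interval does not depend on the null hypothesis mu
ci1 <- t.test(mtcars$mpg, conf.level = 0.9)$conf.int
ci2 <- t.test(mtcars$mpg, mu = 18, conf.level = 0.9)$conf.int
all.equal(ci1, ci2)  # TRUE
```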
Below is code to recreate the two subsets of the hawk data containing only Red-tailed hawks:
## RT Hawk data
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
hawks <- filter(hawks, Species == "RT")
## Subset of RT Hawk data
set.seed(89)
idx <- sample(seq_len(nrow(hawks)), size = 20)
hawks2 <- hawks[idx, ]
Part A: Perform a t-test on both
hawks and hawks2 to test the hypothesis that
the variable Weight has a mean value of \(\mu_0 = 1050\). According to the output,
what is the sample mean for weight for each sample? If you were testing
at the \(\alpha = 0.05\) level, what
would you conclude for each test?
Part B: Explain why the conclusions you came to were different in Part A.
The following code takes a sample of size \(n = 20\) from the college dataset. The
function set.seed() takes an integer value that, when run,
ensures that any random operations we perform (such as sampling) will
take the same value each time. Here, we set our seed to be 123. Copy all
of the code below into your own document:
## Controls the randomness in the sampling step
set.seed(123)
## Sample 20 row numbers
idx <- sample(1:nrow(college), size = 20)
## Only keep those rows that were sampled
c_samp <- college[idx, ]
Done this way, we can all be sure that we are working with an identical sample.
Part A: Using statistics found in your sample
c_samp, provide an estimate of the sampling distribution
for \(\overline{X}\) for the variable
ACT_median.
Part B: Construct a 90% confidence interval for the true average value of the median ACT.
Part C: Treating the full college
dataset as our entire population, find the true mean value of the
median ACT. Use t.test() to see if our sample is
consistent with this mean, testing at \(\alpha
= 0.05\). What conclusion do you come to?
This problem involves finding a difference in proportions between two groups. The statistical test for a difference in proportions is nearly identical to that for a univariate mean. For example, suppose I want to compare incidence of sickness between drug and placebo. We might first start with a table:
|         | Sick | Not sick |
|---------|------|----------|
| Drug    | 4    | 21       |
| Placebo | 18   | 7        |
In group 1, the drug group, I find the proportion of individuals who were sick to be \(p_1 = \frac{4}{25}\), whereas for group 2, the placebo group, I find the proportion to be \(p_2 = \frac{18}{25}\). According to the central limit theorem, the difference in these proportions will be approximately normal with
\[ \hat{p}_1 - \hat{p}_2 \sim N \left(p_1 - p_2, \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \right) \] When constructing a \(t\)-statistic from this difference, the statistic \(t = \frac{\text{Mean}}{\text{Standard Error}}\) will have \(df = n_1 + n_2 - 2\) degrees of freedom.
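For example, the drug/placebo calculation above can be carried out by hand as follows (a sketch of the "on paper" approach; the choice of a 95% interval here is arbitrary):

```r
p1 <- 4/25;  n1 <- 25  # proportion sick in the drug group
p2 <- 18/25; n2 <- 25  # proportion sick in the placebo group
## Standard error of the difference in proportions
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
## t-statistic for H0: p1 - p2 = 0
(p1 - p2) / se  # about -4.83
## 95% confidence interval for the difference
(p1 - p2) + qt(c(0.025, 0.975), df = n1 + n2 - 2) * se
```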
We have not yet learned to do this using t.test(), but
it will be worthwhile to understand how to do this on paper and with a
calculator.
Here, we will again start by taking a random sample of our college dataset, this time of size 50
## Controls the randomness in the sampling step
set.seed(1)
## Sample 50 row numbers
idx <- sample(1:nrow(college), size = 50)
## Only keep those rows that were sampled
c_samp <- college[idx, ]
Use the sample to answer the following questions.
Part A: First, use the appropriate dplyr
function, along with ifelse(), to construct a new variable
called Size that takes the value “Small” for colleges with
enrollment less than 2500 and “Large” for colleges with enrollment
greater than or equal to 2500.
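If ifelse() is new to you, here is a minimal sketch on a toy vector (not the college data) showing how it maps a logical condition to one of two labels, elementwise:

```r
## ifelse(condition, value_if_true, value_if_false)
x <- c(1200, 4000, 2500)
ifelse(x < 2500, "Small", "Large")  # "Small" "Large" "Large"
```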
Part B: Create a table using the
table() function to show the number of public and private
schools that are considered large and small.
Part C: From your table, how many colleges in your sample are Private? How many are Public? What proportion of each is considered a large school?
Part D: Create a confidence interval of a size of your choosing, with the appropriate critical values, that provides a range of reasonable values for the true difference in proportions between public and private schools. Based on this interval, does it seem like the true difference in proportions is equal to zero?
Part E: Find the \(t\)-statistic associated with the observed difference in proportions.