library(ggplot2)
library(dplyr)
theme_set(theme_bw())

## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n),
             Statistic = bs)
}
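Neither helper is used just yet, but here is a quick sketch of how they fit together, bootstrapping the mean of a numeric vector and plotting the bootstrap distribution (the choice of mtcars$mpg and 30 bins is ours, purely for illustration):
## Bootstrap the sample mean of mpg and plot the bootstrap distribution
boot_means <- bootstrap(mtcars$mpg, mean)
ggplot(boot_means, aes(x = Statistic)) + gh(bins = 30)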
This lab is intended to serve as an introduction to hypothesis testing. Where available, we will use built-in functions in R to find and report p-values. In particular, we will investigate finding p-values for proportions, differences in proportions, odds ratios, means, and differences of means.
A quick note on these tests and statistical tests done in R generally – much of what we have introduced and talked about in class describes the basic theoretical justification for hypothesis testing. That is, establish an assumption via the null hypothesis, construct a distribution of values we might see under the null, and then determine whether our observed data is inconsistent with what we had assumed. This general procedure underlies much of statistical inference.
As is often the case in many disciplines, there are typically several ways to solve a problem. An example that came up for some students involves the question of whether \(p_0\) or \(\hat{p}\) should be used in the construction of the standard error. Both are valid constructions, and though they offer slightly different results, asymptotically (that is as \(n \rightarrow \infty\)) they lead to the same conclusion.
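As a concrete sketch of this distinction, consider the preterm-infant numbers used later in this lab (\(x = 31\), \(n = 39\), \(p_0 = 0.7\)); the two standard errors, and the z-statistics they produce, differ only slightly:
## Two valid standard errors for a one-sample proportion test
phat <- 31/39; p0 <- 0.7; n <- 39
se_null <- sqrt(p0 * (1 - p0) / n)      # SE built from the null value p0
se_hat <- sqrt(phat * (1 - phat) / n)   # SE built from the estimate phat
c(z_null = (phat - p0) / se_null, z_hat = (phat - p0) / se_hat)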
This will be the case for many of the functions we will use in R for hypothesis testing; the solutions that you find here will not necessarily match what you would find by hand following the procedures we introduce in class. As such, guidance for answering questions should be as follows:
Unless a question explicitly asks you to compute something by hand, use the R functions and report their output, even when it differs slightly from the in-class procedure. For example, a confidence interval computed by hand with the normal approximation will be symmetric about \(\hat{p}\), while one computed with binom.test() will not. However, the impact of sample size on the width of intervals will act the same in both cases.
This set of tests involves collecting data on a single group of observations and testing various hypotheses.
While we are able to use our normality assumptions in constructing confidence intervals and p-values for single proportions, we are also able to compute them exactly using the theoretical properties of the true distribution. That is, while the count of heads in successive coin flips becomes approximately normal as the number of flips increases, we can also utilize the fact that coin flipping follows a binomial distribution. As such, we can use an exact binomial test to compute our confidence intervals and p-values rather than relying on a normal approximation. This is done with the R function binom.test(). This function takes several arguments, which you can learn more about with ?binom.test.
For now, let’s consider again the Johns Hopkins study that found 31 of 39 infants born preterm survived to 6 months. Here, we want to test the hypothesis \(H_0: p_0 = 0.7\). To do this, we will use binom.test() with the arguments:

- x – the number of “successes” we observed
- n – the total number of observations
- p – our hypothesized proportion

The function will print out a litany of useful information:
binom.test(x = 31, n = 39, p = 0.7)
##
## Exact binomial test
##
## data: 31 and 39
## number of successes = 31, number of trials = 39, p-value = 0.22
## alternative hypothesis: true probability of success is not equal to 0.7
## 95 percent confidence interval:
## 0.63536 0.90704
## sample estimates:
## probability of success
## 0.79487
First, observe the bottom line, “probability of success,” which is simply our observed test statistic, \(\hat{p} = 31/39\). Note, however, that the \(p\)-value found here is not the same as what we found in the slides. This is a consequence of the fact that here we are utilizing the binomial distribution itself rather than our normal approximation.
Just as the \(p\)-values change, so too do our confidence intervals. For example, in our slides Wednesday we found a 95% confidence interval to be (0.668, 0.922), while here they are slightly different, being (0.635, 0.907).
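To see where the slide values come from, here is a minimal sketch of the normal-approximation calculation, using \(p_0\) in the standard error for the test statistic and \(\hat{p}\) for the interval; the interval it produces matches the (0.668, 0.922) quoted above:
## Normal-approximation p-value and 95% CI for comparison
phat <- 31/39; p0 <- 0.7; n <- 39
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
2 * pnorm(-abs(z))  # approximate two-sided p-value
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)  # approximate 95% CI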
It is worth noting that we can also use binom.test() for the construction of confidence intervals. By default, the interval printed out is a 95% interval, but we could just as well find an 80% confidence interval using the argument conf.level:
binom.test(x = 31, n = 39, p = 0.7, conf.level = 0.8)
##
## Exact binomial test
##
## data: 31 and 39
## number of successes = 31, number of trials = 39, p-value = 0.22
## alternative hypothesis: true probability of success is not equal to 0.7
## 80 percent confidence interval:
## 0.68823 0.87671
## sample estimates:
## probability of success
## 0.79487
Question 1: We see above that changing our confidence interval to 80% from 95% changed the interval, but not the resulting p-value. Why is this the case?
Question 2: You might also notice that the confidence interval we found using binom.test() is no longer symmetric around the estimate \(\hat{p}\). Why is this the case?
Question 3: There is an intimate relationship between p-values and confidence intervals in that the exact same attributes of our sample data are used to construct each. Presented below are two scenarios:
Using the binom.test() function, create 95% confidence intervals and construct a \(p\)-value for the null hypothesis \(p_0 = 0.5\). Then answer the following:
In R, we can perform hypothesis tests and create standard confidence intervals based on the \(t\)-distribution using the function t.test(). While the t.test() function takes a number of useful arguments that we will use in the future, for now we concern ourselves with only three:

- x, a numeric vector we wish to construct a confidence interval for
- mu, a single number indicating the null hypothesis
- conf.level, which gives the confidence level of the interval (corresponding to a specified \(\alpha\))

The output of the t.test() function will include a collection of useful information, ranging from the point estimate for the mean value of x to a confidence interval for this value. For example, consider the built-in dataset mtcars in R. Here, we use t.test() to find a 90% confidence interval for average miles per gallon:
t.test(mtcars$mpg, conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
From the output, you should note a few things:

- At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)
- Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.28, 21.90)\)
- All of the information above this is related to hypothesis testing. This includes the t-statistic, the degrees of freedom, the p-value, and the statement of the alternative hypothesis. However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.
By default, the t.test() function assumes a null hypothesis of \(H_0: \mu_0 = 0\). As our t-statistic is computed as

\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \]

this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
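As a quick check, this statistic can be reproduced by hand from the formula (the names xbar, s, and t_stat below are ours, not part of any package):
## One-sample t-statistic for mtcars$mpg under H0: mu0 = 0
xbar <- mean(mtcars$mpg)
s <- sd(mtcars$mpg)
n <- nrow(mtcars)
t_stat <- (xbar - 0) / (s / sqrt(n))  # roughly 18.9
2 * pt(-abs(t_stat), df = n - 1)      # two-sided p-value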
To see a more realistic example, suppose we wanted to investigate the hypothesis that the average miles per gallon was equal to 18 mpg. That is, suppose we have \(H_0: \mu_0 = 18\). We could evaluate this hypothesis using the t.test() function with the mu argument:
t.test(x = mtcars$mpg,
mu = 18,
conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
In this example, note that our point estimate did not change, nor did our 90% confidence interval. What did change, however, was the calculated t-statistic as well as the p-value.
Question 4: Explain why both the t-statistic and the p-value changed in our second call of t.test() while the point estimate and the confidence interval stayed the same.
Question 5: Using the subsets of the data below, find and report 95% confidence intervals for total enrollment at public and private colleges.
library(dplyr)
college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
col_priv <- filter(college, Private == "Private")
col_pub <- filter(college, Private == "Public")
Hypothesis tests and confidence intervals for correlation can be computed with the function cor.test(). For example, using our college data, suppose we are interested in evaluating the correlation between net tuition and average faculty salary:
cor.test(college$Net_Tuition, college$Avg_Fac_Salary)
##
## Pearson's product-moment correlation
##
## data: college$Net_Tuition and college$Avg_Fac_Salary
## t = 19.6, df = 1648, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.39538 0.47364
## sample estimates:
## cor
## 0.43534
Just as with the other testing functions, we can modify the range of our confidence interval with the argument conf.level:
cor.test(college$Net_Tuition, college$Avg_Fac_Salary, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: college$Net_Tuition and college$Avg_Fac_Salary
## t = 19.6, df = 1648, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.40939 0.46057
## sample estimates:
## cor
## 0.43534
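As an aside, each of these testing functions returns a result that can be stored, with individual components extracted later; a small sketch using the test above:
## Store the test result and pull out individual components
ct <- cor.test(college$Net_Tuition, college$Avg_Fac_Salary, conf.level = 0.8)
ct$estimate  # the sample correlation
ct$conf.int  # the 80% confidence interval
ct$p.value   # the p-value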
Question 6: Using the ToothGrowth data built in to R that we have seen previously, construct a 90% confidence interval for the correlation between dose of Vitamin C and tooth length. What is the null hypothesis when evaluating if two variables are correlated? Does this data seem to support or refute our null hypothesis?
Question 7: Revisit the lab on ggplot to refresh how to use a data.frame to create a plot in R. For the ToothGrowth dataset, create a scatter plot showing the relationship between “dose” on the x-axis and “len” (length) on the y-axis. Does this plot seem consistent with your conclusion in Question 6? (Hint: Use geom_jitter() instead of geom_point() for a slightly nicer looking plot)
These tests involve comparing proportions or means between two groups.
We have seen previously hypothesis testing for a single proportion, for example, establishing the null hypothesis \(H_0: p = p_0\) when evaluating the outcome of a series of binary events, i.e., what proportion of severely preterm infants survive past 6 months? A related question seeks to compare the proportion of events between two separate groups. For example, consider giving a drug and a placebo to two groups of patients and investigating the proportion of recipients who develop negative side effects: if the proportion of side effects is the same between the two groups, we may conclude that there is no association between the drug and side effects. Likewise, if this proportion is different, we may conclude that there is an association between group membership and the outcome.
One method for investigating this is examining the difference in proportions. For groups 1 and 2 with proportions \(p_1\) and \(p_2\), respectively, we might frame the null hypothesis that there is no difference as
\[ H_0: p_1 - p_2 = 0 \] Like with the single proportion case, the central limit theorem also applies to the difference in proportions, with the approximate distribution being
\[ \hat{p}_1 - \hat{p}_2 \sim N \left(p_1 - p_2, \ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right) \] where \(n_1\) and \(n_2\) represent the number of observations in each group. Note that the standard error term combines the variances of both groups; we will return to this idea of combining group variances when reviewing the pooled variance in the two-sample t-test.
Again as before, a confidence interval around this estimate takes the form of
\[ (\hat{p}_1 - \hat{p}_2) \pm C \times \text{SE} \]
For our example here, consider again the Titanic dataset in R, where we investigate the number of passengers and crew members of each sex who survived or died:
## Titanic data in R
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL
# Table
with(titanic, table(Sex, Survived))
## Survived
## Sex No Yes
## Male 1364 367
## Female 126 344
Of the males, we find the proportion who survived to be \(\hat{p}_1 = \frac{367}{1731}\), whereas the proportion for females is \(\hat{p}_2 = \frac{344}{470}\). Implied in this are our group sizes, \(n_1 = 1731\) and \(n_2 = 470\). To test the difference in proportions, we will use the prop.test() function, similar to binom.test(), but now passing in a vector indicating the number of “successes” in each group and a vector of the group totals:
# Difference in proportion test for titanic data
prop.test(x = c(367, 344), n = c(1731, 470))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(367, 344) out of c(1731, 470)
## X-squared = 454, df = 1, p-value <0.0000000000000002
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.56569 -0.47411
## sample estimates:
## prop 1 prop 2
## 0.21202 0.73191
There are a few things from this output to notice. From the top:

- The test statistic reported is X-squared (a \(\chi^2\) statistic) rather than a z-statistic, though the resulting p-value is interpreted the same way
- The alternative hypothesis is two-sided by default
- The 95% confidence interval is for the difference \(p_1 - p_2\), and its level can be changed with the argument conf.level
- The sample estimates give the observed proportion in each group
Generally speaking, from these tests we will be interested in identifying the p-value, the confidence interval, and possibly estimates of each of the relevant proportions.
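As a check on the confidence interval formula given earlier, here is a minimal sketch of the normal-approximation interval computed by hand; it differs slightly from the prop.test() output above because prop.test() applies a continuity correction by default:
## 95% CI for the difference in survival proportions, by hand
p1 <- 367/1731; p2 <- 344/470
se <- sqrt(p1 * (1 - p1) / 1731 + p2 * (1 - p2) / 470)
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se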
Question 8: The results of the clinical trial for the Pfizer Covid-19 mRNA injections were published in December 2020. The study consisted of a sample of 43,548 individuals: 21,720 were given the BNT162b2 injection and 21,728 were given a placebo. After a period of 6 months, 8 individuals in the drug group exhibited symptoms of Covid-19, while the number in the placebo group was 162.
Just as with the one-sample test, two-sample t-tests are done with the function t.test(). There are two ways to use this function, depending on how your data is stored.

The simplest case involves passing two independent vectors. For example, here we have a vector x of length 50 and a vector y of length 20. To conduct a t-test on the difference of these two vectors, it’s enough to simply pass them in as the first two arguments:
## Random vectors x and y
set.seed(123)
x <- rnorm(50, mean = 20, sd = 2)
y <- rnorm(20, mean = 15, sd = 1)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 15, df = 66, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.2438 5.5491
## sample estimates:
## mean of x mean of y
## 20.069 15.172
As before, you’ll find that the t.test() function returns a t-statistic, degrees of freedom, a p-value, a confidence interval, and then estimates of the mean of each group. You’ll also note in this case that the heading at the top indicates that this is a “Welch Two Sample t-test,” corresponding to the fact that it makes no assumptions about the standard deviations of your samples being the same.
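If you are instead willing to assume that the two groups share a common standard deviation, the classic (Student’s) two-sample t-test can be requested with the var.equal argument:
## Student's t-test, which pools the two sample variances
t.test(x, y, var.equal = TRUE)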
The next two cases involve data that is stored in a data.frame. First, consider the case where each of the variables we wish to test is stored as a column in a data.frame:
## Pretend data set
set.seed(123)
df <- data.frame(A = rnorm(5, 1, 1),
B = rnorm(5, 0, 1))
print(df)
## A B
## 1 0.43952 1.71506
## 2 0.76982 0.46092
## 3 2.55871 -1.26506
## 4 1.07051 -0.68685
## 5 1.12929 -0.44566
If the two variables that you wish to test are both columns in a data.frame, you can use the $ operator to pass them in like vectors:
## Pass in A and B columns
t.test(df$A, df$B)
##
## Welch Two Sample t-test
##
## data: df$A and df$B
## t = 1.95, df = 7.14, p-value = 0.091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.25577 2.73155
## sample estimates:
## mean of x mean of y
## 1.193570 -0.044319
Implicit in this data structure is the assumption that these variables are the same length. More frequently, you have a data.frame that contains a column indicating groups and a second column indicating the values we want to test. For example, consider the format of the following example data:
set.seed(123)
df <- data.frame(Group = rep(c("A", "B"), times = c(6, 5)),
Value = c(rnorm(6, 0, 1), rnorm(5, 0, 1)))
print(df)
## Group Value
## 1 A -0.560476
## 2 A -0.230177
## 3 A 1.558708
## 4 A 0.070508
## 5 A 0.129288
## 6 A 1.715065
## 7 B 0.460916
## 8 B -1.265061
## 9 B -0.686853
## 10 B -0.445662
## 11 B 1.224082
Here, there is no simple way to separate the values associated with group A from those with group B; indeed, to use the t.test() function as we did before would require creating multiple subsets of the data.frame.

Alternatively, we can use a formula syntax of the form Value ~ Group, indicating that we want to test the difference in means of the Value column between the groups in the Group column:
t.test(Value ~ Group, data = df)
##
## Welch Two Sample t-test
##
## data: Value by Group
## t = 1, df = 8.53, p-value = 0.34
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -0.75217 1.93150
## sample estimates:
## mean in group A mean in group B
## 0.44715 -0.14252
Next we consider paired testing with the t.test() function. This can be done by adding the argument paired = TRUE when calling the function. Here is the example with the language scores from lecture:
## Pre and post test scores on French language exam
pre <- c(32,31,29,10,30,33,22,25,32,20,30,20,24,24,31,30,15,32,23,23)
post <- c(34,31,35,16,33,36,24,28,26,26,36,26,27,24,32,31,15,34,26,26)
t.test(post, pre, paired = TRUE)
##
## Paired t-test
##
## data: post and pre
## t = 3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 1.1461 3.8539
## sample estimates:
## mean difference
## 2.5
Because the t-test is symmetric, the order in which we pass these in doesn’t matter, though it is worth noting that switching the order will change the sign of the confidence interval and t-statistic, but it won’t change the p-value:
t.test(pre, post, paired = TRUE)
##
## Paired t-test
##
## data: pre and post
## t = -3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -3.8539 -1.1461
## sample estimates:
## mean difference
## -2.5
Question 9: Using the college dataset, conduct a two-sample t-test on the difference in means for the variable ACT_median between public and private schools. Modify the arguments to the t.test() function so that it returns an estimate of the 99% confidence interval. Testing at the \(\alpha = 0.01\) level, would you reject or fail to reject the hypothesis that the median ACT between public and private schools is the same?
college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Question 10: Again using the college dataset, this time conduct a two-sample t-test on the four-year completion rate of males and females (FourYearComp_Females and FourYearComp_Males). What would you conclude from the results of your test?
Question 11: Using the pre and post variables for the French language test created above, create a new vector difference that is equal to the difference between the two. Then, do a one-sample t-test on the new difference vector, testing the null hypothesis that the difference is equal to zero. How do the results of this test compare with the paired t-test done above?
This portion of the lab deals with tests of goodness-of-fit and tests of independence for categorical variables. Both tests are performed with the construction of the \(\chi^2\) test statistic, computed as

\[ \chi^2 = \sum_{i=1}^k \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \]

where \(i = 1, \dots, k\) iterates through each of the cells of the table and where the expected counts are what we should expect to see under the null hypothesis. The resulting test statistic follows a \(\chi^2\) distribution, with degrees of freedom calculated to be \(k - 1\) for a goodness-of-fit test with \(k\) categories, and \((\text{rows} - 1) \times (\text{columns} - 1)\) for a test of independence on a two-way table.
In both instances, we can compute this test statistic in R using the chisq.test() function, which will give us a value for the test statistic, the degrees of freedom, and the resulting p-value under the null hypothesis.
The one categorical variable case, in the context of a \(\chi^2\) test (note: you can write \(\chi^2\) in R Markdown by typing \(\text{\$\\chi^2\$}\)), is the goodness-of-fit test. To use the chisq.test() function, all we need is a vector that includes the counts of each of our categories, along with the proportions we should expect to see under the null hypothesis.
For example, consider again our example from class in which we wish to create a standardized exam with 400 questions, each question having multiple-choice answers A-E. If we wish to create an exam in which each answer is equally likely to be correct, we would have a null hypothesis of the form

\[ H_0: p_A = p_B = p_C = p_D = p_E = 0.2 \] where the expected counts should be (under the null):
A | B | C | D | E |
---|---|---|---|---|
80 | 80 | 80 | 80 | 80 |
In class, we considered an example where the observed distribution had the following counts:
A | B | C | D | E |
---|---|---|---|---|
74 | 90 | 76 | 87 | 73 |
To conduct a \(\chi^2\) test that our observed values were drawn from the null distribution, we create a vector with these values, along with the associated probabilities:
## Create obs variable with the observed counts
obs <- c(74,90,76,87,73)
## Pass this into chisq.test, with null probabilities
chisq.test(obs, p = c(0.2,0.2,0.2,0.2,0.2))
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 3.12, df = 4, p-value = 0.54
By default, chisq.test() will assign an equal probability to each of the outcomes. In this case, because our null hypothesis sets each of these probabilities to be equal, we can omit the probability argument when calling the function:
chisq.test(obs)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 3.12, df = 4, p-value = 0.54
In each case, we find a test statistic of \(\chi^2 = 3.12\) and a \(p\)-value of \(p = 0.54\).
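To connect this output back to the \(\chi^2\) formula above, here is a minimal sketch of the same statistic computed by hand (exp_counts is our own name for the vector of expected counts):
## Chi-squared statistic and p-value computed from the formula
exp_counts <- rep(0.2 * 400, 5)               # expected counts under the null
x2 <- sum((obs - exp_counts)^2 / exp_counts)  # 3.125
pchisq(x2, df = 5 - 1, lower.tail = FALSE)    # about 0.54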
Contrast this with the example we gave for Alameda County, where we investigated the observed counts of each race/ethnicity in the jury pool, which we compared against proportions estimated from US Census data. Because our null hypothesis in this case no longer assumes that our expected proportions are equal, we have to pass them in ourselves. The values here are taken from the lecture slides:
## Observed jury pool counts by race/ethnicity
jury <- c(780, 117, 114, 384, 58)
chisq.test(jury, p = c(0.54, 0.18, 0.12, 0.15, 0.01))
##
## Chi-squared test for given probabilities
##
## data: jury
## X-squared = 357, df = 4, p-value <0.0000000000000002
Again, we get the value of the \(\chi^2\) statistic, along with degrees of freedom and a \(p\)-value.
Question 12: Just as a normal distribution describes phenomena in which observations tend to fall symmetrically about the mean, the Poisson distribution is a probability distribution used to describe the number of discrete events we should expect to occur within a period of time (i.e., how many students walk into the student union each hour). This question involves assessing whether or not a particular collection of data is consistent with a Poisson distribution.
In 1898, Ladislaus von Bortkiewicz collected data on the number of soldiers in the Prussian army who were kicked to death by mules. The data collected follows 10 corps, each observed for 20 years, resulting in a total of 200 “corps-years”. In 109 corps-years, there were no deaths from mule kicks; in 65 corps-years, there was one death; and so on. Below is a table with all of the observed data:
Number of Deaths | Number of Corps-Years |
---|---|
0 | 109 |
1 | 65 |
2 | 22 |
3+ | 4 |
If the number of deaths via mule kick did match a Poisson distribution, the expected number of corps-years with each death count should have the following proportions:
0 deaths | 1 death | 2 deaths | 3+ deaths |
---|---|---|---|
0.55 | 0.33 | 0.1 | 0.02 |
Conduct a \(\chi^2\) test to determine whether or not the observed data is consistent with what we should expect to see if death-by-mule-kick truly did follow a Poisson distribution. What \(p\)-value do you find, and what does this say about your null hypothesis? (note: you may ignore the warning about expected counts)
When considering two categorical variables, we are (usually) less interested in assessing goodness-of-fit in the expected counts of two variables and more interested in determining if there is any sort of association between the two variables. In other words, for a single observation, does knowing the value of one categorical variable give us information about the other?
To an extent, we have seen this already when considering two-way tables of categorical variables. Consider, for example, the data we had seen on auto accidents in Florida, recording whether each accident was fatal or non-fatal, along with seatbelt use. That data is summarized again in the following table:
Fatal | Non-fatal | |
---|---|---|
No Seatbelt | 1085 | 55623 |
Seatbelt | 703 | 441239 |
For our \(\chi^2\) test, we need only pass a matrix or table of these values and chisq.test() will automatically perform a test of independence:
seatbelt <- matrix(c(1085, 703, 55623, 441239), nrow = 2)
chisq.test(seatbelt) # we do not need to add proportions here
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: seatbelt
## X-squared = 4324, df = 1, p-value <0.0000000000000002
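If you are curious about the expected counts that this test compares against, they can be extracted from a stored result; a quick sketch:
## Expected counts under independence of seatbelt use and outcome
res <- chisq.test(seatbelt)
res$expected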
Question 13: This question will use the college dataset loaded below (due to a small error, it is slightly different than the one used above)
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Using the table() function, create a 2-way table of these two variables. Save this table to a new variable called tab.