This lab will introduce the idea of hypothesis testing with a particular type of test known as a \(t\)-test. By t-test, we simply mean performing a hypothesis test in which we construct a test statistic, \(t\) that follows a t-distribution.
In R, we can create perform hypothesis testing and create standard
confidence intervals based on the \(t\)-distribution using the function
t.test()
. While the t.test()
function will
take a number of useful arguments that we will use in the future, for
now we only concern ourselves with three:
x
, a numeric vector we wish to construct a confidence
interval formu
, a single number indicating the null hypothesisconf.level
which gives the confidence interval for a
specified \(\alpha\) (i.e., our
confidence interval is \(1 -
\alpha\))The output of the t.test()
function will include a
collection of useful information, ranging from the point estimate for
the mean value of x
, as well as a confidence interval for
this value. For example, consider the built-in dataset
mtcars
in R. Here, we use t.test()
to find a
90% confidence interval for average miles per gallon:
t.test(mtcars$mpg, conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
From the output, you should note a few things:
At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)
Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.24, 21.89)\)
All of the information above this is related to hypothesis testing. This includes:
However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.
By default, the t.test()
function assumes a null
hypothesis of \(H_0: \mu_0 = 0\). As
our t-statistic is computed as
\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
To see a more realistic example, suppose we wanted to investigate the
hypothesis that the average miles per gallon was equal to 16 mpg. That
is, suppose we have \(H_0: \mu_0 =
18\). We could evaluate this hypothesis using the
t.test
function with the mu
argument:
t.test(x = mtcars$mpg,
mu = 18,
conf.level = 0.9)
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
## 18.284 21.897
## sample estimates:
## mean of x
## 20.091
In this example, note that our point estimate did not change, nor did our 90% confidence interval. What did change, however, was the calculated t-statistic as well as the p-value.
Question 1: Explain why both the t-statistic and the
p-value changed in our second call of t.test()
while the
point estimate and the confidence interval stayed the same.
Question 2: Using the subsets of the data below, find and report 95% confidence intervals for total enrollment at public and private colleges.
library(dplyr)
college <- read.csv("https://collinn.github.io/data/college2019.csv")
col_priv <- filter(college, Type == "Private")
col_pub <- filter(college, Type == "Public")
Question 3: Below is code to recreate the two subsets of the hawk data containing only Red-tailed hawks
## RT Hawk data
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
hawks <- filter(hawks, Species == "RT")
## Subset of RT Hawk data
set.seed(89)
idx <- sample(seq_len(nrow(hawks)), size = 20)
hawks2 <- hawks[idx, ]
Part A: Find the mean body weight for Red-tailed
hawks in both the hawks
and hawks2
datasets.
How do they compare?
Part B: Using the average weight of the Red-tailed’s hawks, perform a t-test on each dataset to test the hypothesis that \(\mu_0 = 1050\). If you were testing at the \(\alpha = 0.05\) level, which would you conclude?
Part C: Explain why the conclusions you came to were different in Part C.
Using R to perform two sample \(t\)-tests is an immediate extension from
the one variable case in that the same function is used, along with the
same arguments. Before, we had to be careful to specify the function
argument mu
to state what our null hypothesis was; in the
two sample case, however, the default value of mu = 0
will
typically be what we want.
There are two ways to pass an argument to the t.test()
function for two sample data, depending on how the data is arranged. The
first type of data arrangement, called “long format” describe a
situation in which one column has the group name and the other has the
value we are interested in. In the mtcars
dataset, for
example, we see that the am
column designates if the
vehicle has automatic or manual transmission, while the
qsec
column gives the quarter mile time of a vehicle
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
To use this data in the t.test()
function, we will use a
formula syntax that is of the form outcome ~ group
, along
with an argument showin which dataset we are using:
t.test(qsec ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: qsec by am
## t = 1.29, df = 25.5, p-value = 0.21
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.49185 2.13817
## sample estimates:
## mean in group 0 mean in group 1
## 18.183 17.360
The other method is appropriate when the two values we want to compare each exist in their own column. For example, the college dataset has one column each for male and female four-year graduation rates. To test for a difference in these, we would pass each in separately:
t.test(college$FourYearComp_Males, college$FourYearComp_Females)
##
## Welch Two Sample t-test
##
## data: college$FourYearComp_Males and college$FourYearComp_Females
## t = -10.4, df = 2185, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.085437 -0.058291
## sample estimates:
## mean of x mean of y
## 0.47622 0.54808
Here, we see compelling evidence that the four year graduation rate for women is not equal to the four year graduation rate for men.
Question 4 For this question, we will be using the sandwhich dataset which includes the number of ants found on a sandwich left out at a picnic according to the toppings, bread, and butter that was on it
sandwich <- read.csv("https://collinn.github.io/data/sandwich.csv")
Perform a two sample t-test to determine if the mean value of ants on a sandwhich was the same for sandwiches that had butter compared to those that did not
Question 5 For this question we will be using the hawks dataset
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
## (Do this for part 4)
hawks <- filter(hawks, Hallux < 50)
dplyr
to filter
the datset to
only include sharp-shinned hawks, Species == "SS"
filter
your sharp-shinned dataset again to remove
outliers, using the criterion Hallux < 50
We can implement paired t-tests in R by simplying setting
the argument paired = TRUE
in the t.test()
function. Here, for example, is the data for our French language
institute we saw in class:
french <- read.csv("https://collinn.github.io/data/french_institute.csv")
head(french)
## pre post
## 1 32 34
## 2 31 31
## 3 29 35
## 4 10 16
## 5 30 33
## 6 33 36
Because pre and post test scores are each contained in their own
column, we will use the second method of passing in our arguments to the
t.test()
function
t.test(french$pre, french$post, paired = TRUE)
##
## Paired t-test
##
## data: french$pre and french$post
## t = -3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -3.8539 -1.1461
## sample estimates:
## mean difference
## -2.5
Question 6 Use dplyr
and the
mutate()
function to create a new variable in the
french
dataset called diff
. Perform a
one-sample t-test on this variable and compare it with the paired output
above. How do the t-statistic, degrees of freedom, and p-values
compare?
Question 7 Included below is Karl Pearson’s Father/Son height data. Using this dataset, perform a two-sample paired t-test to determine if there was an average difference between father and son height.
height <- read.csv("https://collinn.github.io/data/pearson.csv")
This portion of the labs deals with tests of goodness-of-fit and tests of independence for categorical variables. Both tests are performed with the construction of the \(\chi^2\) test statistic, computed as
\[ \chi^2 = \sum_{i=1}^k \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \] where \(i = 1, \dots, k\) iterates through each of the cells of the table and where the expected counts are what we should expect to see under the null hypothesis. The resulting test statistic follows a \(\chi^2\) distribution, with degrees of freedom calculated to be:
In both instances, we can compute this test statistic in R using the
chisq.test()
function which will give us a value for the
test statistic, the degrees of freedom, and the resulting p-value under
the null hypothesis.
The one categorical variable cases, in the context of a \(\chi^2\) test (note: you can write in
rmarkdown by typing \(\text{\$\\chi^2\$}\)), is the
goodness-of-fit test. To use the chisq.test()
function, all
we need is a vector that includes the counts of each of our categories,
along with the proportions we should expect to see under then null
hypothesis.
For example, consider again our example from class in which we wish to create a standardized exam with 400 questions, each question having multiple choice answers A-E. If we wish to create an exam in which the probability of the answer for each question being equal, we would have a null hypothesis of the form
\[ H_0: p_A = p_B = p_C = p_D = p_E = 0.2 \] where the expected counts should be (under the null):
A | B | C | D | E |
---|---|---|---|---|
80 | 80 | 80 | 80 | 80 |
In class, we considered an example where the observed distribution had the following counts:
A | B | C | D | E |
---|---|---|---|---|
74 | 90 | 76 | 87 | 73 |
To conduct a \(\chi^2\) test that our observed values where drawn from the null distribution, we would create our vector with these values, along with the associated probabilities:
## Create obs variable with the observed counts
obs <- c(74,90,76,87,73)
## Pass this into chisq.test, with null probabilities
chisq.test(obs, p = c(0.2,0.2,0.2,0.2,0.2))
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 3.12, df = 4, p-value = 0.54
By default, chisq.test()
will assign an equal
probability to each of the outcomes. In this case, because our null
hypothesis sets each of these probabilities to be equal, we can omit the
probability argument when calling the function:
chisq.test(obs)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 3.12, df = 4, p-value = 0.54
In each case, we find a test statistic of \(\chi^2 = 3.12\) and a \(p\)-value of \(p = 0.54\).
Contrast this with the example we gave for Alameda county, where we were investigating the observed counts of each race/ethnicity in the jury pooled which we compared against proportions estimated from US Census data. Because our null hypothesis in this case no longer assumes that our expected proportions are equal, we will have to pass them in ourselves. The values here are taken from the lecture slides:
jury <- c(780, 117,114,384,58)
chisq.test(jury, p = c(0.54, 0.18, 0.12, 0.15, 0.01))
##
## Chi-squared test for given probabilities
##
## data: jury
## X-squared = 357, df = 4, p-value <0.0000000000000002
Again, we get the value of the \(\chi^2\) statistic, along with degrees of freedom and a \(p\)-value.
Question 8: Just as a normal distribution describes phenomenon in which observations tend to fall symmetrically about the mean, the Poisson distribution is a probability distribution that is used to describe the number of discrete events we should expect to occur within a period of time (i.e., how many students walk into the student union each hour). This question involves assessing whether or not a particular collection of data is consistent with a Poisson distribution.
In 1898, Ladislaus von Bortkiewicz collected data on the number of soldiers in the Prussian army that kicked to death by mules. The data collected follows 10 corps, each observed for 20 years, resulting in a total of 200 “corp-years”. In 109 corp-years, there were no deaths from mule kicks, in 65 corp-years, there was one death, and so on. Below is a table with all of the observed data
Number of Deaths | Number of Corps-Years |
---|---|
0 | 109 |
1 | 65 |
2 | 22 |
3+ | 4 |
If the number of death via mules did match a Poisson distribution, the expected number of corps-years with each death count should have the following proportions:
0 deaths | 1 deaths | 2 deaths | 3+ deaths |
---|---|---|---|
0.55 | 0.33 | 0.1 | 0.02 |
Conduct a \(\chi^2\) test to determine whether or not the observed data is consistent with what we should expect to see if death-by-mule-kick truly did follow a Poisson distribution. What \(p\)-value do you find, and what does this say about your null hypothesis? (note: you may ignore the warning about expected counts)
When considering two categorical variables, we are (usually) less interested in assessing goodness-of-fit in the expected counts of two variables and more interested in determining if there is any sort of association between the two variables. In other words, for a single observation, does knowing the value of one categorical variable give us information about the other?
To an extent, we have seen this already when considering two-way tables with categorical variables. Consider, for example, the data we had seen on auto accidents in Florida recording the number of fatal and non-fatal fatalities, along with seatbelt use. That data is summarized again here in the following table:
Fatal | Non-fatal | |
---|---|---|
No Seatbelt | 1085 | 55623 |
Seatbelt | 703 | 441239 |
For our \(\chi^2\) test, we need
only pass a matrix or table of these values and
chisq.test()
will automatically perform a test of
independence:
seatbelt <- matrix(c(1085, 703, 55623, 441239), nrow = 2)
chisq.test(seatbelt) # we do not need to add proportions here
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: seatbelt
## X-squared = 4324, df = 1, p-value <0.0000000000000002
Question 9: This question will use the college dataset loaded below
college <- read.csv("https://collinn.github.io/data/college2019.csv")
table()
function,
create a 2-way table of these two variables. Save this table to a new
variable called tab