Introduction

This lab introduces the idea of hypothesis testing with a particular type of test known as a \(t\)-test. By t-test, we simply mean a hypothesis test in which we construct a test statistic, \(t\), that follows a t-distribution.

Testing means

In R, we can perform hypothesis tests and create standard confidence intervals based on the \(t\)-distribution using the function t.test(). While the t.test() function takes a number of useful arguments that we will use in the future, for now we concern ourselves with only three:

  • x, a numeric vector we wish to construct a confidence interval for
  • mu, a single number indicating the value of the mean under the null hypothesis
  • conf.level, which gives the confidence level for a specified \(\alpha\) (i.e., our confidence level is \(1 - \alpha\))

The output of the t.test() function includes a collection of useful information, including the point estimate for the mean value of x as well as a confidence interval for this value. For example, consider the built-in dataset mtcars in R. Here, we use t.test() to find a 90% confidence interval for average miles per gallon:

t.test(mtcars$mpg, conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

From the output, you should note a few things:

  1. At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)

  2. Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.28, 21.90)\)

  3. All of the information above this is related to hypothesis testing. This includes:

    • A \(t\)-statistic, equal to \(t = 18.9\)
    • Degrees of freedom equal to \(df = 31\)
    • A \(p\)-value of \(p < 2 \times 10^{-16}\)

However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.
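
As an aside, the object returned by t.test() is a list, so if we ever need a piece of the output for a later computation, we can extract it by name rather than reading it off the printed summary:

## Store the result and extract individual components by name
res <- t.test(mtcars$mpg, conf.level = 0.9)
res$conf.int  # just the confidence interval
res$estimate  # just the point estimate (mean of x)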

By default, the t.test() function assumes a null hypothesis of \(H_0: \mu = 0\). As our t-statistic is computed as

\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
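
To see where this number comes from, we can compute the statistic by hand (a quick check of the formula, not something you will typically need to do):

## Computing the t-statistic manually for mtcars$mpg
xbar <- mean(mtcars$mpg)   # sample mean
s <- sd(mtcars$mpg)        # sample standard deviation
n <- length(mtcars$mpg)    # sample size
(xbar - 0) / (s / sqrt(n)) # matches t = 18.9 from the output above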

To see a more realistic example, suppose we wanted to investigate the hypothesis that the average miles per gallon was equal to 18 mpg. That is, suppose we have \(H_0: \mu = 18\). We could evaluate this hypothesis using the t.test() function with the mu argument:

t.test(x = mtcars$mpg, 
       mu = 18, 
       conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

In this example, note that our point estimate did not change, nor did our 90% confidence interval. What did change, however, was the calculated t-statistic as well as the p-value.

Question 1: Explain why both the t-statistic and the p-value changed in our second call of t.test() while the point estimate and the confidence interval stayed the same.

Question 2: Using the subsets of the data below, find and report 95% confidence intervals for total enrollment at public and private colleges.

library(dplyr)
college <- read.csv("https://collinn.github.io/data/college2019.csv")

col_priv <- filter(college, Type == "Private")
col_pub <- filter(college, Type == "Public")

Question 3: Below is code to recreate the two subsets of the hawk data containing only Red-tailed hawks:

## RT Hawk data
hawks <- read.csv("https://collinn.github.io/data/hawks.csv")
hawks <- filter(hawks, Species == "RT")

## Subset of RT Hawk data
set.seed(89)
idx <- sample(seq_len(nrow(hawks)), size = 20)
hawks2 <- hawks[idx, ]

  • Part A: Find the mean body weight for Red-tailed hawks in both the hawks and hawks2 datasets. How do they compare?

  • Part B: Using the average weight of the Red-tailed hawks, perform a t-test on each dataset to test the null hypothesis \(H_0: \mu = 1050\). If you were testing at the \(\alpha = 0.05\) level, what would you conclude?

  • Part C: Explain why the conclusions you came to in Part B were different.

Two Sample Tests

Using R to perform two sample \(t\)-tests is an immediate extension of the one sample case in that the same function is used, along with the same arguments. Before, we had to be careful to specify the function argument mu to state our null hypothesis; in the two sample case, however, the default value of mu = 0 will typically be what we want.

Two Sample t-test

There are two ways to pass data to the t.test() function for two sample tests, depending on how the data is arranged. The first arrangement, called “long format,” describes a situation in which one column gives the group name and another gives the value we are interested in. In the mtcars dataset, for example, we see that the am column designates whether the vehicle has an automatic or manual transmission, while the qsec column gives the quarter mile time of the vehicle:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To use this data in the t.test() function, we use a formula syntax of the form outcome ~ group, along with an argument specifying which dataset we are using:

t.test(qsec ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  qsec by am
## t = 1.29, df = 25.5, p-value = 0.21
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.49185  2.13817
## sample estimates:
## mean in group 0 mean in group 1 
##          18.183          17.360
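
Note that the output is labeled “Welch Two Sample t-test”: by default, R uses the Welch version of the test, which does not assume that the two groups have equal variances. If a pooled (equal variance) test is desired instead, t.test() accepts the argument var.equal = TRUE:

## Pooled two sample t-test, assuming equal variances between groups
t.test(qsec ~ am, data = mtcars, var.equal = TRUE)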

The other method is appropriate when the two sets of values we want to compare each exist in their own column. For example, the college dataset has one column each for male and female four-year graduation rates. To test for a difference in these, we would pass each in separately:

t.test(college$FourYearComp_Males, college$FourYearComp_Females)
## 
##  Welch Two Sample t-test
## 
## data:  college$FourYearComp_Males and college$FourYearComp_Females
## t = -10.4, df = 2185, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.085437 -0.058291
## sample estimates:
## mean of x mean of y 
##   0.47622   0.54808

Here, we see compelling evidence that the four-year graduation rate for women is not equal to the four-year graduation rate for men.

Question 4: For this question, we will be using the sandwich dataset, which includes the number of ants found on a sandwich left out at a picnic, according to the toppings, bread, and butter that were on it:

sandwich <- read.csv("https://collinn.github.io/data/sandwich.csv")

Perform a two sample t-test to determine if the mean number of ants on a sandwich was the same for sandwiches that had butter compared to those that did not.

Question 5: For this question, we will be using the hawks dataset:

hawks <- read.csv("https://collinn.github.io/data/hawks.csv")

## (Do this for part 4)
hawks <- filter(hawks, Hallux < 50)

  1. First, use dplyr to filter the dataset to only include sharp-shinned hawks, Species == "SS"
  2. Create a boxplot showing the distribution of hallux length in sharp-shinned hawks for each of the two sexes. What do you notice?
  3. Perform a two-sample t-test to determine if there is evidence suggesting a difference in average hallux (killing talon) length between male and female hawks. Based on this test, what would you conclude?
  4. Filter your sharp-shinned dataset again to remove outliers, using the criterion Hallux < 50
  5. Recreate your boxplot and repeat your test from part (3). What impact did outliers have on the results of your test? Why do you think that is?

Paired Testing

We can implement paired t-tests in R by simply setting the argument paired = TRUE in the t.test() function. Here, for example, is the data for the French language institute we saw in class:

french <- read.csv("https://collinn.github.io/data/french_institute.csv")

head(french)
##   pre post
## 1  32   34
## 2  31   31
## 3  29   35
## 4  10   16
## 5  30   33
## 6  33   36

Because pre and post test scores are each contained in their own column, we will use the second method of passing our arguments to the t.test() function:

t.test(french$pre, french$post, paired = TRUE)
## 
##  Paired t-test
## 
## data:  french$pre and french$post
## t = -3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.8539 -1.1461
## sample estimates:
## mean difference 
##            -2.5

Question 6: Use dplyr and the mutate() function to create a new variable in the french dataset called diff, containing the difference between the pre and post scores. Perform a one-sample t-test on this variable and compare it with the paired output above. How do the t-statistic, degrees of freedom, and p-values compare?

Question 7: Included below is Karl Pearson’s father/son height data. Using this dataset, perform a paired t-test to determine if there was an average difference between father and son heights.

height <- read.csv("https://collinn.github.io/data/pearson.csv")

\(\chi^2\) Tests

This portion of the labs deals with tests of goodness-of-fit and tests of independence for categorical variables. Both tests are performed with the construction of the \(\chi^2\) test statistic, computed as

\[ \chi^2 = \sum_{i=1}^k \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \] where \(i = 1, \dots, k\) iterates through each of the cells of the table and where the expected counts are those we would see under the null hypothesis. The resulting test statistic follows a \(\chi^2\) distribution, with degrees of freedom calculated as:

  • In the one categorical case, \(df = k-1\) where \(k\) is the number of unique categories
  • In the two categorical case, \(df = (\text{# rows} - 1) \times (\text{# columns} - 1)\)

In both instances, we can compute this test statistic in R using the chisq.test() function which will give us a value for the test statistic, the degrees of freedom, and the resulting p-value under the null hypothesis.

Goodness-of-fit Tests (1 Variable)

The one categorical variable case, in the context of a \(\chi^2\) test (note: you can write \(\chi^2\) in rmarkdown by typing $\chi^2$), is the goodness-of-fit test. To use the chisq.test() function, all we need is a vector that includes the counts of each of our categories, along with the proportions we should expect to see under the null hypothesis.

For example, consider again our example from class in which we wish to create a standardized exam with 400 questions, each question having multiple choice answers A-E. If we wish to create an exam in which each answer is equally likely to be correct, we would have a null hypothesis of the form

\[ H_0: p_A = p_B = p_C = p_D = p_E = 0.2 \] where the expected counts should be (under the null):

 A    B    C    D    E
80   80   80   80   80

In class, we considered an example where the observed distribution had the following counts:

 A    B    C    D    E
74   90   76   87   73

To conduct a \(\chi^2\) test that our observed values were drawn from the null distribution, we create a vector with these values and pass it in along with the associated probabilities:

## Create obs variable with the observed counts
obs <- c(74, 90, 76, 87, 73)

## Pass this into chisq.test, with null probabilities
chisq.test(obs, p = c(0.2, 0.2, 0.2, 0.2, 0.2))
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 3.12, df = 4, p-value = 0.54

By default, chisq.test() will assign an equal probability to each of the outcomes. In this case, because our null hypothesis sets each of these probabilities to be equal, we can omit the probability argument when calling the function:

chisq.test(obs)
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 3.12, df = 4, p-value = 0.54

In each case, we find a test statistic of \(\chi^2 = 3.12\) and a \(p\)-value of \(p = 0.54\).
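
As a quick check of the formula at the start of this section (not a step you need to repeat yourself), we can reproduce this statistic by hand from the observed and expected counts:

## Manual computation of the chi-square statistic
expected <- 400 * rep(0.2, 5)       # expected counts under the null
sum((obs - expected)^2 / expected)  # matches X-squared = 3.12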

Contrast this with the example we gave for Alameda County, where we were investigating the observed counts of each race/ethnicity in the jury pool, which we compared against proportions estimated from US Census data. Because our null hypothesis in this case no longer assumes that our expected proportions are equal, we have to pass them in ourselves. The values here are taken from the lecture slides:

jury <- c(780, 117, 114, 384, 58)
chisq.test(jury, p = c(0.54, 0.18, 0.12, 0.15, 0.01))
## 
##  Chi-squared test for given probabilities
## 
## data:  jury
## X-squared = 357, df = 4, p-value <0.0000000000000002

Again, we get the value of the \(\chi^2\) statistic, along with degrees of freedom and a \(p\)-value.
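
To see why the statistic is so large, it can help to compare the observed counts against the expected counts under the null, which are simply the total count multiplied by each null proportion:

## Expected counts under the null for the jury pool
sum(jury) * c(0.54, 0.18, 0.12, 0.15, 0.01)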

Question 8: Just as a normal distribution describes phenomena in which observations tend to fall symmetrically about the mean, the Poisson distribution is a probability distribution used to describe the number of discrete events we should expect to occur within a period of time (e.g., how many students walk into the student union each hour). This question involves assessing whether or not a particular collection of data is consistent with a Poisson distribution.

In 1898, Ladislaus von Bortkiewicz collected data on the number of soldiers in the Prussian army that were kicked to death by mules. The data collected follows 10 corps, each observed for 20 years, resulting in a total of 200 “corps-years”. In 109 corps-years, there were no deaths from mule kicks; in 65 corps-years, there was one death; and so on. Below is a table with all of the observed data:

Number of Deaths   Number of Corps-Years
0                  109
1                  65
2                  22
3+                 4

If the number of deaths via mule kick did follow a Poisson distribution, the expected number of corps-years with each death count should have the following proportions:

0 deaths   1 death   2 deaths   3+ deaths
0.55       0.33      0.10       0.02

Conduct a \(\chi^2\) test to determine whether or not the observed data is consistent with what we should expect to see if death-by-mule-kick truly did follow a Poisson distribution. What \(p\)-value do you find, and what does this say about your null hypothesis? (note: you may ignore the warning about expected counts)

Tests for independence (2 Variables)

When considering two categorical variables, we are (usually) less interested in assessing goodness-of-fit in the expected counts of two variables and more interested in determining if there is any sort of association between the two variables. In other words, for a single observation, does knowing the value of one categorical variable give us information about the other?

To an extent, we have seen this already when considering two-way tables of categorical variables. Consider, for example, the data we had seen on auto accidents in Florida, recording the number of fatal and non-fatal accidents along with seatbelt use. That data is summarized again in the following table:

              Fatal   Non-fatal
No Seatbelt    1085       55623
Seatbelt        703      441239

For our \(\chi^2\) test, we need only pass a matrix or table of these values and chisq.test() will automatically perform a test of independence:

seatbelt <- matrix(c(1085, 703, 55623, 441239), nrow = 2)
chisq.test(seatbelt) # we do not need to add proportions here
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  seatbelt
## X-squared = 4324, df = 1, p-value <0.0000000000000002
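
As an optional step, we can label the rows and columns of the matrix so that it prints in the same layout as the table above (the labels have no effect on the test itself):

## Label rows/columns to match the table above
dimnames(seatbelt) <- list(c("No Seatbelt", "Seatbelt"),
                           c("Fatal", "Non-fatal"))
seatbelt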

Question 9: This question will use the college dataset loaded below:

college <- read.csv("https://collinn.github.io/data/college2019.csv")

  • Part A: Create a proportional bar chart between the variables Region and Type. Does it appear that there is an association between these two variables? In other words, does knowing anything about the region tell you anything about the proportion of Public or Private colleges?
  • Part B: Using the table() function, create a 2-way table of these two variables. Save this table to a new variable called tab.
  • Part C: Conduct a \(\chi^2\) test on the observed frequencies that you found in the table in Part B. Based on this, what would you conclude about there being an association between the variables Region and Type?