library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
  bs <- replicate(n, {
    sb <- sample(x, replace = TRUE)
    statistic(sb)
  })
  data.frame(Sample = seq_len(n), 
             Statistic = bs)
}

Introduction

This lab is intended to serve as an introduction to hypothesis testing. Where available, we will use built-in functions in R to find and report p-values. In particular, we will investigate finding p-values for proportions, differences in proportions, odds ratios, means, and differences of means.

Note (4/12)

A quick note on these tests and statistical tests done in R generally – much of what we have introduced and talked about in class describes the basic theoretical justification for hypothesis testing. That is, establish an assumption via the null hypothesis, construct a distribution of values we might see under the null, and then determine whether our observed data is inconsistent with what we had assumed. This general procedure underlies much of statistical inference.

As is often the case in many disciplines, there are typically several ways to solve a problem. An example that came up for some students involves the question of whether \(p_0\) or \(\hat{p}\) should be used in the construction of the standard error. Both are valid constructions, and though they offer slightly different results, asymptotically (that is, as \(n \rightarrow \infty\)) they lead to the same conclusion.
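
To make this concrete, here is a small sketch comparing the two constructions of the test statistic; the values \(p_0 = 0.5\), \(\hat{p} = 0.6\), and \(n = 100\) are hypothetical, chosen only for illustration:

## Two valid standard errors for a one-sample proportion test
p0   <- 0.5   # hypothesized proportion under the null
phat <- 0.6   # observed sample proportion
n    <- 100   # sample size (hypothetical)

se_null <- sqrt(p0 * (1 - p0) / n)      # standard error built from the null value
se_hat  <- sqrt(phat * (1 - phat) / n)  # standard error built from the estimate

## The two z-statistics differ slightly (2 versus about 2.04),
## but they lead to the same conclusion
c((phat - p0) / se_null, (phat - p0) / se_hat)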

This will be the case for many of the functions we will use in R for hypothesis testing; the solutions that you find here will not necessarily match what you would find by hand following the procedures we introduce in class. As such, guidance for answering questions should be as follows:

  1. When a question is strictly conceptual (i.e., how would you do this, how does changing X impact Y), use what you have learned in lecture or from your notes to answer the question
  2. When the question asks to report a value or an interval, use what is provided by the output in R
  3. If a question combines a conceptual question with an applied example, use R to find your values, but continue to let your answer be guided by concepts. An example might be asking about sample size and confidence intervals for the binomial – the methods introduced in class will always result in symmetrical intervals, while those provided by binom.test() will not. However, the impact of sample size on the width of intervals will act the same in both cases.

One Sample Hypothesis Testing

This set of tests involves collecting data on a single group of observations and testing various hypotheses.

Testing proportions

While we are able to use our normality assumptions in constructing confidence intervals and p-values for single proportions, we are also able to compute them exactly using the theoretical properties of the true distribution. That is, while counting the number of heads in successive coin flips will become approximately normal as the number of flips increases, we can also utilize the fact that coin flipping follows a binomial distribution. As such, we can use an exact binomial test to compute our confidence intervals and p-values rather than relying on a normal approximation. This is done with the R function binom.test(). This function takes several arguments which you can learn more about with ?binom.test.

For now, let’s consider again the Johns Hopkins study that found 31 of 39 infants born preterm survived to 6 months. Here, we want to test the hypothesis \(H_0: p_0 = 0.7\). To do this, we will use binom.test() with the arguments:

  • x – the number of “successes” we observed
  • n – the total number of observations
  • p – our hypothesized proportion

The function will print out a litany of useful information:

binom.test(x = 31, n = 39, p = 0.7)
## 
##  Exact binomial test
## 
## data:  31 and 39
## number of successes = 31, number of trials = 39, p-value = 0.22
## alternative hypothesis: true probability of success is not equal to 0.7
## 95 percent confidence interval:
##  0.63536 0.90704
## sample estimates:
## probability of success 
##                0.79487

First, observe the bottom line, “probability of success”, which is simply our observed test statistic, \(\hat{p} = 31/39\). Note, however, that the \(p\)-value found here is not the same as what we found in the slides. This is a consequence of the fact that here we are utilizing assumptions from the binomial distribution rather than our normal approximation.

Just as the \(p\)-values change, so too do our confidence intervals. For example, in our slides Wednesday we found a 95% confidence interval to be (0.668, 0.922), while here it is slightly different: (0.635, 0.907).
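
If you would like to verify this for yourself, a quick sketch of the normal-approximation interval from class, using \(\hat{p} = 31/39\), reproduces the value from the slides:

## Normal-approximation 95% CI for comparison with binom.test()
phat <- 31 / 39
se <- sqrt(phat * (1 - phat) / 39)
phat + c(-1, 1) * qnorm(0.975) * se  # approximately (0.668, 0.922)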

It is worth noting that we can also use binom.test() for the construction of confidence intervals. By default, the interval printed out is 95%, but we could just as well find an 80% confidence interval using the argument conf.level:

binom.test(x = 31, n = 39, p = 0.7, conf.level = 0.8)
## 
##  Exact binomial test
## 
## data:  31 and 39
## number of successes = 31, number of trials = 39, p-value = 0.22
## alternative hypothesis: true probability of success is not equal to 0.7
## 80 percent confidence interval:
##  0.68823 0.87671
## sample estimates:
## probability of success 
##                0.79487

Question 1: We see above that changing our confidence interval to 80% from 95% changed the interval, but not the resulting p-value. Why is this the case?

Question 2: You might also notice that the confidence interval we found using binom.test() is no longer symmetric around the estimate \(\hat{p}\). Why is this the case?

Question 3: There is an intimate relationship between p-values and confidence intervals in that the exact same attributes of our sample data are used to construct each. Presented below are two scenarios:

  1. We flipped a coin 5 times and it landed on heads 4 times
  2. We flipped a coin 50 times and it landed on heads 40 times

Using the binom.test() function, create 95% confidence intervals and construct a \(p\)-value for the null hypothesis \(p_0 = 0.5\). Then answer the following:

  1. What is our estimate of \(\hat{p}\) for each of these scenarios?
  2. What changed in our \(p\)-values and confidence intervals?
  3. Which of these scenarios provides more evidence against the null hypothesis? Why?

Testing means

In R, we can perform hypothesis tests and create standard confidence intervals based on the \(t\)-distribution using the function t.test(). While the t.test() function will take a number of useful arguments that we will use in the future, for now we only concern ourselves with three:

  • x, a numeric vector we wish to construct a confidence interval for
  • mu, a single number indicating the null hypothesis
  • conf.level, which gives the confidence level of the interval (i.e., \(1 - \alpha\) for a specified \(\alpha\))

The output of the t.test() function will include a collection of useful information, including the point estimate for the mean value of x as well as a confidence interval for this value. For example, consider the built-in dataset mtcars in R. Here, we use t.test() to find a 90% confidence interval for average miles per gallon:

t.test(mtcars$mpg, conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 18.9, df = 31, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

From the output, you should note a few things:

  1. At the bottom of the output, we see our point estimate, \(\overline{x} = 20.09\)

  2. Just above the point estimate, we see our estimate of the 90% confidence interval, here \((18.28, 21.90)\)

  3. All of the information above this is related to hypothesis testing. This includes:

    • A \(t\)-statistic, equal to \(t = 18.9\)
    • Degrees of freedom equal to \(df = 31\)
    • A \(p\)-value of \(p < 2 \times 10^{-16}\)

However, as we were not conducting any hypothesis test in this particular example, this information can be ignored.

By default, the t.test() function assumes a null hypothesis of \(H_0: \mu_0 = 0\). As our t-statistic is computed as

\[ t = \frac{\overline{x} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] this gives rise to the computed statistic \(t = 18.9\) that we saw in the output (and was subsequently used to compute the p-value).
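
To see where this number comes from, we can reproduce the statistic by hand (a quick check, not something you will usually need to do):

## Hand computation of the default t-statistic (null value of 0)
xbar <- mean(mtcars$mpg)                         # point estimate
se <- sd(mtcars$mpg) / sqrt(length(mtcars$mpg))  # estimated standard error
(xbar - 0) / se                                  # approximately 18.9, matching the output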

To see a more realistic example, suppose we wanted to investigate the hypothesis that the average miles per gallon was equal to 18 mpg. That is, suppose we have \(H_0: \mu_0 = 18\). We could evaluate this hypothesis using the t.test() function with the mu argument:

t.test(x = mtcars$mpg, 
       mu = 18, 
       conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 1.96, df = 31, p-value = 0.059
## alternative hypothesis: true mean is not equal to 18
## 90 percent confidence interval:
##  18.284 21.897
## sample estimates:
## mean of x 
##    20.091

In this example, note that our point estimate did not change, nor did our 90% confidence interval. What did change, however, was the calculated t-statistic as well as the p-value.
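
Repeating the hand computation from above, now with \(\mu_0 = 18\) in the numerator, reproduces the new statistic:

(mean(mtcars$mpg) - 18) / (sd(mtcars$mpg) / sqrt(length(mtcars$mpg)))  # approximately 1.96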

Question 4: Explain why both the t-statistic and the p-value changed in our second call of t.test() while the point estimate and the confidence interval stayed the same.

Question 5: Using the subsets of the data below, find and report 95% confidence intervals for total enrollment at public and private colleges.

library(dplyr)
college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

col_priv <- filter(college, Private == "Private")
col_pub <- filter(college, Private == "Public")

Testing Correlation

Testing and evaluating confidence intervals for correlation can be done with the function cor.test(). For example, using our college data, suppose we are interested in evaluating the correlation between net tuition and average faculty salary:

cor.test(college$Net_Tuition, college$Avg_Fac_Salary)
## 
##  Pearson's product-moment correlation
## 
## data:  college$Net_Tuition and college$Avg_Fac_Salary
## t = 19.6, df = 1648, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.39538 0.47364
## sample estimates:
##     cor 
## 0.43534
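
Though not something we will need going forward, the t-statistic in this output is the usual transformation of the sample correlation, \(t = r \sqrt{df} / \sqrt{1 - r^2}\) with \(df = n - 2\); a quick check using the values reported above:

r <- 0.43534
r * sqrt(1648) / sqrt(1 - r^2)  # approximately 19.6, matching the output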

Just as with the other testing functions, we can modify the width of our confidence interval with the argument conf.level:

cor.test(college$Net_Tuition, college$Avg_Fac_Salary, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  college$Net_Tuition and college$Avg_Fac_Salary
## t = 19.6, df = 1648, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.40939 0.46057
## sample estimates:
##     cor 
## 0.43534

Question 6: Using the ToothGrowth data built into R that we have seen previously, construct a 90% confidence interval for the correlation between dose of Vitamin C and tooth length. What is the null hypothesis when evaluating if two variables are correlated? Does this data seem to support or refute our null hypothesis?

Question 7: Revisit the lab on ggplot to refresh how to use a data.frame to create a plot in R. For the ToothGrowth dataset, create a scatter plot showing the relationship between “dose” on the x-axis and “len” (length) on the y-axis. Does this plot seem consistent with your conclusion in Question 6? (Hint: Use geom_jitter() instead of geom_point() for a slightly nicer looking plot)

Two Sample Hypothesis Testing

This set of tests involves comparing proportions or means between two groups.

Difference in proportions

We have previously seen hypothesis testing for a single proportion, for example, establishing a null hypothesis of the form \(H_0: p = p_0\) when evaluating the outcome of a series of binary events, i.e., what proportion of severely pre-term infants survive past 6 months? A related question seeks to compare the proportion of events between two separate groups. For example, consider giving a drug and a placebo to two groups of patients and investigating the proportion of recipients who develop negative side effects: if the proportion of side effects is the same between the two groups, we may conclude that there is no association between the drug and side effects. Likewise, if this proportion is different, we may conclude that there is an association between the groups and the outcome.

One method for investigating this is examining the difference in proportions. For groups 1 and 2 with proportions \(p_1\) and \(p_2\), respectively, we might frame the null hypothesis that there is no difference as

\[ H_0: p_1 - p_2 = 0 \] As with the single proportion case, the central limit theorem also applies to the difference in proportions, with the approximate distribution being

\[ \hat{p}_1 - \hat{p}_2 \sim N \left(p_1 - p_2, \ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right) \] where \(n_1\) and \(n_2\) represent the number of observations in each group. Note that the standard error term, which combines the variances of both groups, is closely related to the pooled variance we will describe in some detail when reviewing the two-sample t-test.

Again as before, a confidence interval around this estimate takes the form of

\[ (\hat{p}_1 - \hat{p}_2) \pm C \times \text{SE} \]

For our example here, consider again the Titanic dataset in R, where we investigate the number of passengers and crew members of each sex who survived or died:

## Titanic data in R
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

# Table
with(titanic, table(Sex, Survived))
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344

Of the males, we find the proportion who survived to be \(\hat{p}_1 = \frac{367}{1731}\), whereas the proportion for females is \(\hat{p}_2 = \frac{344}{470}\). Implicit in this are our group sizes, with \(n_1 = 1731\) and \(n_2 = 470\). To test the difference in proportions, we will use the prop.test() function, similar to binom.test(), but now passing in a vector indicating the number of “successes” and a vector of the group totals:

# Difference in proportion test for titanic data
prop.test(x = c(367, 344), n = c(1731, 470))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(367, 344) out of c(1731, 470)
## X-squared = 454, df = 1, p-value <0.0000000000000002
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.56569 -0.47411
## sample estimates:
##  prop 1  prop 2 
## 0.21202 0.73191

There are a few things from this output to notice. From the top:

  1. In the header of the results, we see that this is a 2-sample test for equality of proportions with a continuity correction. Recall the note at the top of the lab indicating that these differences in software will slightly modify what we would find computing this by hand
  2. Next, we see information on a “chi-squared test.” We will discuss this next week
  3. Next, we see a p-value along with a 95 percent confidence interval for the true difference in proportions. We can change the width of this confidence interval with the argument conf.level
  4. Finally, we see sample estimates for each proportion, which is equivalent to what we would have found had we computed \(\hat{p}_1\) and \(\hat{p}_2\) directly

Generally speaking, from these tests we will be interested in identifying the p-value, the confidence interval, and possibly estimates of each of the relevant proportions.
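
As a check on the note in item 1 above, here is a sketch of the interval computed by hand from the formulas in this section (i.e., using the normal approximation without the continuity correction):

## Hand computation of the 95% CI for the difference in proportions
p1 <- 367 / 1731  # proportion of males who survived
p2 <- 344 / 470   # proportion of females who survived
se <- sqrt(p1 * (1 - p1) / 1731 + p2 * (1 - p2) / 470)
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se  # roughly (-0.564, -0.475)

The result is close to, but not identical to, the corrected interval reported by prop.test().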

Question 8: The results of the clinical trial for the Pfizer Covid-19 mRNA injections were published in December 2020. The study consisted of a sample of 43,548 individuals, 21,720 of whom were given the BNT162b2 injection and 21,728 of whom were given a placebo. After a period of 6 months, 8 individuals in the drug group exhibited symptoms of Covid-19, while the number in the placebo group was 162.

  • What is the estimated proportion of individuals in each group who developed symptoms of Covid-19?
  • Find a 95% confidence interval for the true difference in proportion between these groups
  • Perform a statistical test on the difference in proportions. Based on this, what would you conclude?

Two-sample t-test

Just as with the one sample test, two sample t-tests are done with the function t.test(). There are two ways to use this function, depending on how your data is stored.

The simplest case involves passing two independent vectors. For example, here we have a vector x of length 50 and a vector y of length 20. To conduct a t-test on the difference in means of these two vectors, it’s enough to simply pass them in as the first two arguments:

## Random vectors x and y
set.seed(123)
x <- rnorm(50, mean = 20, sd = 2)
y <- rnorm(20, mean = 15, sd = 1)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 15, df = 66, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.2438 5.5491
## sample estimates:
## mean of x mean of y 
##    20.069    15.172

As before, you’ll find that the t.test() function returns a t-statistic, degrees of freedom, a p-value, a confidence interval, and then estimates of the mean of each sample. You’ll also note in this case that the heading at the top indicates that this is a “Welch Two Sample t-test”, corresponding to the fact that it does not assume the standard deviations of your samples are the same.
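
If we were willing to assume equal standard deviations, t.test() accepts a var.equal argument; setting it to TRUE performs the classical pooled (Student’s) two-sample test instead:

t.test(x, y, var.equal = TRUE)  # header will read “Two Sample t-test”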

The next two cases involve data that is stored in a data.frame. First, consider the case where each of the variables we wish to test is stored as a column in a data.frame:

## Pretend data set
set.seed(123)
df <- data.frame(A = rnorm(5, 1, 1), 
                 B = rnorm(5, 0, 1))

print(df)
##         A        B
## 1 0.43952  1.71506
## 2 0.76982  0.46092
## 3 2.55871 -1.26506
## 4 1.07051 -0.68685
## 5 1.12929 -0.44566

If the two variables that you wish to test are both columns in a data.frame, you can use the $ operator to pass them in like vectors:

## Pass in A and B columns
t.test(df$A, df$B)
## 
##  Welch Two Sample t-test
## 
## data:  df$A and df$B
## t = 1.95, df = 7.14, p-value = 0.091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.25577  2.73155
## sample estimates:
## mean of x mean of y 
##  1.193570 -0.044319

Implicit in this data structure is the assumption that these variables are the same length. More frequently, you have a data.frame that contains a column indicating groups and a second column indicating the values we want to test. For example, consider the format of the following example data:

set.seed(123)
df <- data.frame(Group = rep(c("A", "B"), times = c(6, 5)), 
                 Value = c(rnorm(6, 0, 1), rnorm(5, 0, 1)))
print(df)
##    Group     Value
## 1      A -0.560476
## 2      A -0.230177
## 3      A  1.558708
## 4      A  0.070508
## 5      A  0.129288
## 6      A  1.715065
## 7      B  0.460916
## 8      B -1.265061
## 9      B -0.686853
## 10     B -0.445662
## 11     B  1.224082

Here, there is no simple way to separate the values associated with group A from those associated with group B; indeed, to use the t.test() function as we did before would require creating multiple subsets of the data.frame.

Alternatively, we can use a formula syntax of the form Value ~ Group, indicating that we want to test the difference in means of the Value column between the groups in the Group column:

t.test(Value ~ Group, data = df)
## 
##  Welch Two Sample t-test
## 
## data:  Value by Group
## t = 1, df = 8.53, p-value = 0.34
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.75217  1.93150
## sample estimates:
## mean in group A mean in group B 
##         0.44715        -0.14252

Next we consider paired testing with the t.test() function. This can be done by adding the argument paired = TRUE when calling the function. Here is the example with the language scores from lecture:

## Pre and post test scores on French language exam
pre <- c(32,31,29,10,30,33,22,25,32,20,30,20,24,24,31,30,15,32,23,23)
post <- c(34,31,35,16,33,36,24,28,26,26,36,26,27,24,32,31,15,34,26,26)

t.test(post, pre, paired = TRUE)
## 
##  Paired t-test
## 
## data:  post and pre
## t = 3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  1.1461 3.8539
## sample estimates:
## mean difference 
##             2.5

Because the t-test is symmetric, the order in which we pass these in doesn’t matter, though it is worthwhile noting that switching the order will change the sign of the confidence interval and t-statistic, but it won’t change the p-value:

t.test(pre, post, paired = TRUE)
## 
##  Paired t-test
## 
## data:  pre and post
## t = -3.86, df = 19, p-value = 0.001
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.8539 -1.1461
## sample estimates:
## mean difference 
##            -2.5

Question 9: Using the college dataset, conduct a two-sample t-test on the difference in means for the variable ACT_median between public and private schools. Modify the arguments to the t.test() function so that it returns an estimate of the 99% confidence interval. Testing at the \(\alpha = 0.01\) level, would you reject or fail to reject the hypothesis that the median ACT between public and private schools is the same?

college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Question 10: Again using the college dataset, this time conduct a two-sample t-test on the four-year completion rate of males and females (FourYearComp_Females and FourYearComp_Males). What would you conclude from the results of your test?

Question 11: Using the pre and post variables for the French language test created above, create a new vector difference that is equal to the difference between the two. Then, do a one sample t-test on the new difference vector, testing the null hypothesis that the difference is equal to zero. How do the results of this test compare with the paired t-test done above?

\(\chi^2\) Tests

This portion of the labs deals with tests of goodness-of-fit and tests of independence for categorical variables. Both tests are performed with the construction of the \(\chi^2\) test statistic, computed as

\[ \chi^2 = \sum_{i=1}^k \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \] where \(i = 1, \dots, k\) iterates through each of the cells of the table and where the expected counts are what we should expect to see under the null hypothesis. The resulting test statistic follows a \(\chi^2\) distribution, with degrees of freedom calculated to be:

  • In the one categorical case, \(df = k-1\) where \(k\) is the number of unique categories
  • In the two categorical case, \(df = (\text{# rows} - 1) \times (\text{# columns} - 1)\)

In both instances, we can compute this test statistic in R using the chisq.test() function which will give us a value for the test statistic, the degrees of freedom, and the resulting p-value under the null hypothesis.

Goodness-of-fit Tests (1 Variable)

The one categorical variable case, in the context of a \(\chi^2\) test (note: you can write \(\chi^2\) in R Markdown by typing $\chi^2$), is the goodness-of-fit test. To use the chisq.test() function, all we need is a vector that includes the counts of each of our categories, along with the proportions we should expect to see under the null hypothesis.

For example, consider again our example from class in which we wish to create a standardized exam with 400 questions, each question having multiple-choice answers A-E. If we wish to create an exam in which each answer choice is equally likely to be correct, we would have a null hypothesis of the form

\[ H_0: p_A = p_B = p_C = p_D = p_E = 0.2 \] where the expected counts should be (under the null):

A B C D E
80 80 80 80 80

In class, we considered an example where the observed distribution had the following counts:

A B C D E
74 90 76 87 73

To conduct a \(\chi^2\) test that our observed values were drawn from the null distribution, we would create our vector with these values, along with the associated probabilities:

## Create obs variable with the observed counts
obs <- c(74,90,76,87,73)

## Pass this into chisq.test, with null probabilities
chisq.test(obs, p = c(0.2,0.2,0.2,0.2,0.2))
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 3.12, df = 4, p-value = 0.54

By default, chisq.test() will assign an equal probability to each of the outcomes. In this case, because our null hypothesis sets each of these probabilities to be equal, we can omit the probability argument when calling the function:

chisq.test(obs)
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 3.12, df = 4, p-value = 0.54

In each case, we find a test statistic of \(\chi^2 = 3.12\) and a \(p\)-value of \(p = 0.54\).
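
We can also verify this statistic directly from the formula above:

expected <- 400 * 0.2               # 80 expected counts per answer choice
sum((obs - expected)^2 / expected)  # 3.125, matching X-squared above
1 - pchisq(3.125, df = 4)           # p-value, approximately 0.54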

Contrast this with the example we gave for Alameda County, where we were investigating the observed counts of each race/ethnicity in the jury pool, which we compared against proportions estimated from US Census data. Because our null hypothesis in this case no longer assumes that our expected proportions are equal, we will have to pass them in ourselves. The values here are taken from the lecture slides:

jury <- c(780, 117,114,384,58)
chisq.test(jury, p = c(0.54, 0.18, 0.12, 0.15, 0.01))
## 
##  Chi-squared test for given probabilities
## 
## data:  jury
## X-squared = 357, df = 4, p-value <0.0000000000000002

Again, we get the value of the \(\chi^2\) statistic, along with degrees of freedom and a \(p\)-value.

Question 12: Just as a normal distribution describes phenomena in which observations tend to fall symmetrically about the mean, the Poisson distribution is a probability distribution that is used to describe the number of discrete events we should expect to occur within a period of time (i.e., how many students walk into the student union each hour). This question involves assessing whether or not a particular collection of data is consistent with a Poisson distribution.

In 1898, Ladislaus von Bortkiewicz collected data on the number of soldiers in the Prussian army that were kicked to death by mules. The data collected follows 10 corps, each observed for 20 years, resulting in a total of 200 “corps-years”. In 109 corps-years, there were no deaths from mule kicks; in 65 corps-years, there was one death, and so on. Below is a table with all of the observed data:

Number of Deaths   Number of Corps-Years
0                  109
1                  65
2                  22
3+                 4

If the number of deaths via mule kicks did match a Poisson distribution, the expected number of corps-years with each death count should have the following proportions:

0 deaths   1 death   2 deaths   3+ deaths
0.55       0.33      0.10       0.02

Conduct a \(\chi^2\) test to determine whether or not the observed data is consistent with what we should expect to see if death-by-mule-kick truly did follow a Poisson distribution. What \(p\)-value do you find, and what does this say about your null hypothesis? (note: you may ignore the warning about expected counts)

Tests for independence (2 Variable)

When considering two categorical variables, we are (usually) less interested in assessing goodness-of-fit in the expected counts of two variables and more interested in determining if there is any sort of association between the two variables. In other words, for a single observation, does knowing the value of one categorical variable give us information about the other?

To an extent, we have seen this already when considering two-way tables with categorical variables. Consider, for example, the data we had seen on auto accidents in Florida recording the number of fatal and non-fatal injuries, along with seatbelt use. That data is summarized again here in the following table:

             Fatal   Non-fatal
No Seatbelt   1085       55623
Seatbelt       703      441239

For our \(\chi^2\) test, we need only pass a matrix or table of these values and chisq.test() will automatically perform a test of independence:

seatbelt <- matrix(c(1085, 703, 55623, 441239), nrow = 2)
chisq.test(seatbelt) # we do not need to add proportions here
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  seatbelt
## X-squared = 4324, df = 1, p-value <0.0000000000000002
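
For a sense of what is happening under the hood, the expected counts in the \(\chi^2\) formula come from the row and column totals under the assumption of independence; a brief sketch:

row_tot <- rowSums(seatbelt)             # totals for each seatbelt group
col_tot <- colSums(seatbelt)             # totals for fatal / non-fatal
outer(row_tot, col_tot) / sum(seatbelt)  # expected counts under independence

Comparing these to the observed table (e.g., roughly 203 expected fatal accidents without a seatbelt versus 1085 observed) shows why the test statistic is so large.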

Question 13: This question will use the college dataset loaded below (due to a small error, it is slightly different from the one used above)

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

  • Part A: Create a proportional bar chart between the variables for Region and Private. Does it appear that there is an association between these two variables? In other words, does knowing anything about the region tell you anything about the proportion of Public or Private colleges?
  • Part B: Using the table() function, create a 2-way table of these two variables. Save this table to a new variable called tab
  • Part C: Conduct a \(\chi^2\) test on the observed frequencies that you found in the table in Part B. Based on this, what would you conclude about there being an association between the variables for Region and Private?