| Understand the Concept | Apply the Concept in R |
|---|---|
| • Recognize when the chi-square test of independence is appropriate. • State the null and alternative hypotheses. • Distinguish the chi-square test of independence from the goodness-of-fit test. | • Run a chi-square test of independence in R. |
A study asks participants if they prefer to drink coffee, tea or juice for breakfast. Among 60 people, 32 said coffee, 6 said tea and 22 said juice.
Research question: Do 40% of people like coffee, 20% like tea, and 40% like juice?
That is a question the chi-square goodness-of-fit test is built for! That is: does the information we observe fit the proportions we hypothesized?
\[\chi^2 = \sum_{i=1}^k \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} = \frac{(32-60 \times 0.4)^2}{60 \times 0.4} + \frac{(6-60 \times 0.2)^2}{60 \times 0.2} + \frac{(22-60 \times 0.4)^2}{60 \times 0.4} = \frac{35}{6}\]
Using chi-square distribution with \(k-1 = 2\) degrees of freedom, we have:
1 - pchisq(35/6, 2)
## [1] 0.05411377
chisq.test(x = c(coffee = 32, tea = 6, juice = 22), p = c(0.4,0.2,0.4))
##
## Chi-squared test for given probabilities
##
## data: c(coffee = 32, tea = 6, juice = 22)
## X-squared = 5.8333, df = 2, p-value = 0.05411
A study asks participants if they prefer to drink coffee, tea, or juice for breakfast, and the researchers suspect the answers may differ between math majors and non-math majors.
For 60 math majors, 32 said coffee, 6 said tea and 22 said juice.
For 40 non-math majors, 15 said coffee, 6 said tea and 19 said juice.
Now we need a new statistical tool.
You may wonder: What if we use one group as the ground truth, and just compare the second group to the first group?
It’s a common and very tempting thought. The short answer is: we can’t treat the first group as the ground truth, because that group is itself a sample and carries sampling variability of its own.
\(H_0:\) There is no association between major and breakfast drink preference.
\(H_a:\) There is an association between major and breakfast drink preference.
Under the null hypothesis, both groups are drawn from a single underlying, unknown distribution. The best estimate of this shared distribution pools the data, using all the information we have: \(47/100\) chose coffee, \(12/100\) chose tea, and \(41/100\) chose juice. Applying these pooled proportions to each group's size gives the counts we would expect if the null hypothesis is true:

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math (n = 60) | 28.2 | 7.2 | 24.6 |
| NonMath (n = 40) | 18.8 | 4.8 | 16.4 |
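As a check, here is one way to compute these expected counts directly in R. This is a small sketch; the object names obs and expected are our own choices, not part of the original study.

# Observed counts: rows are majors, columns are drinks
obs <- matrix(c(32, 6, 22,
                15, 6, 19),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Math", "NonMath"), c("Coffee", "Tea", "Juice")))
# Expected counts under H0: (row total x column total) / grand total for each cell
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected
##         Coffee Tea Juice
## Math      28.2 7.2  24.6
## NonMath   18.8 4.8  16.4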
Next, we use the same type of formula to calculate the test statistic, now summing over every cell of the table:
\[\begin{aligned} \chi^2 & = \sum_{i=1}^r \sum_{j=1}^c \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}} \\ & = \frac{(32-28.2)^2}{28.2} + \frac{(6-7.2)^2}{7.2} + \frac{(22-24.6)^2}{24.6} + \frac{(15-18.8)^2}{18.8} + \frac{(6-4.8)^2}{4.8} + \frac{(19-16.4)^2}{16.4} \\ & \approx 2.47 \end{aligned}\]
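The same arithmetic can be done in one line of R, reusing the obs and expected objects from the sketch above:

# Sum of (observed - expected)^2 / expected over all six cells
sum((obs - expected)^2 / expected)
## [1] 2.467134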
Finally, we conduct the hypothesis test. The degrees of freedom is
\[(\text{Number of rows} - 1) \times (\text{Number of columns} - 1)\]
In this case, it is \((2-1) \times (3-1) = 2\). We can finish the test in one of two ways; either way, we fail to reject the null hypothesis.
First, we can get the critical value and compare it to 2.47. Since 2.47 is below the critical value, we fail to reject.
qchisq(0.95, df = 2)
## [1] 5.991465
Alternatively, we can directly get the p-value.
1 - pchisq(2.47, df = 2)
## [1] 0.2908348
Code box:
What is the pchisq function?
Previously we introduced the qt function, which gives the critical value corresponding to a specific quantile of the t-distribution. Here, pchisq gives the probability that a chi-square distributed random variable is less than or equal to a specific value. As you might imagine, there are similarly qchisq, pt, and in general p/q + the name of the distribution for many common distributions.
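For example, the p and q functions are inverses of each other, which we can verify directly:

# The 95th percentile of a chi-square distribution with 2 degrees of freedom...
qchisq(0.95, df = 2)
## [1] 5.991465
# ...and the probability of being at or below that value is, by construction, 0.95
pchisq(qchisq(0.95, df = 2), df = 2)
## [1] 0.95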
To make our lives simpler, we can also use this formula when calculating the expected count. See the math box below on why that’s true.
\[\text{Expected count} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}\]
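For example, for the Math-Coffee cell above, the row total is 60, the column total is \(32 + 15 = 47\), and the grand total is 100:

\[\text{Expected count}_{\text{Math, Coffee}} = \frac{60 \times 47}{100} = 28.2\]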
Math box (Optional):
Recall that in a test of independence, the null hypothesis is that the two categorical variables are independent.
If they are indeed independent (i.e. under the null hypothesis), the probability of being in cell \((i,j)\) is:
\[P(\text{Row} \; i \; \cap \text{Column} \; j) = P(\text{Row} \; i) P(\text{Column} \; j)\]
If the total size is \(N = \text{Grand total}\), then the expected count in cell \((i,j)\) is:
\[E_{ij} = N \times P(\text{Row} \; i \; \cap \text{Column} \; j)\]
While we don’t know the probability \(P(\text{Row} \; i)\) or \(P(\text{Column} \; j)\) in the population, the sample proportions are the best unbiased estimates.
So to put it together,
\[E_{ij} = N \times \frac{\text{row total}_i}{N} \times \frac{\text{column total}_j}{N} = \frac{\text{row total}_i \times \text{column total}_j}{N}\]
# Observed counts: rows are majors, columns are drinks
data <- matrix(c(32, 6, 22, 15, 6, 19), nrow = 2, byrow = TRUE)
colnames(data) <- c("Coffee", "Tea", "Juice")
rownames(data) <- c("Math", "NonMath")
chisq.test(data)
## Warning in chisq.test(data): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data
## X-squared = 2.4671, df = 2, p-value = 0.2913
Code box:
Sometimes you may see a warning message
Warning in chisq.test(data): Chi-squared approximation may be incorrect.
It usually means that R thinks some of the expected counts are too small for this to be done exactly. The test statistic is only approximately chi-square distributed, and the approximation becomes more accurate as the expected counts grow. The common rule of thumb is that every expected count should be at least 5, and the warning is triggered here because one expected count is 4.8.
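You can inspect the expected counts behind the warning yourself: chisq.test() stores them in the expected component of its result. (The warning prints again because the test is rerun.)

# Expected counts from the test object; note the 4.8 in the NonMath-Tea cell
chisq.test(data)$expected
## Warning in chisq.test(data): Chi-squared approximation may be incorrect
##         Coffee Tea Juice
## Math      28.2 7.2  24.6
## NonMath   18.8 4.8  16.4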
Now that we have conquered a categorical variable with 2 levels versus one with 3 levels, you can try a case where both categorical variables have 3 levels!
For the data below, test if there’s an association between major and drink preference, both by hand and using R.

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math | 22 | 9 | 19 |
| CS | 16 | 6 | 8 |
| Biology | 10 | 2 | 8 |
Calculate expected counts
The expected count formula is: \(E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N}\). With column totals of 48 (coffee), 17 (tea), and 35 (juice), and \(N = 100\):

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math (row total = 50) | 24.0 | 8.5 | 17.5 |
| CS (row total = 30) | 14.4 | 5.1 | 10.5 |
| Biology (row total = 20) | 9.6 | 3.4 | 7.0 |
Calculate the chi-square test statistic
Each cell contributes \(\frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}\):

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math | 0.167 | 0.029 | 0.129 |
| CS | 0.178 | 0.159 | 0.595 |
| Biology | 0.017 | 0.576 | 0.143 |

\(\chi^2 = 0.167+0.029+0.129+0.178+0.159+0.595+0.017+0.576+0.143 = 1.993\)
Degrees of Freedom
\(df = (r-1)(c-1) = (3-1)(3-1) = 4\)
Do the test
With a p-value of 0.74, we fail to reject the null hypothesis.
1 - pchisq(1.993, df = 4)
## [1] 0.7370465
table <- matrix(
c(22, 9, 19, # Math
16, 6, 8, # CS
10, 2, 8), # Biology
nrow = 3,
byrow = TRUE
)
rownames(table) <- c("Math", "CS", "Biology")
colnames(table) <- c("Coffee", "Tea", "Juice")
chisq.test(table)
## Warning in chisq.test(table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 1.9925, df = 4, p-value = 0.7371
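As a check on the by-hand work above, the squared Pearson residuals returned by chisq.test() are exactly the per-cell contributions we added up. A small sketch, suppressing the small-count warning:

# Per-cell contributions (observed - expected)^2 / expected, rounded to 3 decimals
round(suppressWarnings(chisq.test(table))$residuals^2, 3)
##         Coffee   Tea Juice
## Math     0.167 0.029 0.129
## CS       0.178 0.159 0.595
## Biology  0.017 0.576 0.143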
Exit question: When thinking about the association between two variables, regression is a common choice. Can we use regression tools we’ve learned to accomplish what the chi-square test is doing here?
A company asks residents of three Iowa cities about their favorite activity, trying to determine whether residents’ favorite activities differ based on where they live.
Answer the following questions:
1. State the null and alternative hypotheses for the chi-square test of independence in this setting.
2. Why do we use pooled proportions when calculating expected values under the null hypothesis?
3. Compute the expected count for Des Moines - Yoga, and explain what it means in this context.
4. For Des Moines - Yoga, what’s its contribution to the chi-square test statistic (i.e. the term \(\frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}\))?
5. Suppose the total chi-square test statistic is 10 (a random number I came up with). Finish the rest of the hypothesis testing process utilizing the pchisq function.
Expected count for Des Moines - Yoga: \(35 \times 35 / 100 = 12.25\). Under the null hypothesis of no association between city and activity, we would expect about 12.25 Des Moines residents to name yoga as their favorite activity.
Its contribution to the chi-square test statistic: \((8-12.25)^2/12.25 \approx 1.47\).
Using \(\alpha = 0.05\), we reject the null hypothesis. However, the p-value is fairly close to the nominal significance level.
1 - pchisq(10, df = 4)
## [1] 0.04042768