| Understand the Concept | Apply the Concept in R |
|---|---|
| • Recognize when the chi-square test of independence is appropriate. • State the null and alternative hypotheses. • Distinguish the chi-square test of independence from the goodness-of-fit test. | • Run a chi-square test of independence in R. |
A study asks participants if they prefer to drink coffee, tea or juice for breakfast. Among 60 people, 32 said coffee, 6 said tea and 22 said juice.
Research question: Do 40% of people like coffee, 20% like tea, and 40% like juice?
That is a question the chi-square goodness-of-fit test is built for! That is: does the information we observe fit the proportions we hypothesized?
\[\chi^2 = \sum_{i=1}^k \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} = \frac{(32-60 \times 0.4)^2}{60 \times 0.4} + \frac{(6-60 \times 0.2)^2}{60 \times 0.2} + \frac{(22-60 \times 0.4)^2}{60 \times 0.4} = \frac{35}{6}\]
Using chi-square distribution with \(k-1 = 2\) degrees of freedom, we have:
1 - pchisq(35/6, 2)
## [1] 0.05411377
chisq.test(x = c(coffee = 32, tea = 6, juice = 22), p = c(0.4,0.2,0.4))
##
## Chi-squared test for given probabilities
##
## data: c(coffee = 32, tea = 6, juice = 22)
## X-squared = 5.8333, df = 2, p-value = 0.05411
A study asks participants if they prefer to drink coffee, tea, or juice for breakfast, and the researchers suspect the answers may differ between math majors and non-math majors.
For 60 math majors, 32 said coffee, 6 said tea and 22 said juice.
For 40 non-math majors, 15 said coffee, 6 said tea and 19 said juice.
Now we need a new statistical tool.
You may wonder: What if we use one group as the ground truth, and just compare the second group to the first group?
It’s a common and very tempting thought. The short answer is: we can’t treat the first group as the ground truth, because that group is itself a sample and carries sampling variability of its own.
\(H_0:\) There is no association between major and breakfast drink preference.
\(H_a:\) There is an association between major and breakfast drink preference.
Under the null hypothesis, both groups are drawn from a single underlying, unknown distribution. The best estimate of this shared distribution pools the data, using all the information we have: \(47/100\) chose coffee, \(12/100\) chose tea, and \(41/100\) chose juice. Applying these pooled proportions to each group's size gives the counts we would expect if the null hypothesis is true:

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math (n = 60) | 28.2 | 7.2 | 24.6 |
| NonMath (n = 40) | 18.8 | 4.8 | 16.4 |
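As a check, here is one way to compute these expected counts directly in R. This is a small sketch; the object names obs and expected are our own choices, not part of the original study.

# Observed counts: rows are majors, columns are drinks
obs <- matrix(c(32, 6, 22,
                15, 6, 19),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Math", "NonMath"), c("Coffee", "Tea", "Juice")))
# Expected counts under H0: (row total x column total) / grand total for each cell
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected
##         Coffee Tea Juice
## Math      28.2 7.2  24.6
## NonMath   18.8 4.8  16.4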
Next, we use the same type of formula to calculate the test statistic, now summing over every cell of the table:
\[\begin{aligned} \chi^2 & = \sum_{i=1}^r \sum_{j=1}^c \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}} \\ & = \frac{(32-28.2)^2}{28.2} + \frac{(6-7.2)^2}{7.2} + \frac{(22-24.6)^2}{24.6} + \frac{(15-18.8)^2}{18.8} + \frac{(6-4.8)^2}{4.8} + \frac{(19-16.4)^2}{16.4} \\ & \approx 2.47 \end{aligned}\]
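The same arithmetic can be done in one line of R, reusing the obs and expected objects from the sketch above:

# Sum of (observed - expected)^2 / expected over all six cells
sum((obs - expected)^2 / expected)
## [1] 2.467134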
Finally, we conduct the hypothesis test. The degrees of freedom is
\[(\text{Number of rows} - 1) \times (\text{Number of columns} - 1)\]
In this case, it is \((2-1) \times (3-1) = 2\). We can finish the test in one of two ways; either way, we fail to reject the null hypothesis.
First, we can get the critical value and compare it to 2.47. Since 2.47 is below the critical value, we fail to reject.
qchisq(0.95, df = 2)
## [1] 5.991465
Alternatively, we can directly get the p-value.
1 - pchisq(2.47, df = 2)
## [1] 0.2908348
Code box:
What is the pchisq function?
Previously we introduced the qt function, which gives the critical value corresponding to a specific quantile of the t-distribution. Here, pchisq gives the probability that a chi-square distributed random variable is less than or equal to a specific value. As you might imagine, there are similarly qchisq, pt, and in general p/q + the name of the distribution for many common distributions.
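For example, the p and q functions are inverses of each other, which we can verify directly:

# The 95th percentile of a chi-square distribution with 2 degrees of freedom...
qchisq(0.95, df = 2)
## [1] 5.991465
# ...and the probability of being at or below that value is, by construction, 0.95
pchisq(qchisq(0.95, df = 2), df = 2)
## [1] 0.95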
To make our lives simpler, we can also use this formula when calculating the expected count. See the math box below on why that’s true.
\[\text{Expected count} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}\]
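For example, for the Math-Coffee cell above, the row total is 60, the column total is \(32 + 15 = 47\), and the grand total is 100:

\[\text{Expected count}_{\text{Math, Coffee}} = \frac{60 \times 47}{100} = 28.2\]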
Math box (Optional):
Recall that in a test of independence, the null hypothesis is that the two categorical variables are independent.
If they are indeed independent (i.e. under the null hypothesis), the probability of being in cell \((i,j)\) is:
\[P(\text{Row} \; i \; \cap \text{Column} \; j) = P(\text{Row} \; i) P(\text{Column} \; j)\]
If the total size is \(N = \text{Grand total}\), then the expected count in cell \((i,j)\) is:
\[E_{ij} = N \times P(\text{Row} \; i \; \cap \text{Column} \; j)\]
While we don’t know the probability \(P(\text{Row} \; i)\) or \(P(\text{Column} \; j)\) in the population, the sample proportions are the best unbiased estimates.
So to put it together,
\[E_{ij} = N \times \frac{\text{row total}_i}{N} \times \frac{\text{column total}_j}{N} = \frac{\text{row total}_i \times \text{column total}_j}{N}\]
# Observed counts: rows are majors, columns are drinks
data <- matrix(c(32, 6, 22, 15, 6, 19), nrow = 2, byrow = TRUE)
colnames(data) <- c("Coffee", "Tea", "Juice")
rownames(data) <- c("Math", "NonMath")
chisq.test(data)
## Warning in chisq.test(data): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data
## X-squared = 2.4671, df = 2, p-value = 0.2913
Code box:
Sometimes you may see a warning message
Warning in chisq.test(data): Chi-squared approximation may be incorrect.
It usually means that R thinks some of the expected counts are too small for this to be done exactly. The test statistic is only approximately chi-square distributed, and the approximation becomes more accurate as the expected counts grow. The common rule of thumb is that every expected count should be at least 5, and the warning is triggered here because one expected count is 4.8.
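You can inspect the expected counts behind the warning yourself: chisq.test() stores them in the expected component of its result. (The warning prints again because the test is rerun.)

# Expected counts from the test object; note the 4.8 in the NonMath-Tea cell
chisq.test(data)$expected
## Warning in chisq.test(data): Chi-squared approximation may be incorrect
##         Coffee Tea Juice
## Math      28.2 7.2  24.6
## NonMath   18.8 4.8  16.4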
Now that we have conquered a categorical variable with 2 levels versus one with 3 levels, you can try a case where both categorical variables have 3 levels!
For the data below, test if there’s an association between major and drink preference, both by hand and using R.

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math | 22 | 9 | 19 |
| CS | 16 | 6 | 8 |
| Biology | 10 | 2 | 8 |
Calculate expected counts
The expected count formula is: \(E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N}\). With column totals of 48 (coffee), 17 (tea), and 35 (juice), and \(N = 100\):

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math (row total = 50) | 24.0 | 8.5 | 17.5 |
| CS (row total = 30) | 14.4 | 5.1 | 10.5 |
| Biology (row total = 20) | 9.6 | 3.4 | 7.0 |
Calculate the chi-square test statistic
Each cell contributes \(\frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}\):

| | Coffee | Tea | Juice |
|---|---|---|---|
| Math | 0.167 | 0.029 | 0.129 |
| CS | 0.178 | 0.159 | 0.595 |
| Biology | 0.017 | 0.576 | 0.143 |

\(\chi^2 = 0.167+0.029+0.129+0.178+0.159+0.595+0.017+0.576+0.143 = 1.993\)
Degrees of Freedom
\(df = (r-1)(c-1) = (3-1)(3-1) = 4\)
Do the test
With a p-value of 0.74, we fail to reject the null hypothesis.
1 - pchisq(1.993, df = 4)
## [1] 0.7370465
table <- matrix(
c(22, 9, 19, # Math
16, 6, 8, # CS
10, 2, 8), # Biology
nrow = 3,
byrow = TRUE
)
rownames(table) <- c("Math", "CS", "Biology")
colnames(table) <- c("Coffee", "Tea", "Juice")
chisq.test(table)
## Warning in chisq.test(table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 1.9925, df = 4, p-value = 0.7371
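As a check on the by-hand work above, the squared Pearson residuals returned by chisq.test() are exactly the per-cell contributions we added up. A small sketch, suppressing the small-count warning:

# Per-cell contributions (observed - expected)^2 / expected, rounded to 3 decimals
round(suppressWarnings(chisq.test(table))$residuals^2, 3)
##         Coffee   Tea Juice
## Math     0.167 0.029 0.129
## CS       0.178 0.159 0.595
## Biology  0.017 0.576 0.143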
Exit question: When thinking about the association between two variables, regression is a common choice. Can we use regression tools we’ve learned to accomplish what the chi-square test is doing here?
A company asks residents of three Iowa cities about their favorite activity, trying to determine whether residents’ favorite activities differ based on where they live.
Answer the following questions:
1. State the null and alternative hypotheses for the chi-square test of independence in this setting.
2. Why do we use pooled proportions when calculating expected values under the null hypothesis?
3. Compute the expected count for Des Moines - Yoga, and explain what it means in this context.
4. For Des Moines - Yoga, what’s its contribution to the chi-square test statistic (i.e. the term \(\frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}\))?
5. Suppose the total chi-square test statistic is 10 (a random number I came up with). Finish the rest of the hypothesis testing process utilizing the pchisq function.
Expected count for Des Moines - Yoga: \(35 \times 35 / 100 = 12.25\). Under the null hypothesis of no association between city and activity, we would expect about 12.25 Des Moines residents to name yoga as their favorite activity.
Its contribution to the chi-square test statistic: \((8-12.25)^2/12.25 \approx 1.47\).
Using \(\alpha = 0.05\), we reject the null hypothesis. However, the p-value is fairly close to the nominal significance level.
1 - pchisq(10, df = 4)
## [1] 0.04042768