knitr::opts_chunk$set(echo = TRUE,
                      fig.align = 'center',
                      fig.width = 4,
                      fig.height = 4,
                      message = FALSE, warning = FALSE)
Perhaps a gross oversight on my part not to have mentioned it sooner: a critically important function in R for testing the independence of contingency tables is the chisq.test() function. Given either a table or a matrix, chisq.test() will perform the necessary test for us:
m <- matrix(c(40, 50, 60, 70), nrow = 2)
chisq.test(m)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: m
## X-squared = 0.0127, df = 1, p-value = 0.91
Note: in the case of 2x2 tables, the statistic found here will be slightly different from what we would get if doing this by hand. This is on account of the Yates' continuity correction, which is applied by default (see ?chisq.test for details). We can disable this correction with the additional argument correct = FALSE.
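For instance, a quick check (re-using the matrix m from above) recovers the uncorrected statistic:
## Same test without Yates' continuity correction
chisq.test(m, correct = FALSE)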
Like most model-fitting and testing functions in R, chisq.test() will both print summary output for the test and return an object. This object can be saved to a variable name that then shows up in our environment.
## Check this out in your environment
res <- chisq.test(m)
Although beyond the scope of this class, it's worth briefly noting that the class of these objects determines how different functions work with them (resulting in the very helpful behavior where the function summary(), for example, returns appropriate but different summaries for lm linear models and glm generalized linear models). We can see the class of an object with the class() function:
class(res)
## [1] "htest"
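If we are curious which generic functions have methods for this class (purely an aside, not needed for the lab), we can ask R directly:
## List the generics with methods for class "htest"
methods(class = "htest")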
Underneath this class, however, sits the object's type. In R, nearly every object that isn't a vector is of type list, including data.frames, linear models, ggplots, and the output from statistical tests:
typeof(res)
## [1] "list"
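As a quick check of this claim, consider a small (hypothetical) data.frame:
## data.frames have class "data.frame" but type "list"
df <- data.frame(x = 1:3)
class(df)  ## "data.frame"
typeof(df) ## "list"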
The ubiquity of lists comes from their nearly infinite flexibility. Their primary utility for now comes from the fact that we can extract elements from a list by name. For example, we can investigate res in our environment pane or use the str() (structure) function to quickly illustrate the components available:
str(res)
## List of 9
## $ statistic: Named num 0.0127
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.91
## $ method : chr "Pearson's Chi-squared test with Yates' continuity correction"
## $ data.name: chr "m"
## $ observed : num [1:2, 1:2] 40 50 60 70
## $ expected : num [1:2, 1:2] 40.9 49.1 59.1 70.9
## $ residuals: num [1:2, 1:2] -0.142 0.13 0.118 -0.108
## $ stdres : num [1:2, 1:2] -0.25 0.25 0.25 -0.25
## - attr(*, "class")= chr "htest"
From this, we can extract important components, such as the matrix of expected counts we would get under independence:
## Expected counts from chi-square model under independence
res[["expected"]]
## [,1] [,2]
## [1,] 40.909 59.091
## [2,] 49.091 70.909
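The same bracket syntax (or the equivalent $ shorthand) works for any of the components listed by str(); for example, the p-value and the test statistic:
## Extract components of the test by name
res[["p.value"]]
res$statistic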
This technique is frequently used in simulation settings where we may want to run sequences of tests or fit many models, extracting the important summary information from each without having to store the entire object (which can be very memory intensive for large models).
We illustrate this with a toy example below, simulating 20,000 tables under the assumption of independence to show (1) that the p-values under the null hypothesis are uniformly distributed and (2) that the \(\chi^2\) statistics for 2x2 tables generated under the null hypothesis do indeed follow a \(\chi^2\) distribution with df = 1.
## Generate p-values and statistics from 20,000 null tables
n <- 2e4
p <- vector("numeric", length = n)
stat <- vector("numeric", length = n)
for (i in seq_len(n)) {
  X <- rbinom(1, 100, 0.75)
  Y <- rbinom(1, 100, 0.75)
  m <- matrix(c(X, Y, 100 - X, 100 - Y), nrow = 2)
  cc <- chisq.test(m, correct = FALSE)
  p[i] <- cc[["p.value"]]
  stat[i] <- cc[["statistic"]]
}
## p-values have uniform distribution under null
par(mfrow = c(1, 2)) # make base R plots side by side
hist(p, breaks = 8)
## Statistic follows chisq(df = 1) under null
hist(stat, probability = TRUE, xlim = c(0, 10))
f <- function(x) dchisq(x, df = 1)
curve(f, from = 0, to = 10, add = TRUE, col = 'red', lwd = 2)
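As one further sanity check (an optional addition to the example above), if the p-values really are uniform, then rejecting at \(\alpha = 0.05\) should happen about 5% of the time:
## Proportion of null tests rejected at alpha = 0.05
mean(p < 0.05) ## should be close to 0.05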
Here, we introduce loglinear Poisson GLMs in the context of two-way contingency tables. As we have no continuous explanatory variables, our data is stored in a manner similar to the binomial data from the previous lab, with one row for each combination of the row/column variables along with a column for the count:
heaven <- read.table("http://www.stat.ufl.edu/~aa/cat/data/HappyHeaven.dat",
header = TRUE)
heaven
## happy heaven count
## 1 not no 32
## 2 not yes 190
## 3 pretty no 113
## 4 pretty yes 611
## 5 very no 51
## 6 very yes 326
We can specify that we are fitting a Poisson model in our GLM by specifying poisson in the family argument; by default, the link function used is the log, but we can further specify it if we wish (or, as with linear probability models, we can modify our count model to be linear as well):
glm(count ~ heaven, family = poisson, data = heaven) %>% summary()
##
## Call:
## glm(formula = count ~ heaven, family = poisson, data = heaven)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.1795 0.0714 58.5 <0.0000000000000002 ***
## heavenyes 1.7492 0.0774 22.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1019.87 on 5 degrees of freedom
## Residual deviance: 295.76 on 4 degrees of freedom
## AIC: 340.4
##
## Number of Fisher Scoring iterations: 4
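If we do want to be explicit about the link, we can pass it to the family function. The log link shown first is exactly what the default already uses; the identity link is a sketch of the linear count model mentioned above (which, in less well-behaved data, may need starting values to converge):
## Explicit log link (identical to the default above)
glm(count ~ heaven, family = poisson(link = "log"), data = heaven)
## Identity link: a linear rather than loglinear count model
glm(count ~ heaven, family = poisson(link = "identity"), data = heaven)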
Part A Create a GLM using the heaven dataset with happy as the single explanatory variable, and save the model to a variable called m0. Considering the residual deviance, assess whether this model fits the data well (Hint: pchisq()).
Part B Repeat Part A, this time including both happy and heaven in your model. Save this model as a variable called m1.
Part C Turn the heaven dataset first into a data.frame with 1323 rows (one for each observation) and then create a table on the variables happy and heaven. Conduct a \(\chi^2\) test of independence on this table. Does the p-value for this test match the p-value you found in Part B? What does this say about the relationship between the Poisson loglinear model and the \(\chi^2\) test for independence?
Part D Using the method from the beginning of the lab, identify the portions of m0 and m1 that contain the deviance. Extract these deviances and perform a likelihood ratio test on these two models. Confirm that it matches the output of comparing these models using anova(). Based on this, which model would you prefer?
Part E If the \(\chi^2\) test from Part C suggests that the two variables happy and heaven are independent, i.e., if the odds of believing in heaven are the same for each level of happiness, why is model m1 so much better than m0? In other words, why do I also need coefficients for the happy variable if I can already find the odds for heaven? (Hint: what specifically is the response variable for this model?)