knitr::opts_chunk$set(echo = TRUE, 
                      fig.align = 'center', 
                      fig.width = 4, 
                      fig.height = 4, 
                      message = FALSE, warning = FALSE)

Testing independence in R

Perhaps a gross oversight on my part, a critically important function for testing the independence of contingency tables in R is the chisq.test() function. Given either a table or a matrix, chisq.test() will perform the necessary test for us:

m <- matrix(c(40, 50, 60, 70), nrow = 2)
chisq.test(m)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  m
## X-squared = 0.0127, df = 1, p-value = 0.91

Note: in the case of 2x2 tables, the statistic found here will be slightly different from what we would get doing this by hand. This is on account of Yates' continuity correction, which is applied by default (see ?chisq.test for details). We can disable this correction with the additional argument correct = FALSE.
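
If we want the statistic that matches the hand calculation, we can pass this argument directly (a quick sketch; the output was not shown in the original):

## Uncorrected test statistic, matching the hand calculation
chisq.test(m, correct = FALSE)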


Like most testing functions in R, chisq.test() will both print summary output for a test and return an object. This object can be saved to a variable name that then shows up in our environment.

## Check this out in your environment
res <- chisq.test(m)

Although beyond the scope of this class, it's worth briefly noting that the class of these objects determines how many different functions work with them (resulting in the very helpful behavior where a function like summary() returns appropriate but different summaries for lm linear models and glm generalized linear models). We can see the class of an object with the class() function:

class(res)
## [1] "htest"

Underneath this class, however, is the object's underlying type. In R, nearly every object that isn't a vector is of type list, including data.frames, linear models, ggplots, and the output of statistical tests:

typeof(res)
## [1] "list"

The ubiquity of lists comes from their nearly infinite flexibility. Their primary utility for now comes from the fact that we can extract elements from a list by name. For example, we can investigate res in our environment pane or use the str() (structure) function to quickly see the components available:

str(res)
## List of 9
##  $ statistic: Named num 0.0127
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named int 1
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 0.91
##  $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
##  $ data.name: chr "m"
##  $ observed : num [1:2, 1:2] 40 50 60 70
##  $ expected : num [1:2, 1:2] 40.9 49.1 59.1 70.9
##  $ residuals: num [1:2, 1:2] -0.142 0.13 0.118 -0.108
##  $ stdres   : num [1:2, 1:2] -0.25 0.25 0.25 -0.25
##  - attr(*, "class")= chr "htest"

From this, we can extract important pieces, such as the matrix of expected counts we would get under independence:

## Expected counts from chi-square model under independence
res[["expected"]]
##        [,1]   [,2]
## [1,] 40.909 59.091
## [2,] 49.091 70.909
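
Equivalently, we can extract the same pieces with the $ operator (a small added illustration):

## Same extractions using $ instead of [[ ]]
res$expected
res$p.value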

This technique is frequently used in simulation settings where we may want to run sequences of tests or build models, extracting from them important summary information without having to store the entire object (which can be very memory intensive for large models).

We illustrate this with a toy example below, simulating 20,000 tables under the assumption of independence to show (1) that the distribution of p-values under the null hypothesis is uniform and (2) that the distribution of \(\chi^2\) statistics for 2x2 tables generated under the null hypothesis does indeed follow a \(\chi^2\) distribution with df = 1.

## Generate p-values and statistics from 20,000 null tables
n <- 2e4
p <- vector("numeric", length = n)
stat <- vector("numeric", length = n)

for (i in seq_len(n)) {
  X <- rbinom(1, 100, 0.75)
  Y <- rbinom(1, 100, 0.75)
  m <- matrix(c(X, Y, 100 - X, 100 - Y), nrow = 2)
  cc <- chisq.test(m, correct = FALSE)
  p[i] <- cc[["p.value"]]
  stat[i] <- cc[["statistic"]]
}


## p-values have uniform distribution under null
par(mfrow = c(1, 2)) # make base R plots side by side
hist(p, breaks = 8)

## Statistic follows chisq(df = 1) under null
hist(stat, probability = TRUE, xlim = c(0, 10))
f <- function(x) dchisq(x, df = 1)
curve(f, from = 0, to = 10, add = TRUE, col = 'red', lwd = 2)
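
As one additional numeric check (a sketch added here, not part of the original simulation), the empirical rejection rate at the 0.05 level should land near 0.05 if the p-values are truly uniform:

## Proportion of null tests rejected at alpha = 0.05;
## should be close to 0.05 when p is uniform on (0, 1)
mean(p < 0.05)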

Poisson Loglinear Models

Here, we introduce loglinear Poisson GLMs in the context of two-way contingency tables. As we have no continuous explanatory variables, our data are stored in the same manner as the binomial data from the previous lab, with one row for each combination of the row/column variables along with a column for the count:

heaven <- read.table("http://www.stat.ufl.edu/~aa/cat/data/HappyHeaven.dat", 
                     header = TRUE)
heaven
##    happy heaven count
## 1    not     no    32
## 2    not    yes   190
## 3 pretty     no   113
## 4 pretty    yes   611
## 5   very     no    51
## 6   very    yes   326

We can specify that we are fitting a Poisson model in our GLM by passing poisson to the family argument; by default, the link function used is the logarithm, but we can specify it further if we wish (or, as with linear probability models, we can modify our count model to be linear as well):

library(magrittr) # provides the %>% pipe (assumes the package is installed)

glm(count ~ heaven, family = poisson, data = heaven) %>% summary()
## 
## Call:
## glm(formula = count ~ heaven, family = poisson, data = heaven)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)   4.1795     0.0714    58.5 <0.0000000000000002 ***
## heavenyes     1.7492     0.0774    22.6 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1019.87  on 5  degrees of freedom
## Residual deviance:  295.76  on 4  degrees of freedom
## AIC: 340.4
## 
## Number of Fisher Scoring iterations: 4
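
To make the link specification concrete, here is a hedged sketch of the fuller syntax; the identity-link variant is the count analogue of the linear probability model (it is not fit in the original lab and may fail to converge for some datasets):

## Same model with the log link written out explicitly
glm(count ~ heaven, family = poisson(link = "log"), data = heaven)

## Identity-link ("linear") count model, analogous to the
## linear probability model; convergence is not guaranteed in general
glm(count ~ heaven, family = poisson(link = "identity"), data = heaven)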

Questions

Part A Create a GLM using the heaven dataset with happy as the single explanatory variable and save the model to a variable called m0. Considering the residual deviance:

  • What is the null hypothesis associated with the residual deviance?
  • Based on the residual deviance for this model, does it appear to be a good fit? Confirm your answer by finding the p-value associated with the deviance (using pchisq).

Part B Repeat Part A, this time including both happy and heaven in your model. Save this model as a variable called m1.

Part C Turn the heaven dataset first into a data.frame with 1323 rows (one for each observation) and then create a table of the variables happy and heaven. Conduct a \(\chi^2\) test of independence on this table. Does the p-value for this test match the p-value you found in Part B? What does this say about the relationship between the Poisson loglinear model and the \(\chi^2\) test of independence?

Part D Using the method from the beginning of the lab, identify the portions of m0 and m1 that contain the deviance. Extract these deviances and perform a likelihood ratio test on these two models. Confirm that it matches the output of comparing these models using anova(). Based on this, which model would you prefer?

Part E If the \(\chi^2\) test from Part C suggests that the two variables happy and heaven are independent, i.e., if the odds of believing in heaven are the same for each level of happiness, why is model m1 so much better than m0? In other words, why do I also need coefficients for the happy variable if I can already find the odds for heaven? (Hint: what specifically is the response variable for this model?)