Investigating Model Comparisons

Summary Functions

First, consider the snoring model that we saw last week:

library(dplyr)

heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat",
                    header = TRUE)

heart$x <- recode(heart$snoring, never = 0,
                  occasional = 2, nearly_every_night = 4,
                  every_night = 5)
n <- heart$yes + heart$no


fit <- glm(yes/n ~ x, family = binomial, weights = n, data = heart)

The summary() function in R provides output on both the model coefficients, the Wald staistics, and information on the deviance

summary(fit)
## 
## Call:
## glm(formula = yes/n ~ x, family = binomial, data = heart, weights = n)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   -3.866      0.166  -23.26 < 0.0000000000000002 ***
## x              0.397      0.050    7.95   0.0000000000000019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.9045  on 3  degrees of freedom
## Residual deviance:  2.8089  on 2  degrees of freedom
## AIC: 27.06
## 
## Number of Fisher Scoring iterations: 4

Here, null deviance refers to the empty model (where all coefficients but the intercept are set at \(\beta = 0\)), while the residual deviance represents the Deviance from the text, comparing the given model to the saturated model. In the case with a single categorical predictor, the saturated model is simply enough to find: it simply adds a coefficient for each category, thus given an estimate for each binomial random variable separately

fit2 <- glm(yes/n ~ snoring, family = binomial, weights = n, data = heart)
summary(fit2)
## 
## Call:
## glm(formula = yes/n ~ snoring, family = binomial, data = heart, 
##     weights = n)
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)                 -2.010      0.194  -10.34 < 0.0000000000000002 ***
## snoringnearly_every_night   -0.203      0.301   -0.67               0.5011    
## snoringnever                -2.023      0.283   -7.14     0.00000000000091 ***
## snoringoccasional           -0.836      0.261   -3.21               0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.90448103815968750  on 3  degrees of freedom
## Residual deviance: -0.00000000000012457  on 0  degrees of freedom
## AIC: 28.25
## 
## Number of Fisher Scoring iterations: 3

Notice that the null deviance is the same as, in both cases, it is simply refering to the model with no coefficients but the intercept. However, the residual deviance has dropped to 0, indicating that there is no difference between the present model and the saturated one.

Model Comparisons

In order to compare two GLMs, they need to be nested. That is, if we wish to compare model 1 to model 2, we need to make sure that all of the variables in model 2 are also included in model 1. In other words, model 1 needs to be a proper subset of model 2.

We can’t compare fit and fit2 above in this way because they are not nested. The model for fit contains only the variable x while the model for fit2 contains only the variablel for snoring. However, we can create a third model incorporating both of these variables and evaluate if adding the snoring variable offers any marginal benfit to model 1

## Model with snoring and x
fit3 <- glm(yes/n ~ snoring + x, family = binomial, weights = n, data = heart)

To compare two nested models directly, we can use the anova() function

anova(fit, fit3)
## Analysis of Deviance Table
## 
## Model 1: yes/n ~ x
## Model 2: yes/n ~ snoring + x
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1         2       2.81                     
## 2         0       0.00  2     2.81     0.25

This provides a collection of useful information: first, we see the deviances in each model, the difference in deviances, and finally the p-value for the \(\chi^2\) statistic associated with the hypothesis test \(\beta_{snoring} = 0\). Based on this, we fail to reject and conclude that model 3 provides no additional benefit in fit over model 1

Practice

Here is a data.frame representing the coffee/smoking/lung cancer data

## You may need to install this with install.packages("tidyr")
library(tidyr)

df <- expand.grid(coffee = c("yes", "no"),
                  cancer = c("yes", "no"),
                  smoking = c("yes", "no"))

df$Freq <- c(16,4,4,1,12,15,108,135)
df <- pivot_wider(df, names_from = cancer, values_from = Freq)
df$n <- df$yes + df$no

Question 0 Looking at the data.frame for coffee/smoking/cancer, how many binomial observations are present?

Question 1 Create a binomial GLM called model1 with cancer as the response variable and coffee as the predictor. Looking at the summary information, how does the null deviance compare to the residual deviance? What does that suggest about this model?

Question 2 Now create a model called model2 the includes only smoking as a predictor and again compare the null and residual deviance. What does this suggest about our model?

Question 3 Finally, create a model called model3 that includes both smoking and coffee as predictors. What happened to the coefficient for coffee? Use the anova() function to compare model3 to each of model1 and model2 and comment on the results. Are these consistent with what you found in Q1 and Q2?

Question 4 Why can’t we compare all three models at the same time with anova(model1, model2, model3)?