First, consider the snoring model that we saw last week:
library(dplyr)
heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat",
                    header = TRUE)
heart$x <- recode(heart$snoring, never = 0, occasional = 2,
                  nearly_every_night = 4, every_night = 5)
n <- heart$yes + heart$no
fit <- glm(yes/n ~ x, family = binomial, weights = n, data = heart)
The summary() function in R provides output on the model coefficients, the Wald statistics, and the deviance:
summary(fit)
##
## Call:
## glm(formula = yes/n ~ x, family = binomial, data = heart, weights = n)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.866 0.166 -23.26 < 0.0000000000000002 ***
## x 0.397 0.050 7.95 0.0000000000000019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.9045 on 3 degrees of freedom
## Residual deviance: 2.8089 on 2 degrees of freedom
## AIC: 27.06
##
## Number of Fisher Scoring iterations: 4
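The residual deviance reported here can itself serve as a goodness-of-fit statistic: compared against a \(\chi^2\) distribution on the residual degrees of freedom, a large p-value indicates no evidence of lack of fit. A minimal sketch, with the counts rebuilt inline (taken from the Heart.dat file above) so the snippet runs without the course URL:

```r
# Same snoring counts as in Heart.dat above
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224)
)
heart$x <- c(0, 2, 4, 5)   # the scores assigned by recode() above
heart$n <- heart$yes + heart$no

fit <- glm(yes/n ~ x, family = binomial, weights = n, data = heart)

# Residual deviance referred to a chi-squared on the residual degrees
# of freedom; a large p-value means no evidence of lack of fit
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```

The p-value here is well above 0.05, consistent with the small residual deviance of 2.81 on 2 degrees of freedom.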
Here, the null deviance refers to the empty model (where all coefficients except the intercept are set to \(\beta = 0\)), while the residual deviance represents the deviance from the text, comparing the given model to the saturated model. In the case of a single categorical predictor, the saturated model is simple to find: it adds a coefficient for each category, thus giving an estimate for each binomial random variable separately:
fit2 <- glm(yes/n ~ snoring, family = binomial, weights = n, data = heart)
summary(fit2)
##
## Call:
## glm(formula = yes/n ~ snoring, family = binomial, data = heart,
## weights = n)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.010 0.194 -10.34 < 0.0000000000000002 ***
## snoringnearly_every_night -0.203 0.301 -0.67 0.5011
## snoringnever -2.023 0.283 -7.14 0.00000000000091 ***
## snoringoccasional -0.836 0.261 -3.21 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.90448103815968750 on 3 degrees of freedom
## Residual deviance: -0.00000000000012457 on 0 degrees of freedom
## AIC: 28.25
##
## Number of Fisher Scoring iterations: 3
Notice that the null deviance is the same in both cases: it simply refers to the model with no coefficients other than the intercept. However, the residual deviance has dropped to (numerically) zero, indicating that there is no difference between the present model and the saturated one.
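To see why the residual deviance is (numerically) zero, note that the saturated model's fitted probabilities reproduce the observed sample proportions exactly. A quick self-contained sketch with the counts rebuilt inline (from the Heart.dat file above):

```r
# Same snoring counts as in Heart.dat above
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224)
)
heart$n <- heart$yes + heart$no

fit2 <- glm(yes/n ~ snoring, family = binomial, weights = n, data = heart)

# Fitted probabilities match the observed proportions to machine
# precision: the model fits each binomial observation perfectly
cbind(observed = heart$yes / heart$n, fitted = fitted(fit2))
```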
In order to compare two GLMs, they need to be nested. That is, if we wish to compare model 1 to model 2, we need to make sure that all of the variables in model 1 are also included in model 2. In other words, the terms of model 1 need to be a proper subset of the terms of model 2.
We can’t compare fit and fit2 above in this way because they are not nested: the model for fit contains only the variable x, while the model for fit2 contains only the variable snoring.
However, we can create a third model incorporating both of these variables and evaluate whether adding the snoring variable offers any marginal benefit over model 1:
## Model with snoring and x
fit3 <- glm(yes/n ~ snoring + x, family = binomial, weights = n, data = heart)
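One thing worth noting: since x was defined as a fixed score for each snoring level, it is perfectly collinear with the snoring dummy variables, so R reports an NA coefficient for the aliased term and fit3 is effectively the saturated model. A self-contained sketch with the counts rebuilt inline (from the Heart.dat file above):

```r
# Same snoring counts as in Heart.dat above
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224)
)
heart$x <- c(0, 2, 4, 5)   # the scores assigned by recode() earlier
heart$n <- heart$yes + heart$no

fit3 <- glm(yes/n ~ snoring + x, family = binomial, weights = n, data = heart)

# x enters the model matrix after the snoring dummies, so it is the
# term that gets dropped: its coefficient is reported as NA
coef(fit3)
```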
To compare two nested models directly, we can use the anova() function:
anova(fit, fit3)
## Analysis of Deviance Table
##
## Model 1: yes/n ~ x
## Model 2: yes/n ~ snoring + x
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2 2.81
## 2 0 0.00 2 2.81 0.25
This provides a collection of useful information: the residual deviance of each model, the difference in deviances, and finally the p-value for the \(\chi^2\) statistic associated with the hypothesis test \(\beta_{snoring} = 0\). Based on this, we fail to reject and conclude that model 3 provides no additional benefit in fit over model 1.
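What anova() computes here can be reproduced by hand from the two residual deviances. A self-contained sketch (counts rebuilt inline from the Heart.dat file above):

```r
# Same snoring counts as in Heart.dat above
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224)
)
heart$x <- c(0, 2, 4, 5)
heart$n <- heart$yes + heart$no

fit  <- glm(yes/n ~ x, family = binomial, weights = n, data = heart)
fit3 <- glm(yes/n ~ snoring + x, family = binomial, weights = n, data = heart)

# Drop-in-deviance statistic and its degrees of freedom
dev_diff <- deviance(fit) - deviance(fit3)
df_diff  <- df.residual(fit) - df.residual(fit3)

# Upper-tail chi-squared p-value, as in the Pr(>Chi) column of anova()
pchisq(dev_diff, df_diff, lower.tail = FALSE)
```

Note that the difference in degrees of freedom is 2, not 3, because one of the extra snoring coefficients is aliased with x.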
Here is a data.frame representing the coffee/smoking/lung cancer data
## You may need to install this with install.packages("tidyr")
library(tidyr)
df <- expand.grid(coffee = c("yes", "no"),
                  cancer = c("yes", "no"),
                  smoking = c("yes", "no"))
df$Freq <- c(16, 4, 4, 1, 12, 15, 108, 135)
df <- pivot_wider(df, names_from = cancer, values_from = Freq)
df$n <- df$yes + df$no
Question 0 Looking at the data.frame for coffee/smoking/cancer, how many binomial observations are present?
Question 1 Create a binomial GLM called model1 with cancer as the response variable and coffee as the predictor. Looking at the summary information, how does the null deviance compare to the residual deviance? What does that suggest about this model?
Question 2 Now create a model called model2 that includes only smoking as a predictor and again compare the null and residual deviance. What does this suggest about our model?
Question 3 Finally, create a model called model3 that includes both smoking and coffee as predictors. What happened to the coefficient for coffee? Use the anova() function to compare model3 to each of model1 and model2 and comment on the results. Are these consistent with what you found in Q1 and Q2?
Question 4 Why can’t we compare all three models at the same time with anova(model1, model2, model3)?