knitr::opts_chunk$set(echo = TRUE, 
                      fig.align = 'center', 
                      fig.width = 6, 
                      fig.height = 4, 
                      message = FALSE, warning = FALSE)
library(ggplot2)
library(dplyr)

theme_set(theme_bw(base_size = 18))

Introduction

This lab will serve as our introduction to using the glm() function, along with some of the basic considerations necessary to use it with binary data. We will conclude with a demonstration of using predict().

Data Formats

Typically when we have interacted with data in R, we have done so according to a few assumptions, namely that each row represents a single observation and each column represents a variable.

We may summarize this data in different fashions using dplyr or table(), but rarely do we use these summaries in the same way we use the raw data. Given the ubiquity of tables when considering binary data, however, the glm() function is prepared to accept both formats.
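
For instance, table() collapses ungrouped binary data into counts, one cell per combination of group and outcome. A minimal sketch with hypothetical data (the variable names here are made up for illustration):

```r
## Hypothetical ungrouped binary data: one row per subject
d <- data.frame(group = c("a", "a", "a", "b", "b"),
                y     = c(1, 0, 0, 1, 1))
table(d$group, d$y)  # cross-tabulated counts: rows are groups, columns are y
```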

Consider the Heart dataset from the text, which evaluates the relationship between snoring and heart disease:

heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat", 
                    header = TRUE)
heart
##              snoring yes   no
## 1              never  24 1355
## 2         occasional  35  603
## 3 nearly_every_night  21  192
## 4        every_night  30  224

It contains a table relating the amount of snoring reported to the number of individuals within each group who did and did not develop heart disease.

We have not yet addressed ordinal variables, though this is a situation in which a natural ordering is present. The text recodes the snoring variable to values 0-5 according to the following scheme:

heart$x <- recode(heart$snoring, never = 0, 
                  occasional = 2, nearly_every_night = 4,
                  every_night = 5)
heart
##              snoring yes   no x
## 1              never  24 1355 0
## 2         occasional  35  603 2
## 3 nearly_every_night  21  192 4
## 4        every_night  30  224 5

This provides a sense of scale: never is distinct from occasional, which is distinct from nearly every night, while nearly every night and every night are scored relatively close to each other (indicated by the unit gap between their numerical values). We could just as well have scaled them from 0 to 3 or used any other scale, a decision that typically depends on the context of the problem and the judgment of the investigators.
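
Any monotone scoring preserves the ordering. For example, an equally spaced 0 to 3 scale could be assigned with a base R named-vector lookup (a sketch: the data is rebuilt inline so it runs standalone, and x_alt is a hypothetical column name):

```r
## Rebuild the grouped Heart data shown above
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224),
  x   = c(0, 2, 4, 5)
)

## Equally spaced alternative scores via named-vector lookup
scores <- c(never = 0, occasional = 1, nearly_every_night = 2, every_night = 3)
heart$x_alt <- unname(scores[heart$snoring])
heart$x_alt
```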

In addition to tables, the glm() function also takes a standard data.frame where each row makes up an observation. Here is the same data, already provided with a numerical scale:

heart2 <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart2.dat", 
                    header = TRUE)

head(heart2)
##   subject snoring x y
## 1       1   never 0 1
## 2       2   never 0 1
## 3       3   never 0 1
## 4       4   never 0 1
## 5       5   never 0 1
## 6       6   never 0 1

We see this is equivalent to the table we already started with:

with(heart2, table(x,y))
##    y
## x      0    1
##   0 1355   24
##   2  603   35
##   4  192   21
##   5  224   30
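
Going the other direction, long-format data can be collapsed to grouped counts with dplyr. A minimal sketch with toy data standing in for heart2 (the values here are made up):

```r
library(dplyr)

## Toy long-format binary data: x is the score, y the 0/1 outcome
long <- data.frame(x = c(0, 0, 0, 2, 2),
                   y = c(1, 0, 0, 1, 0))

grouped <- long %>%
  group_by(x) %>%
  summarize(yes = sum(y), no = sum(y == 0))
grouped
```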

glm

In many ways, the glm() function works the same as lm(): it takes a formula relating our outcome to our explanatory variables, and it takes an argument for our data. New to glm() is the family argument, which specifies the random component of a generalized linear model. Note that the random component often implies the link function as well (i.e., identity for the normal distribution, logit for the binomial).

## Using the unsummarized data first
fit <- glm(y ~ x, family = binomial, data = heart2)

summary(fit)
## 
## Call:
## glm(formula = y ~ x, family = binomial, data = heart2)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   -3.866      0.166  -23.26 < 0.0000000000000002 ***
## x              0.397      0.050    7.95   0.0000000000000019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 900.83  on 2483  degrees of freedom
## Residual deviance: 837.73  on 2482  degrees of freedom
## AIC: 841.7
## 
## Number of Fisher Scoring iterations: 6
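
To interpret these estimates on the probability scale, we can apply the inverse logit by hand. For example, the fitted probability of heart disease for occasional snorers (x = 2), using the estimates above:

```r
## plogis() is the inverse logit: exp(z) / (1 + exp(z))
b0 <- -3.86625
b1 <- 0.39734
plogis(b0 + b1 * 2)  # fitted P(Y = 1) at x = 2, roughly 0.044
```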

If for any reason we wanted to change the link function for a given family, we can do so directly (though note it is unnecessary here, since link = logit is the default for the binomial):

fit <- glm(y ~ x, family = binomial(link = logit), data = heart2)

You can see a number of the random components immediately available in R by investigating ?family.


When using tabular data with the glm() function, extra consideration is needed. In this case, glm() expects the response to be the observed proportion for each group. That is, instead of observing whether each individual in the “never snores” group has heart disease or not, it asks what proportion of that group has heart disease. We can find this directly by dividing the number of “yes” values by the total in each group:

## Total for each group
n <- heart$yes + heart$no
n
## [1] 1379  638  213  254

In addition to giving the proportion for each group as the response, we also need to specify the weights argument, which corresponds to the number of observations in each group:

## Using tabular form
fit2 <- glm(yes / n ~ x, family = binomial, weights = n, data = heart)

We will concern ourselves with inference later, but for now it is sufficient to see that these produce the same estimates based on the data:

coef(fit) # long data
## (Intercept)           x 
##    -3.86625     0.39734
coef(fit2) # tabular
## (Intercept)           x 
##    -3.86625     0.39734
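
A third, equivalent way to supply grouped binomial data to glm() is a two-column matrix of successes and failures as the response, which removes the need for the weights argument (a sketch, with the grouped counts rebuilt inline so it runs standalone):

```r
heart <- data.frame(
  snoring = c("never", "occasional", "nearly_every_night", "every_night"),
  yes = c(24, 35, 21, 30),
  no  = c(1355, 603, 192, 224),
  x   = c(0, 2, 4, 5)
)

## Response is cbind(successes, failures); no weights argument needed
fit3 <- glm(cbind(yes, no) ~ x, family = binomial, data = heart)
coef(fit3)  # matches the estimates from the proportion/weights fit
```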

The text notes that, by default, R does not provide an identity link function for the binomial random component (i.e., binomial(link = "identity") will get us nowhere), but we can work around this by using the quasi() family function to build it directly:

## Fitting long data with identity link and binomial random component
fit_linear <- glm(y ~ x, data = heart2, 
                  family = quasi(link = "identity", variance = "mu(1-mu)"))
summary(fit_linear)
## 
## Call:
## glm(formula = y ~ x, family = quasi(link = "identity", variance = "mu(1-mu)"), 
##     data = heart2)
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  0.01725    0.00345    5.00 0.0000006280535 ***
## x            0.01978    0.00281    7.05 0.0000000000023 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 1.0007)
## 
##     Null deviance: 900.83  on 2483  degrees of freedom
## Residual deviance: 834.99  on 2482  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3
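
Because the link is the identity, these coefficients sit directly on the probability scale: each unit of x adds roughly 0.02 to the estimated probability. Using the estimates above:

```r
b0 <- 0.01725
b1 <- 0.01978
b0 + b1 * c(0, 2, 4, 5)  # fitted P(Y = 1) at each snoring score
```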

predict

The predict() function provides predicted response values for a set of explanatory variables, based on a fitted model. predict() is a generic function in R, meaning its behavior depends on the class of the object passed in. In other words, predict() works differently for an lm object than it does for a glm object.

Question 0: Investigate ?predict and ?predict.glm to see how they differ. There is nothing you need to record.

For a glm object, the first argument we need is object, which takes a fitted glm. Second is newdata, which takes a data.frame that must contain, as named columns, all of the explanatory variables included in the fitted model. Finally, we need to consider the type argument which, depending on its value, will return predictions either on the scale of the link function or on the scale of the response. That is, when type = "link", the predict() function will return the predicted values of

\[ \log \left( \frac{P(Y = 1)}{1 - P(Y = 1)} \right) \]

Alternatively, when type = "response", the predicted values will be in terms of the predicted probabilities.
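
The two scales are related by the inverse logit, which we can verify with predict() on a small simulated fit (illustrative data only, not the Heart model):

```r
set.seed(1)
## Simulated binary data with a known positive effect of x
sim <- data.frame(x = rnorm(100))
sim$y <- rbinom(100, 1, plogis(-1 + 2 * sim$x))
m <- glm(y ~ x, family = binomial, data = sim)

nd  <- data.frame(x = c(-1, 0, 1))
eta <- predict(m, nd, type = "link")      # log-odds scale
p   <- predict(m, nd, type = "response")  # probability scale
all.equal(unname(p), unname(plogis(eta))) # response is the inverse logit of link
```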

We can see an example of how this works in a simple linear model case predicting miles per gallon in vehicles based on vehicle weight and quarter mile time:

##  built-in R dataset
data(mtcars)

model <- glm(mpg ~ wt + qsec, family = quasi(link = "identity"), data = mtcars)

## Uncomment and run this to confirm the coefficients are the same
# m <- lm(mpg ~ wt + qsec, data = mtcars)

If we wanted to predict the miles per gallon of several different cars based on weight and quarter mile time, we could do so by first creating a new data.frame with these variables and the values we want:

newdat <- data.frame(wt = c(3, 4), 
                     qsec = c(16.5, 20))

predict(model, newdat, type = "response")
##      1      2 
## 19.934 18.138

Since we are using the identity link here, this is the same result we would get with type = "link":

predict(model, newdat, type = "link")
##      1      2 
## 19.934 18.138

Question 1: Recreate the plot from the textbook demonstrating the comparison between the logistic and linear predictions for heart disease based on snoring status.

Note: Here, I have created the ggplot and assigned it to the variable name p.

p

I can add a new layer to this with data from a different data.frame by passing in an additional data argument. I will have to override the color aesthetic, though, since I used it in aes() in my original plot:

## Add data summary points
heart$prop <- heart$yes / (heart$yes + heart$no) # proportion for each group
p + geom_point(data = heart, aes(x = x, y = prop), color = "black", size = 3)