knitr::opts_chunk$set(echo = TRUE,
fig.align = 'center',
fig.width = 6,
fig.height = 4,
message = FALSE, warning = FALSE)
library(ggplot2)
library(dplyr)
theme_set(theme_bw(base_size = 18))
This lab will serve as our introduction to using the glm() function, along with some of the basic considerations necessary to use it with binary data. We will conclude with a demonstration of using predict().
Typically when we have interacted with data in R, we have done so under a few assumptions, chiefly that each row of our data represents a single observation. We may summarize this data in different fashions using dplyr or table(), but rarely do we use these summaries as the data we model directly. Given the ubiquity of tables when considering binary data, however, the glm() function is prepared to accept both.
Consider the Heart dataset from the text, which evaluates the relationship between snoring and heart disease:
heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat",
header = TRUE)
heart
## snoring yes no
## 1 never 24 1355
## 2 occasional 35 603
## 3 nearly_every_night 21 192
## 4 every_night 30 224
It contains a table of two variables: the amount of snoring reported and the number of individuals within each group who developed heart disease.
We have not yet addressed ordinal variables, though this is a situation in which a natural ordering is present. The text recodes the snoring variable to numerical scores (0, 2, 4, 5) according to the following scheme:
heart$x <- recode(heart$snoring, never = 0,
occasional = 2, nearly_every_night = 4,
every_night = 5)
heart
## snoring yes no x
## 1 never 24 1355 0
## 2 occasional 35 603 2
## 3 nearly_every_night 21 192 4
## 4 every_night 30 224 5
This provides a sense of scale: never is distinct from occasional, which is distinct from nearly every night, while nearly every night and every night are treated as relatively close to each other (indicated by the unit gap between their scores). We could just as well have scored them from 0 to 3 or used any other scale, a decision that typically depends on the context of the problem and the judgment of the investigators.
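For instance, a minimal sketch of an equally spaced 0 to 3 scoring (shown only for illustration and stored in a separate object, x_alt, so it is not used in the rest of the lab) could be built with the same recode() call:
## Alternative, equally spaced scoring (illustration only)
x_alt <- recode(heart$snoring, never = 0, occasional = 1,
                nearly_every_night = 2, every_night = 3)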
In addition to tables, the glm() function also takes a standard data.frame where each row makes up an observation. Here is the same data, already provided with a numerical scale:
heart2 <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart2.dat",
header = TRUE)
head(heart2)
## subject snoring x y
## 1 1 never 0 1
## 2 2 never 0 1
## 3 3 never 0 1
## 4 4 never 0 1
## 5 5 never 0 1
## 6 6 never 0 1
We see this is equivalent to the table we already started with:
with(heart2, table(x,y))
## y
## x 0 1
## 0 1355 24
## 2 603 35
## 4 192 21
## 5 224 30
glm
In many ways, the glm function works the same as lm() in that it takes a formula relating our outcome to our explanatory variables, and it takes an argument for our data. New to glm is the family argument, which specifies the random component of a generalized linear model. Note that the random component often implies the link function as well (e.g., the identity link for the normal distribution and the logit link for the binomial).
## Using the unsummarized data first
fit <- glm(y ~ x, family = binomial, data = heart2)
summary(fit)
##
## Call:
## glm(formula = y ~ x, family = binomial, data = heart2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.866 0.166 -23.26 < 0.0000000000000002 ***
## x 0.397 0.050 7.95 0.0000000000000019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.83 on 2483 degrees of freedom
## Residual deviance: 837.73 on 2482 degrees of freedom
## AIC: 841.7
##
## Number of Fisher Scoring iterations: 6
If for any reason we wanted to change the link function for a given family, we can do so directly (though note it is unnecessary here, since link = logit is the default for the binomial):
fit <- glm(y ~ x, family = binomial(link = logit), data = heart2)
You can see a number of the random components immediately available in R by investigating ?family.
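As an aside (not required for this lab), the binomial family also accepts other links, such as the probit. A minimal sketch, fitting the same model with a probit link and storing it in a hypothetical object fit_probit:
## Same random component, different link (illustration only)
fit_probit <- glm(y ~ x, family = binomial(link = "probit"), data = heart2)
coef(fit_probit)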
When using tabular data with the glm function, extra consideration is needed. In this case, glm() is expecting to have as the response the observed probability for each group. That is, instead of observing whether each individual in the "never snores" group has heart disease or not, it asks what proportion of the never snores group has heart disease. We can find this directly by dividing the total number of "yes" values by the total in each group:
## Total for each group
n <- heart$yes + heart$no
n
## [1] 1379 638 213 254
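Dividing then gives the observed proportion of heart disease in each snoring group (a quick check of my own; these proportions are exactly what the tabular call below models):
## Observed proportion of heart disease in each group
heart$yes / n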
In addition to giving the probability for each group, we also need to specify the weights argument. This corresponds to the number of observations in each group:
## Using tabular form
fit2 <- glm(yes / n ~ x, family = binomial, weights = n, data = heart)
We will concern ourselves with inference later, but for now it is sufficient to see that these produce the same estimates based on the data:
coef(fit) # long data
## (Intercept) x
## -3.86625 0.39734
coef(fit2) # tabular
## (Intercept) x
## -3.86625 0.39734
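As a further aside (not shown in the text, but a documented alternative in R), glm() will also accept a two-column matrix of successes and failures as the response, which avoids computing the proportions and weights by hand; a sketch using a hypothetical object fit3:
## Equivalent tabular fit using a (successes, failures) response
fit3 <- glm(cbind(yes, no) ~ x, family = binomial, data = heart)
coef(fit3)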
The text notes that, by default, R does not provide an identity link function for the binomial random component (i.e., binomial(link = "identity") will get us nowhere), but we can work around this using the quasi() family function to build it directly:
## Fitting long data with identity link and binomial random component
fit_linear <- glm(y ~ x, data = heart2,
family = quasi(link = "identity", variance = "mu(1-mu)"))
summary(fit_linear)
##
## Call:
## glm(formula = y ~ x, family = quasi(link = "identity", variance = "mu(1-mu)"),
## data = heart2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01725 0.00345 5.00 0.0000006280535 ***
## x 0.01978 0.00281 7.05 0.0000000000023 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasi family taken to be 1.0007)
##
## Null deviance: 900.83 on 2483 degrees of freedom
## Residual deviance: 834.99 on 2482 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
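As a quick sanity check of my own (not part of the text), the intercept of this identity-link fit should be close to the observed proportion of heart disease among those who never snore (x = 0):
## Compare the fitted intercept to the observed proportion at x = 0
coef(fit_linear)[1]   # fitted P(heart disease) when x = 0
heart$yes[1] / n[1]   # observed proportion in the "never" group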
predict
The predict() function provides predicted response values for a set of explanatory variables, based on a model. predict() is a generic function in R, meaning its behavior depends on the type of object that is passed in. In other words, predict() will function differently for an lm object than it will for a glm.
Question 0: Investigate ?predict and ?predict.glm to see how they differ. There is nothing you need to record.
For a GLM object, the first argument we need is the object argument, which takes a fitted glm. Second is newdata, taking a data.frame that must have as its named columns all of the explanatory variables included in the object model. Finally, we need to consider the type argument which, depending on its value, returns predictions either on the scale of the link function or on the scale of the response. That is, when type = "link", the predict() function will return the predicted values of
\[
\log \left( \frac{P(Y = 1)}{1 - P(Y = 1)} \right)
\]
Alternatively, when type = "response", the predicted values will be in terms of the predicted probabilities.
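For a concrete sketch using the logistic fit from earlier (my own illustration; newsnore is a hypothetical data.frame of snoring scores), we can compare the two scales directly:
## Predicted log-odds vs. predicted probabilities for each snoring score
newsnore <- data.frame(x = c(0, 2, 4, 5))
predict(fit, newsnore, type = "link")      # log-odds scale
predict(fit, newsnore, type = "response")  # probability scale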
We can see an example of how this works in a simple linear model case predicting miles per gallon in vehicles based on vehicle weight and quarter mile time:
## built-in R dataset
data(mtcars)
model <- glm(mpg ~ wt + qsec, family = quasi(link = "identity"), data = mtcars)
## Uncomment and run this to confirm the coefficients are the same
# m <- lm(mpg ~ wt + qsec, data = mtcars)
If we wanted to predict the miles per gallon of several different cars based on weight and quarter mile time, we could do so by first creating a new data.frame with these variables and the values we want:
newdat <- data.frame(wt = c(3, 4),
qsec = c(16.5, 20))
predict(model, newdat, type = "response")
## 1 2
## 19.934 18.138
Since we are using the identity link here, this is the same as if we had used type = "link" as well:
predict(model, newdat, type = "link")
## 1 2
## 19.934 18.138
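As a small sanity check of my own (not in the original lab), the identity-link prediction is just the linear combination of the fitted coefficients with the new values:
## Manual prediction for the first new car (wt = 3, qsec = 16.5)
sum(coef(model) * c(1, 3, 16.5))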
Question 1: Recreate the plot from the textbook demonstrating the comparison between the logistic and linear predictions for heart disease based on snoring status.
Note: Here, I have created the ggplot and assigned it to the variable name p.
p
I can add a new layer to this with data from a different data.frame by passing in an additional data argument. I will have to override the color aesthetic, though, since I used it in aes() in my original plot:
## Add data summary points
heart$prop <- heart$yes / (heart$yes + heart$no) # proportion for each group
p + geom_point(data = heart, aes(x = x, y = prop), color = "black", size = 3)