library(ggplot2)
library(dplyr)
Up until this point with regression, we have been interested in estimating the value of a numeric outcome (dependent variable) using a linear function of covariates,
\[ \hat{y} = X \hat{\beta}. \]
A consequence of our outcome being both continuous and numeric is that it can take on an infinite number of values. In the plot below, a valid value of \(\hat{y}\) could be any number on the blue line.
That is, given any value of height in the interval shown, we could estimate a unique value for weight.
This is in sharp contrast to a situation in which our outcome is categorical, characterized by membership in one group or another; examples include the outcome of an experiment being success or failure, the presence or absence of side effects in a clinical trial, or whether or not somebody decides to get a flu shot. Unlike a continuous outcome, categorical outcomes are often restricted to two values: yes or no.
When the dependent variable (outcome) of our model is binary, we typically code it as taking values of 1 and 0 (success and failure) with probabilities \(\pi\) and \(1 - \pi\), respectively. More explicitly, we say that \(y_i\) is a success with probability \(\pi_i\). Being a probability, valid values of \(\pi\) are those on the interval \([0, 1]\).
A term we will use frequently in our study of logistic regression is that of odds. Likely already familiar to some degree, odds are typically reported as a ratio giving the number of successes to the number of failures. For example, if I were to tell you that the odds of winning something were “3 to 1” or 3:1, this would mean that you should anticipate three successes for every one failure. This is distinct from expressing it as a probability, in which case we would present a ratio of an event to the total. Here, “3 to 1” odds indicates 4 total events (three successes, one failure), so our probability of success is 3/4 (three successes out of four trials) while the probability of failure is 1/4. From this, we find that odds are in fact a ratio of probabilities:
\[ \text{odds} = 3:1 \equiv \frac{3}{1} = \frac{3/4}{1/4} = \frac{\pi}{1-\pi} \]
We can also see from this, then, that the probability of success, \(\pi\), is \(\pi = 3/4 = 75\%\) while the probability of failure is \(1 - \pi = 25\%\).
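To make this conversion concrete, here is a small sketch in base R translating “3 to 1” odds into the corresponding probabilities:
## Convert "3 to 1" odds into the corresponding probabilities
odds <- 3 / 1
p_success <- odds / (1 + odds)  # 0.75, i.e., pi
p_failure <- 1 - p_success      # 0.25, i.e., 1 - pi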
Question 1: Answer the following questions about odds:
Rather than determining success or failure directly, the goal of logistic regression is to estimate the probability of success based on the values of the independent variables. For reasons just beyond the scope of this lab, we will simply note that the form of logistic regression seeks to estimate parameter values \(\beta\) based on the log odds of the probability:
\[ \log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + X_1 \beta_1 + X_2 \beta_2 + \dots + X_p\beta_p \]
Exponentiating both sides brings us back to the odds ratio from before:
\[\begin{align} \frac{\pi}{1-\pi} &= \exp \left( \beta_0 + X_1 \beta_1 + X_2 \beta_2 + \dots + X_p\beta_p \right) \\ &= \exp \left(\beta_0 \right) \times \exp \left(X_1 \beta_1 \right) \times \dots \times \exp \left(X_p \beta_p \right) \end{align}\]
Based on these expressions, we have the following interpretation: for any fitted value \(\hat{\beta}_j\), a unit increase in \(X_j\) results in a multiplication of the odds by a factor \(\exp (\hat{\beta}_j)\). Note further the parallel with the previous labs: when \(\hat{\beta} = 0\), we assumed that \(X\) was not linearly related to \(y\). Here, when \(\hat{\beta} = 0\), \(\exp(X \hat{\beta}) = \exp(0) = 1\), meaning that changes in \(X\) result in a multiplicative change in the odds of 1 (i.e., no change). As a corollary, negative values of \(\beta\) are associated with a decrease in the odds (multiplication by a number less than 1), while positive values of \(\beta\) are associated with an increase in the odds.
Another, equivalent way to describe \(\exp(\hat{\beta})\) is as an odds ratio. The justification for this is provided briefly in the following section.
The algebra in this section is not important in and of itself. Rather, for completeness we will demonstrate why \(\exp(\hat{\beta})\) is called an odds ratio and then provide a formula for solving directly for an estimated probability, given covariates.
Suppose that we are interested in considering the impact of \(\beta_1\), the coefficient associated with the covariate \(X_1\). From above (considering, for simplicity, a model with only \(X_1\)), we know that
\[ \frac{\pi}{1-\pi} = \exp \left(\beta_0 \right) \times \exp \left(X_1 \beta_1 \right) \]
Let \(\pi^*\) denote the probability of success for a person with all of the same values, except with \(X_1 + 1\) in place of \(X_1\), i.e., the only difference is a unit increase in \(X_1\). The corresponding odds would be
\[ \frac{\pi^*}{1-\pi^*} = \exp \left(\beta_0 \right) \times \exp \left((X_1+1) \beta_1 \right) \]
Separating out this last exponential term, we find
\[\begin{align} \frac{\pi^*}{1-\pi^*} &= \exp \left(\beta_0 \right) \times \exp \left((X_1+1) \beta_1 \right) \\ &= \exp \left(\beta_0 \right) \times \exp \left(X_1\beta_1 + \beta_1 \right) \\ &= \left( \exp \left(\beta_0 \right) \times \exp \left(X_1\beta_1\right) \right) \times \exp \left(\beta_1 \right) \\ &= \left( \frac{\pi}{1 - \pi} \right)\times \exp \left(\beta_1 \right) \end{align}\]
Dividing both sides by \(\pi / (1 - \pi)\), we finally get
\[ \exp(\beta_1) = \frac{\frac{\pi^*}{1-\pi^*}}{\frac{\pi}{1-\pi}} \]
which is the ratio of the odds for an observation with covariate value \(X_1 + 1\) to the odds for an otherwise identical observation with \(X_1\). Just as before, when \(\exp(\beta_1) > 1\), the odds increase, whereas \(\exp(\beta_1) < 1\) represents a decrease.
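As a quick numeric sanity check of this identity, we can plug in some made-up, purely illustrative coefficient values:
## Verify the odds-ratio identity with made-up coefficients
b0 <- -1; b1 <- 0.5; x1 <- 2
odds_x  <- exp(b0 + b1 * x1)        # odds at X1
odds_x1 <- exp(b0 + b1 * (x1 + 1))  # odds after a unit increase in X1
odds_x1 / odds_x                    # ratio of the two odds
exp(b1)                             # identical, as derived above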
Solving algebraically, we can determine the estimated probability of success from the independent variables and fitted coefficients,
\[ \hat{\pi}_i = \frac{\exp (X_i\hat{\beta})}{1 + \exp (X_i\hat{\beta})} \]
though we will rarely need to do this directly.
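For reference, here is a minimal sketch of that calculation; note that base R’s plogis() computes the same inverse-logit quantity:
## Inverse logit: convert log odds (the linear predictor) to a probability
invlogit <- function(eta) exp(eta) / (1 + exp(eta))
invlogit(0)  # log odds of 0 correspond to a probability of 0.5
plogis(0)    # base R equivalent (the logistic CDF)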
Let’s bring everything we have discussed to this point into a worked example. For this, we will be using the Donner dataset.
The Donner Party was a group of roughly 90 pioneers whose migration to California was delayed by a series of mishaps that ultimately led to the group being stranded in the Sierra Nevada mountains. While trapped in the mountains, a number of members of the party resorted to cannibalism. The goal of our analysis will be to use the variables in the dataset to estimate the probability of having died on this journey.
We start by loading the data and examining the variables present.
# Load data (borrowed from the alr4 package)
Donner <- read.csv("https://collinn.github.io/data/donner.csv")
head(Donner)
## age y sex family.name status
## 1 13 survived Male Breen Family
## 2 1 survived Female Breen Family
## 3 5 survived Male Breen Family
## 4 14 survived Male Breen Family
## 5 40 survived Female Breen Family
## 6 51 survived Male Breen Family
Here, we see each member’s age in 1846, their sex, family name, whether or not they survived, and a status variable indicating whether they were part of a family, were single, or were hired as part of the expedition. Although it is not strictly required for logistic regression in R, we will create a new variable “death” indicating with a 0 or 1 whether that particular member was deceased*. Then, we will construct our logistic regression model using sex and age as our independent variables.
*Note: You can use any factor or even a character vector as your outcome for logistic regression in R. The problem is then figuring out which value the model treated as “success”. By explicitly labeling them 0/1, we exercise more control.
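For example, if we passed a factor version of y directly to glm(), the first factor level would be treated as “failure” and all others as “success”; we can check the ordering with levels():
## With a factor outcome, glm() treats the first level as "failure"
levels(factor(Donner$y))  # "died" "survived", so "survived" would be the success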
To perform logistic regression, we use the function glm() (generalized linear model) with the argument family = "binomial" to indicate a binomial outcome.
## Recategorize outcome as 0/1 and remove NA values
dat <- mutate(Donner, death = ifelse(y == "died", 1, 0)) %>%
  na.omit()
fit_log <- glm(death ~ sex + age, data = dat, family = "binomial")
summary(fit_log)
##
## Call:
## glm(formula = death ~ sex + age, family = "binomial", data = dat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6218 0.5028 -3.23 0.0013 **
## sexMale 1.0680 0.4823 2.21 0.0268 *
## age 0.0356 0.0152 2.34 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 120.86 on 87 degrees of freedom
## Residual deviance: 108.87 on 85 degrees of freedom
## AIC: 114.9
##
## Number of Fisher Scoring iterations: 4
Just as before, we can use the summary() function to display the results (similarly, we can still use nearly all of the functions from the previous lab: coef(), residuals(), fitted(), anova(), etc.). Looking at these coefficients and recalling our method of interpretation from the previous section, we have:
Variable | \(\hat{\beta}\) | \(\exp(\hat{\beta})\) | Interpretation |
---|---|---|---|
Intercept | -1.6218 | 0.19754 | None (\(age = 0\) is not in our dataset) |
sex (Male) | 1.06798 | 2.90949 | Males had 2.91 times the odds of death compared to females |
Age | 0.03561 | 1.03625 | Odds of death were multiplied by 1.036 for each additional year of age |
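The middle column of this table can be reproduced directly by exponentiating the fitted coefficients:
## Exponentiate the fitted coefficients to obtain odds ratios
exp(coef(fit_log))
## (Intercept)     sexMale         age
##     0.19754     2.90949     1.03625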
Also like the numeric linear model, we can use the predict() function to get fitted values from new data points. In this case, we need to pass type = "response" to ensure that we are getting results in terms of probabilities rather than log odds.
## Create new data.frame with independent variables
newdf <- expand.grid(sex = c("Male", "Female"),
                     age = seq(min(dat$age), 90)) # Extended range for plot aesthetic
yhat <- predict(fit_log, newdf, type = "response")
## Add prediction values to "new" data
newdf$yhat <- yhat
## Plot original values along with prediction
ggplot(newdf, aes(age, yhat, color = sex)) + geom_line() +
  geom_point(data = dat, mapping = aes(age, death, color = sex), shape = 1, size = 4) +
  ylab("Probability of Death")
Looking at this plot, we see represented as circles the observed values of our dataset, marked as either a 0 or 1 according to whether or not the individual had died, while the solid lines indicate the probability of death as a continuous function of age.
Question 2: Refit a regression model for the Donner dataset, this time using lm() to construct a linear model. Additionally, include the covariate “status” from the data, a categorical variable indicating if the member was a part of a family, was a single traveler, or was hired as a guide by the group.
Examining the coefficients of this model, assume that we have a single male who is 35 years old.
For the remainder of this lab, we will continue our process of building logistic regression models, interpreting results, and investigating alternate metrics for determining the validity of our models.
Question 3: The code below loads game data from the Golden State Warriors’ 2015-2016 season. Read in the data below and consider win_binary to be our outcome of interest.
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
gsw$win_binary <- ifelse(gsw$Win == "W", 1, 0)
Part A: Fit a logistic regression model using the formula win_binary ~ FG3A, noting that “FG3A” is the number of three-point shots attempted in a game. Find the coefficients for this model, exponentiate them, and then interpret the coefficient associated with FG3A.
Part B: Now fit a logistic model using the formula win_binary ~ FG3A + FG3, where FG3 represents the number of three-point shots made. Again, find and exponentiate the coefficients and then interpret them. Why is the exponentiated coefficient for FG3A less than 1? Hint: Recall from previous labs that a coefficient, when interpreted in isolation, assumes that all other variables are held constant.
When our outcome was a numeric variable, calculating the error in our model was simple enough: we could take the observed value \(y\), subtract the predicted value \(\hat{y}\), and the magnitude of the difference gave us the magnitude of the error. In the categorical case, our situation is different: each outcome is marked as being either a 0 or a 1, but not both. Further, our model doesn’t directly classify the predicted outcome but rather gives us an estimated probability of membership in either group.
For example, consider these three people from the Donner party dataset:
(newdat <- dat[c(12, 37, 45), ])
## age y sex family.name status death
## 12 45 died Female J_Donner Family 1
## 37 57 died Male Graves Family 1
## 45 27 survived Male Hired Hired 0
We have a woman aged 45 and a man aged 57 who both died, as well as a man aged 27 who survived. Using the predict() function, we can find the estimated probabilities of death for each of them.
# Using the Donner Party data to fit the model
fit <- glm(death ~ sex + age, data = dat, family = "binomial")
# Probability of death for each
newdat$prob_death <- predict(fit, newdat, type = "response")
newdat
## age y sex family.name status death prob_death
## 12 45 died Female J_Donner Family 1 0.49511
## 37 57 died Male Graves Family 1 0.81392
## 45 27 survived Male Hired Hired 0 0.60049
That is, the 45-year-old woman had a probability of death of 49%, the 57-year-old man 81%, and the 27-year-old man a 60% probability of death. Using these probabilities, we need to predict the classification for each of these observations. The most obvious solution is to use a cutoff of 50%: those with a probability of death greater than 50% we will classify as having died, while those with less will be classified as having survived. Our accuracy, then, is defined as the total number correct divided by the total number of observations:
\[ \text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} I(y_i = \hat{y}_i) \]
# Classify greater than 0.5 as having died
newdat <- newdat %>%
  mutate(classify_dead = as.numeric(prob_death >= 0.5))
## Our accuracy for this subset
sum(newdat$death == newdat$classify_dead) / nrow(newdat)
## [1] 0.33333
Of the three values we predicted, only one was correct, resulting in an accuracy of 33%.
To get the accuracy for the whole dataset, we can do the same thing, this time using the fitted() function to get the fitted values (probabilities) from our model:
## Round the probabilities to 1 or 0 first
yhat <- ifelse(fitted(fit) > 0.5, 1, 0)
## Compute the proportion that are correct
# sum(dat$death == yhat) / nrow(dat)
## This does the same thing
mean(dat$death == yhat)
## [1] 0.67045
We see that our overall accuracy is around 67%.
Accuracy is often presented alongside a confusion matrix, a two-way table of the model’s predicted categories along with their true categories.
## Change 0/1 predictions to survived/died to match dat$y
yhat <- ifelse(yhat == 0, "survived", "died")
## Make table with dat$y which also contains survived/died
table(dat$y, yhat)
## yhat
## died survived
## died 21 18
## survived 11 38
Here, the rows represent the true values, while the columns are what was predicted. Taking marginal sums of this table gives us totals. For example, the row for “died” indicates that within the Donner party dataset, 21 + 18 = 39 people actually died. Of these, 21 were correctly predicted to have died while 18 were incorrectly predicted to have survived. Likewise, we see that 11 + 38 = 49 people survived; of these, 11 were incorrectly predicted to have died, while 38 were correctly predicted to have lived.
The following is also noteworthy: of the 39 people who died, we correctly predicted 21/39 = 54% of them. Of the people who survived, we correctly predicted 38/49 = 78%. Even though our overall model had an accuracy of 67%, we see that our accuracy within each of the outcomes is considerably different.
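These within-group accuracies can also be computed directly from the confusion matrix by dividing its diagonal by the row totals:
## Accuracy within each true outcome: diagonal over row sums
tt <- table(dat$y, yhat)
diag(tt) / rowSums(tt)
##     died survived
##  0.53846  0.77551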
It’s of critical importance to note here that whether or not we predicted an individual would live or die was based entirely on our classification method, i.e., the threshold that we set for classification. That is, we said that we would classify those with probabilities of death greater than 0.5 as having died. Changing this cutoff value would result in a new confusion matrix.
Question 4: Using the same logistic regression model as above (y ~ sex + age), change the threshold of probabilities for survival outcomes to be \(0.6\) instead of \(0.5\). How does this change the accuracy for our model? What about the accuracy for each of the survival groups? (Hint: Build a confusion matrix)
Repeat the experiment again, this time changing the threshold to \(0.4\): what is the accuracy of this model and for each of the survival groups?
Based on this, what considerations would you make if correctly predicting who died was more important to you than overall accuracy?
One of the things we should have seen in the previous section is that accuracy, as a single metric, fails to illustrate crucial aspects of our model. This results from the fact that there are multiple ways to be right as well as multiple ways to be wrong. We differentiate these possibilities with the following terms (using success = 1 and failure = 0): a true positive is a case in which we predicted success and the true outcome was a success, while a true negative is a case in which we predicted failure and the true outcome was a failure; a false positive occurs when we predict success but the truth is a failure, and a false negative occurs when we predict failure but the truth is a success.
These are further summarized in the following table:
  | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | True Positive | False Negative |
Actual 0 | False Positive | True Negative |
From these quantities, we derive two important metrics:
\[ \text{Sensitivity} = \frac{\# \text{True Positives}}{\# \text{True Positives} + \# \text{False Negatives}} \]
\[ \text{Specificity} = \frac{\# \text{True Negatives}}{\# \text{True Negatives} + \# \text{False Positives}} \]
In fact, we have already seen these values: sensitivity is the accuracy we previously calculated among those who died, while specificity is the accuracy among those who survived. As we also saw in Question 4, there is a state of tension between these two: efforts to improve sensitivity act to the detriment of specificity, and vice versa.
Question 5: To consider an interesting and extreme example of this, suppose there is some rare disease that infects 1 in 10,000 people and that can be treated with simple antibiotics if it is detected early, but otherwise results in a gruesome and horrifying death. A group of doctors are considering using one of two alternatives, Test A and Test B, to identify individuals at risk.
Test A, no matter what, will return a negative result, even if the person has the disease. It would have a confusion matrix that looks like this:
  | Tested Positive | Tested Negative |
---|---|---|
Has Disease | 0 | 1 |
Does not have disease | 0 | 9999 |
Test B, on the other hand, will always detect the disease, if it is present, but will give a false positive if the patient does not have the disease about 3% of the time. The confusion matrix for Test B would look like this:
  | Tested Positive | Tested Negative |
---|---|---|
Has Disease | 1 | 0 |
Does not have disease | 300 | 9699 |
Based on these tables, researchers determine that Test A has an accuracy of 99.99% while Test B only has an accuracy of 97%. As such, they decide to go with Test A.
Why is this incorrect? What metrics should they have considered instead?
There are two things up to this point that we should feel confident asserting: first, that accuracy alone is not sufficient for judging the quality of a classification model, and second, that sensitivity and specificity exist in tension with one another, mediated by the threshold we choose.
We have already seen above that sensitivity, being the number of true positives divided by the total number of actual positives, gives us our true positive rate. Our false positive rate can be found as being \(1 - \text{Specificity}\):
\[\begin{align} 1 - \text{Specificity} &= 1 - \frac{\# \text{True Negatives (TN)}}{\# \text{True Negatives (TN)} + \# \text{False Positives (FP)}} \\ \\ &= \frac{ \text{TN} + \text{FP}}{ \text{TN} + \text{FP}} - \frac{ \text{TN}}{ \text{TN} + \text{FP}} \\ \\ &= \frac{ \text{FP}}{ \text{TN} + \text{FP}} \end{align}\]
Using the false positive rate, or \(1 - \text{Specificity}\), is more natural than using specificity alone in that it makes the tension between these two metrics a bit more intuitive: if we lower our threshold for what we consider a success, we should anticipate that we will be more likely to correctly identify true successes (increasing our true positive rate), but also more likely to incorrectly flag those that were actually failures (increasing our false positive rate).
Going back to our Donner party data, let’s create a function that takes as an argument a particular threshold and returns both the true positive and the false positive rates:
metrics <- function(thresh = 0.5) {
  yhat <- ifelse(fitted(fit) > thresh, 1, 0)
  yhat <- ifelse(yhat == 0, "survived", "died")
  # Create factor so that table() won't drop empty levels
  yhat <- factor(yhat, levels = c("died", "survived"))
  tt <- table(dat$y, yhat)
  res <- c("Sensitivity" = tt[1, 1] / sum(tt[1, ]),
           `1 - Specificity` = 1 - tt[2, 2] / sum(tt[2, ]))
  res
}
metrics(0.5)
## Sensitivity 1 - Specificity
## 0.53846 0.22449
Using a threshold of 0.5, we get the same sensitivity and 1 - specificity values that we saw earlier. We can use this function to create a table for a variety of thresholds in increments of 0.1 (one way to generate such a table is sketched just after it):
Threshold | Sensitivity | 1 - Specificity |
---|---|---|
0.0 | 1.00000 | 1.00000 |
0.1 | 1.00000 | 1.00000 |
0.2 | 0.89744 | 0.87755 |
0.3 | 0.89744 | 0.59184 |
0.4 | 0.74359 | 0.46939 |
0.5 | 0.53846 | 0.22449 |
0.6 | 0.30769 | 0.16327 |
0.7 | 0.12821 | 0.04082 |
0.8 | 0.10256 | 0.00000 |
0.9 | 0.00000 | 0.00000 |
1.0 | 0.00000 | 0.00000 |
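Though the code that generated this table is not shown, one way to construct something similar is to apply our metrics() function over a sequence of thresholds:
## Evaluate metrics() at thresholds 0, 0.1, ..., 1
threshs <- seq(0, 1, by = 0.1)
cbind(Threshold = threshs, round(t(sapply(threshs, metrics)), 5))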
As the threshold decreases, we see that the true positive rate increases (a good thing) along with the false positive rate (a bad thing). This, then, is the tension between sensitivity and specificity.
We can plot the relationship between the true and false positive rates, annotating the line with the value of the threshold.
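The plotting code itself is hidden, but a rough sketch of how such a figure could be built by hand from metrics() and ggplot2 might look like this:
## Sketch of an ROC curve constructed by hand from metrics()
threshs <- seq(0, 1, by = 0.1)
roc_df <- setNames(data.frame(t(sapply(threshs, metrics))), c("tpr", "fpr"))
roc_df$thresh <- threshs
ggplot(roc_df, aes(fpr, tpr)) + geom_line() +
  geom_text(aes(label = thresh), vjust = -0.5, size = 3) +
  labs(x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)")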
This plot is known as a receiver operating characteristic (ROC) curve, the name resulting from the fact that it was created during WWII by electrical and radar engineers for detecting enemy combatants on the battlefield.
We can interpret this plot as follows: for a threshold value of \(t = 0.3\), we see that we have a true positive rate of about 90% and a false positive rate of about 60%. Decreasing our threshold to \(t = 0.2\) doesn’t change our true positive rate (still about 90%), while it drastically increases the false positive rate, all the way to nearly 90%, nearly the same as our true positive rate. This makes a threshold of 0.3 much better than a threshold of 0.2.
Rather than computing ROC values ourselves, we can use the plotROC package in R, which adds geom_roc() to the suite of ggplot functions. Note that this function uses the aesthetics d and m for “disease” and “marker”, respectively.
library(plotROC)
## Add fitted values (probabilities) to our data
dat$fitval <- fitted(fit)
ggplot(dat) + geom_roc(aes(d = death, m = fitval))
A nice aspect of the ggplot version of the ROC curve is that it allows you to color by groups, giving us the ability to compare ROC curves for the male and female populations:
ggplot(dat) + geom_roc(aes(d = death, m = fitval, color = sex))
Question 6: Based on the ROC plot by sex above, is it obvious whether our model performs better at prediction for males or for females? How would you decide which sex had the better predictions?
As we just saw in question 6, it can be difficult to simply look at two ROC curves and say that one is better than the other. We will start this section with a few observations.
First, we might say that the ideal case is one in which our true positive rate is 100% while our false positive rate is 0%. We will indicate this place on the graph with a blue dot.
The worst possible situation would be the converse: we have a 100% false positive rate with a 0% true positive rate. We will mark this spot in red. Together, these represent the two extremes.
The most average case comes from simple guessing, a situation in which our probability of correctly identifying a positive is the same as our probability of falsely identifying a positive. We call this situation no better than chance, and we will represent it with an orange dot.
In fact, there are many such instances in which the true positive rate is equal to the false positive rate, enough that our “no better than chance” scenario can be represented by a line.
This adds quite a bit of context to our interpretation of the ROC curve. Now we see that a threshold of 0.5 performs much better than chance, while a threshold of 0.2 does as well as random guessing.
The quality of a model, however, is not determined by any particular point on the ROC curve, but rather by the quality of the ROC curve as a whole. To this end, we need a metric that summarizes it. Such a metric is implied by the orange line above.
Note that the orange line bisects a square with sides of length 1. As such, the area under the curve (AUC) is equal to 0.5. Alternatively, consider again our best case scenario, where we achieve a true positive rate of 100% and a false positive rate of 0%. This would have an AUC of 1.0.
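Before turning to a package, we could roughly approximate the area under our own curve with the trapezoid rule, again reusing the metrics() function from earlier (a sketch, not the method used below):
## Approximate the AUC with the trapezoid rule over a fine grid of thresholds
threshs <- seq(0, 1, by = 0.01)
vals <- t(sapply(threshs, metrics))  # columns: sensitivity, 1 - specificity
ord  <- order(vals[, 2], vals[, 1])  # sort points by false positive rate
fpr  <- vals[ord, 2]; tpr <- vals[ord, 1]
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # roughly 0.7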
Our curve happens to be somewhere in between “best scenario” and “no better than chance”. While there are several packages that will compute the AUC of the ROC for us, we can use the calc_auc() function provided by the plotROC package. To use it, save the ggplot with the ROC to a variable and then compute the AUC.
library(plotROC)
roc <- ggplot(dat) + geom_roc(aes(d = death, m = fitval))
roc_sex <- ggplot(dat, aes(d = death, m = fitval, color = sex)) + geom_roc()
gridExtra::grid.arrange(roc, roc_sex, nrow = 1)
# AUC for overall ROC curve
calc_auc(roc)
## PANEL group AUC
## 1 1 -1 0.69911
# AUC for each sex's ROC curve
calc_auc(roc_sex)
## PANEL group sex AUC
## 1 1 1 Female 0.6080
## 2 1 2 Male 0.5819
In general, the AUC of an ROC curve can be used as a reliable metric for assessing the quality of prediction in a logistic regression model.
Question 7: Return to question 3 where we fit the two models predicting win probabilities, one with the variable FG3A and the other with both FG3A and FG3. Create ROC curves for each of these models and find the AUC.
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
gsw$win_binary <- ifelse(gsw$Win == "W", 1, 0)