Consider a fitted line on simulated data. We can rotate this plot so that the fitted line is nearly horizontal and compare this with what we would find on a residual plot.
Let’s begin by assuming that we have an outcome \(y\) with two covariates, \(X_1\) and \(X_2\), their relationship being
\[ y = X_1 \beta_1 + X_2 \beta_2 + \epsilon \] where \(\beta_1 = 1\) and \(\beta_2 = -1\).
# Simulating our data
set.seed(122)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- x1 - x2 + rnorm(100, sd = 0.1)
df <- data.frame(x1, x2, y)
We can visualize the relationship of these three variables together using plotly with type = "scatter3d".
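For example, a minimal sketch with the plotly package (assuming it is installed and loaded):
library(plotly)
# 3D scatterplot of the two covariates and the outcome
plot_ly(df, x = ~x1, y = ~x2, z = ~y, type = "scatter3d", mode = "markers")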
Using marginal plots, we can also visualize the univariate relationships between the outcome and a particular covariate. As anticipated, \(y\) appears to have a positive association with \(X_1\) and a negative association with \(X_2\).
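A sketch of these marginal plots with ggplot2 (the styling here is only illustrative):
library(ggplot2)
# Marginal relationship of y with each covariate
ggplot(df, aes(x1, y)) + geom_point()
ggplot(df, aes(x2, y)) + geom_point()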
Suppose we start by fitting a model with only a single covariate, \(X_1\). Note that
\[ y = X_1 \beta_1 \qquad \Leftrightarrow \qquad y = X_1 \beta_1 + X_2 \times 0 \]
In other words, we can fit a line relating \(X_1\) to \(y\) and plot it in 3D by setting the \(X_2\) coordinate to zero.
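One way to do this, sketched below with hypothetical object names (fit1, grid_x1), is to predict from lm(y ~ x1) over a grid of \(X_1\) values and draw the resulting line at the plane \(X_2 = 0\) (assuming plotly is loaded as above):
# Fit the single-covariate model and draw its line in the x2 = 0 plane
fit1 <- lm(y ~ x1, data = df)
grid_x1 <- seq(min(df$x1), max(df$x1), length.out = 50)
line_y <- predict(fit1, data.frame(x1 = grid_x1))
plot_ly(df, x = ~x1, y = ~x2, z = ~y, type = "scatter3d", mode = "markers") %>%
  add_trace(x = grid_x1, y = rep(0, 50), z = line_y,
            type = "scatter3d", mode = "lines", inherit = FALSE)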
Playing with the plotly graph, we see that the line demonstrates a positive association between \(X_1\) and our outcome. Before moving on to consider residuals, let’s first see if we can identify any direct association with \(X_2\) using just the plot we have.
Viewing the marginal plot of \(X_1\) and \(y\), we see that larger residuals (above the fitted line) are shaded darker, corresponding to smaller values of \(X_2\). Conversely, smaller residuals, those below the fitted line, are shaded lighter, corresponding to larger values of \(X_2\).
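A sketch of this kind of plot, shading the points of the \(X_1\) marginal plot by the value of \(X_2\) (with ggplot2's default continuous color scale, smaller values are darker):
# Marginal plot of x1 vs y, with points shaded by x2
ggplot(df, aes(x1, y)) +
  geom_point(aes(color = x2)) +
  geom_smooth(method = "lm", se = FALSE)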
We can see this same relationship if we consider the marginal plot of \(X_2\), this time coloring our points by the size of their residual:
This time we see a similar phenomenon where residuals are colored according to their magnitude. Rotating the plot and considering the relationship between \(X_2\) and the color of the points (rather than their position on the y-axis), we see a similar sort of linear relationship. Indeed, this is the same plot we would get if we simply plotted the values of \(X_2\) against the residuals of the model lm(y ~ x1).
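In other words, something along these lines should show the same pattern:
# Residuals from the model without x2, plotted against x2
fit1 <- lm(y ~ x1, data = df)
ggplot(df, aes(x2, resid(fit1))) + geom_point()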
Logistic regression is unique in that our outcomes are binary categories, while our model is built to predict the probability of belonging to one of those categories. As a result, there are two things that make interpreting logistic regression results less straightforward than linear regression results:
1. The coefficients act multiplicatively on the odds rather than additively on the outcome.
2. A one-unit change in a covariate does not correspond to a constant change in the probability of success.
We will examine each of these a bit more closely.
Logistic regression is performed in terms of odds, which are distinct from probabilities but directly related to them by the equation
\[ odds = \frac{\pi}{1 - \pi} \] where \(\pi\) is our probability of success, i.e., \(P(Y = 1) = \pi\). If our probability of success were 60%, we could compute the odds directly (using fractions for simplification):
\[ \frac{0.6}{1 - 0.6} = \frac{6/10}{1 - 6/10} = \frac{6/10}{4/10} = \frac{6}{4} \] Here, we say the odds are “six to four”, or “six successes for every four losses”. Of course, we could also start with odds and convert them to probability of success.
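As a quick illustration in R, here are two small helper functions (the names are ours, purely for illustration) for moving between the two scales:
# Convert a probability to odds and back
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds(0.6)    # 1.5, i.e., odds of 6 to 4
odds_to_prob(1.5)    # 0.6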
An interesting thing occurs when we multiply our odds by a constant. In the case above, suppose that we are interested in multiplying our odds by 2, so instead of being 6:4 or “six to four”, it is now 12:4 or “twelve to four”. Converting this back to a probability, we get
\[ \frac{12}{4} = \frac{12 / (12+4)}{4 / (12+4)} = \frac{12/16}{4/16} = \frac{0.75}{0.25} \] We see that doubling the odds changed the probability of success from 60% to 75%, an increase of 15 percentage points.
Now suppose we double the odds again, this time to be 24:4
\[ \frac{24}{4} = \frac{24 / (24+4)}{4 / (24+4)} = \frac{24/28}{4/28} = \frac{0.857}{0.143} \] This time, the probability of success went from 75% to about 86%, a change of 11 percentage points rather than the 15 we saw previously.
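We can check these numbers directly in R:
# Start at odds of 6:4 and double twice
odds <- (6/4) * 2^(0:2)   # 1.5, 3, 6
odds / (1 + odds)         # 0.600, 0.750, 0.857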
When we perform logistic regression, we are finding constant factors (our covariates) that act multiplicatively on our odds.
Logistic regression is modeled in terms of log odds, so that our covariates are written
\[ \log \left( \frac{\pi}{1-\pi} \right) = \beta_0 + X_1\beta_1 + X_2\beta_2 + \dots \] Taking the exponential on both sides will remove the log from our variable
\[ e^{\log \left( \frac{\pi}{1-\pi} \right)} = e^{\beta_0 + X_1\beta_1 + X_2\beta_2 + \dots} \] Additionally, because \(e^{x+y} = e^xe^y\), this reduces to (where \(\exp(x)\) denotes \(e^x\))
\[ \frac{\pi}{1-\pi} = \exp(\beta_0) \cdot \exp(X_1 \beta_1) \cdot \exp(X_2 \beta_2) \dots \] We now have our odds expressed in terms of the exponential of our covariates.
Now let’s suppose that we have a simplified model with only one covariate and no intercept so that our odds looks like this:
\[ \frac{\pi}{1-\pi} = \exp(X_1 \beta_1) \] What role does \(\beta_1\) play here?
When interpreting linear regression results, we treat the value of \(\beta_1\) as a measure of the change in the outcome associated with a unit change in \(X_1\). Using the fact that
\[ e^{(X_1 + 1) \beta_1} = e^{X_1 \beta_1 + \beta_1} = e^{X_1\beta_1} e^{\beta_1} \] We see that
\[ e^{(X_1 + 1) \beta_1} = e^{X_1\beta_1} e^{\beta_1} = \left(\frac{\pi}{1-\pi} \right) e^{\beta_1} \] It follows that \(e^{\beta_1}\) is the constant factor that dictates how our odds change in response to a unit change in \(X_1\). If \(e^{\beta_1} = 2\), then we are precisely in the situation above where we doubled our odds. What does this say about changes in \(X\)? Is the change in probability going from \(X\) to \(X+1\) the same as going from \(X+1\) to \(X+2\)?
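A quick numerical check of this identity, using an arbitrary \(\beta_1\) chosen so that \(e^{\beta_1} = 2\):
b1 <- log(2)   # so that exp(b1) = 2
x <- 1:5
# A one-unit increase in x always multiplies the odds by exp(b1) = 2
exp((x + 1) * b1) / exp(x * b1)   # all equal to 2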
For this example, we will use the problem from the homework, looking at credit card defaults with scaled balance and income.
library(dplyr)
library(ggplot2)

# Read in the default data and create a 0/1 outcome
Default <- read.csv("https://collinn.github.io/data/default.csv")
Default$default_num <- ifelse(Default$default == "Yes", 1, 0)

ggplot(Default, aes(balance, income, color = default)) + geom_point(alpha = 0.25)

# Standardize balance and income, then fit the logistic regression
Default <- mutate(Default,
                  balance.z = as.numeric(scale(balance)),
                  income.z = as.numeric(scale(income)))
fits <- glm(default_num ~ balance.z + income.z, Default, family = binomial)
First, let’s consider a plot showing the predicted probability of default as a function of the standardized credit card balance.
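A sketch of such a plot, holding income.z at its mean of zero (the exact styling of the original figure may differ):
# Predicted probability of default across standardized balance, with income.z = 0
nd <- data.frame(balance.z = seq(-2, 4, length.out = 200), income.z = 0)
nd$phat <- predict(fits, nd, type = "response")
ggplot(nd, aes(balance.z, phat)) +
  geom_line() +
  labs(x = "Standardized balance", y = "Predicted probability of default")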
How does the probability of default change if we increase our balance by one standard deviation?
Looking at a summary of the results, we see that both variables are highly significant, but the raw coefficients are otherwise difficult to interpret directly:
summary(fits)
##
## Call:
## glm(formula = default_num ~ balance.z + income.z, family = binomial,
## data = Default)
##
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)    -6.1256     0.1876  -32.66 < 0.0000000000000002 ***
## balance.z       2.7316     0.1100   24.84 < 0.0000000000000002 ***
## income.z        0.2775     0.0665    4.17              0.00003 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Taking the exponential of the parameters will give us the constants associated with each of the covariates, telling us their multiplicative impact on the odds of default.
exp(coef(fits))
## (Intercept) balance.z income.z
## 0.0021863 15.3572599 1.3198549
Here, we see that increasing the balance by one standard deviation multiplies the odds of default by about 15! By itself, though, this tells us nothing about the probability of default. To translate the odds into a probability, we can consider the inverse logit function.
The inverse logit function is derived by solving for \(\pi\) in the expression of the odds. Doing so yields
\[ \pi = \frac{\exp(X \beta)}{1 + \exp(X\beta)} \] We can write this in R as
invlogit <- function(x) { exp(x) / (1 + exp(x)) }
Since we are using exp() inside the function, we don’t need to exponentiate our coefficients first. Going back to our model, we have \(X\beta\) values of
coef(fits)
## (Intercept) balance.z income.z
## -6.12557 2.73159 0.27752
Consider an individual who has an average income and an average credit card balance (so that balance.z = 0 and income.z = 0). Their probability of default is computed as
invlogit(-6.1255 + 0 * 2.73159 + 0 * 0.27752)
## [1] 0.0021816
As we see, the probability of default is about 0.2%. If a person with average income were to increase their balance by one standard deviation ($483 in our dataset), their probability of default would be
invlogit(-6.1255 + 1 * 2.73159 + 0 * 0.27752)
## [1] 0.032486
Compare the difference between these two with the plot above.
As could also be anticipated from the plot, the change in probability going from a balance of 1 to 2 standard deviations, and again from 2 to 3, is even sharper.
invlogit(-6.1255 + 2 * 2.73159 + 0 * 0.27752)
## [1] 0.34022
invlogit(-6.1255 + 3 * 2.73159 + 0 * 0.27752)
## [1] 0.88788
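We could also get essentially the same values (up to the rounding of the coefficients above) with predict(), which applies the inverse logit for us when type = "response":
# Predicted probabilities at balance.z = 0, 1, 2, 3 with income.z fixed at 0
nd <- data.frame(balance.z = 0:3, income.z = 0)
predict(fits, nd, type = "response")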
library(dplyr)
library(plotly)

# Code for the 3D default probability surface
Default <- read.csv("https://collinn.github.io/data/default.csv")
Default$default_num <- ifelse(Default$default == "Yes", 1, 0)
Default <- mutate(Default,
                  balance.z = as.numeric(scale(balance)),
                  income.z = as.numeric(scale(income)))

# Subsample 5000 observations to keep the interactive plot responsive
Default <- Default[sample(1:nrow(Default), size = 5000), ]
fits <- glm(default_num ~ balance.z + income.z, Default, family = binomial)

# Grid of covariate values and the predicted probability surface over the grid
x1 <- seq(min(Default$balance.z), max(Default$balance.z), length.out = 100)
x2 <- seq(min(Default$income.z), max(Default$income.z), length.out = 100)
gg <- expand.grid(balance.z = x1, income.z = x2)
yhat <- predict(fits, gg, type = "response")
# Reshape predictions to a 100 x 100 matrix (rows index income.z, columns balance.z)
yhat <- matrix(yhat, nrow = 100, ncol = 100, byrow = TRUE)

# Observed points, fitted surface, and one slice of the surface drawn in red
plot_ly(data = Default, type = "scatter3d", mode = "markers",
        x = ~balance.z, y = ~income.z, z = ~default_num, color = ~factor(default_num)) %>%
  add_surface(x = x1, y = x2, z = yhat, colorscale = "Blues") %>%
  add_trace(x = x1, y = rep(x2[45], 100), z = yhat[45, ], mode = "lines",
            line = list(color = "red", width = 8), inherit = FALSE) %>%
  layout(scene = list(aspectmode = "manual",
                      aspectratio = list(x = 1, y = 1, z = 1)),
         title = "Inverse Logistic when income.z = 0")
We summarize with a few points: