Logistic Model

library(dplyr)
library(ggplot2)
library(pROC)
theme_set(theme_bw(base_size = 16))

crab <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", 
                   header = TRUE)

## Make vector of factor
crab$colorf <- factor(crab$color)

fit <- glm(y ~ width, family = binomial, data = crab)
summary(fit)
## 
## Call:
## glm(formula = y ~ width, family = binomial, data = crab)
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)  -12.351      2.629   -4.70 0.0000026 ***
## width          0.497      0.102    4.89 0.0000010 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 194.45  on 171  degrees of freedom
## AIC: 198.5
## 
## Number of Fisher Scoring iterations: 4
exp(coef(fit))
##  (Intercept)        width 
## 0.0000043262 1.6441615976

Confusion Matrix / Classification Table

## Make vector of factor
crab$colorf <- factor(crab$color)

fit <- glm(y ~ width + colorf, family = binomial, data = crab)

## Sample proportion crabs with sat
(prop <- sum(crab$y) / nrow(crab))
## [1] 0.64162
predval <- as.numeric(fitted(fit) > prop)

xtabs(~ crab$y + predval)
##       predval
## crab$y  0  1
##      0 43 19
##      1 36 75
## Try again with pi0 = 50
predval50 <- as.numeric(fitted(fit) > 0.5)
xtabs(~ crab$y + predval50)
##       predval50
## crab$y  0  1
##      0 31 31
##      1 15 96

In most cases, we could also just as well use table

table(crab$y, predval50)
##    predval50
##      0  1
##   0 31 31
##   1 15 96

ROC

rocplot <- roc(y ~ fitted(fit), data = crab)

with(rocplot, cbind(thresholds, specificities, sensitivities)) %>% head()
##      thresholds specificities sensitivities
## [1,]       -Inf      0.000000       1.00000
## [2,]   0.076877      0.016129       1.00000
## [3,]   0.123589      0.016129       0.99099
## [4,]   0.149141      0.032258       0.99099
## [5,]   0.158344      0.048387       0.99099
## [6,]   0.188201      0.080645       0.99099
with(rocplot, cbind(thresholds, specificities, sensitivities)) %>% tail()
##        thresholds specificities sensitivities
## [104,]    0.94295             1      0.045045
## [105,]    0.94844             1      0.036036
## [106,]    0.96101             1      0.027027
## [107,]    0.97254             1      0.018018
## [108,]    0.98061             1      0.009009
## [109,]        Inf             1      0.000000
plot.roc(rocplot, legacy.axes = TRUE)

auc(rocplot)
## Area under the curve: 0.771

We can also use the plotROC package which creates slightly nicer plots with estimates for \(\pi_0\) annotated on the chart. Note the aesthetics m for \(\hat{y}\) and d for \(y\).

library(plotROC)

## Put fitted values back in data.frame
crab$fitval <- fitted(fit)

ggplot(crab) + geom_roc(aes(d = y, m = fitval))

ggplot(crab, aes(color = colorf)) + geom_roc(aes(d = y, m = fitval)) +
  scale_color_brewer(palette = "Set1")

This package also has its own suite of additional functions. To use them, we must assign the ggplot to a new variable

m1 <- ggplot(crab) + geom_roc(aes(d = y, m = fitval))
m2 <- ggplot(crab, aes(color = colorf)) + geom_roc(aes(d = y, m = fitval))

calc_auc(m1)
##   PANEL group     AUC
## 1     1    -1 0.77136
calc_auc(m2) # AUC for each level of factor
##   PANEL group colorf     AUC
## 1     1     1      1 0.51852
## 2     1     2      2 0.71014
## 3     1     3      3 0.83654
## 4     1     4      4 0.59048

An inconvenient note: when using the plotROC package, there is a bug in calc_auc if you specify any aesthetics in geom_roc except for d and m. For example, this code would not work because colorf is specified in geom_roc instead of ggplot():

m2 <- ggplot(crab) + geom_roc(aes(d = y, m = fitval, color = colorf))
calc_auc(m2)
## Error in `rlang::quo_get_expr()`:
## ! `quo` must be a quosure

Exercises

The code below loads game data from the Golden State Warriors’ 2015-2016 season. The outcome of the game will be our response.

gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
gsw$win_binary <- ifelse(gsw$Win == "W", 1, 0) # response var

Part A Fit a logistic regression model using only FG3A as a predictor. This variable represents the number of three-point shots attempted in a game. Interpret the fitted coefficient

Part B Fit a logistic model again, this time using both FG3A (attempts) and FG3 where FG3 represents the number of attempts actually made. Examine the coefficients and explain why the sign of FG3A has changed. (Recall: coefficients are interpreted in isolation, assuming all other values are held constant)

Part C Create ROC curves associated with each of the models created.