Introduction

Start by installing the MASS package with install.packages("MASS"), then load the libraries used throughout:

library(ggplot2)
library(MASS)
library(xtable)
library(dplyr)
library(AER)
library(COUNT)

theme_set(theme_bw(base_size = 18))

crab <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", 
                     header = TRUE)

Poisson Regression

fit_log <- glm(sat ~ width, family = poisson, data = crab)
summary(fit_log)
## 
## Call:
## glm(formula = sat ~ width, family = poisson, data = crab)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   -3.305      0.542   -6.09         0.0000000011 ***
## width          0.164      0.020    8.22 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.2
## 
## Number of Fisher Scoring iterations: 6

With the log link, the fitted model is

\[ \log \hat{\mu} = \hat{\alpha} + \hat{\beta} \times \text{width} \]

or, equivalently, on the scale of the mean,

\[ \hat{\mu} = e^{\hat{\alpha}} \left( e^{\hat{\beta}} \right)^{\text{width}} \]
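
Since the model is multiplicative on the scale of the mean, exponentiating the coefficients gives the factor by which the expected count changes with each additional unit of width:

## Each one-unit increase in width multiplies the expected
## satellite count by exp(beta)
exp(coef(fit_log))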

Typically, we could also fit a Poisson model with an identity link in the following way, but for this particular dataset the fit fails to converge:

## This fails to converge
fit_lm <- glm(sat ~ width, family = quasi(link = identity, variance = "mu"), data = crab)
## Error: no valid set of coefficients has been found: please supply starting values

We can work around this by first fitting a standard linear model and then using its coefficients as starting values for the maximum likelihood algorithm:

## Get reasonable starting parameters (but with normal error)
fit_lm <- lm(sat ~ width,  data = crab)

## Use these to fit poisson glm with identity
fit_identity <- glm(sat ~ width, family = quasi(link = identity, variance = "mu"), data = crab, 
             start = coef(fit_lm))
summary(fit_identity)
## 
## Call:
## glm(formula = sat ~ width, family = quasi(link = identity, variance = "mu"), 
##     data = crab, start = coef(fit_lm))
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -11.5256     1.2060   -9.56 <0.0000000000000002 ***
## width         0.5493     0.0528   10.40 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 3.1721)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 557.71  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 22
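
Two things are worth noting in this output: AIC is reported as NA because quasi families do not define a full likelihood, and the dispersion parameter estimate of 3.17 is well above the value of 1 assumed by a plain Poisson model, a first hint of the overdispersion we address next.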

Just as before, we can use the predict() function to visualize each of these fits:

y_log <- predict(fit_log, newdata = crab, type = "response")
y_id <- predict(fit_identity, newdata = crab, type = "response")

df <- data.frame(width = crab$width, 
                 y = c(y_log, y_id), 
                 model = rep(c("log", "id"), each = nrow(crab)))

ggplot(df, aes(width, y, color = model)) + 
  geom_point(data = crab, aes(width, y = sat), color = "black", alpha = 0.5) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("#B00000", "#0072B2"))

Negative Binomial

The easiest way to fit a negative binomial is with the glm.nb() function provided in the MASS package:

fit_nb <- glm.nb(sat ~ width, data = crab)
summary(fit_nb)
## 
## Call:
## glm.nb(formula = sat ~ width, data = crab, init.theta = 0.90456808, 
##     link = log)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.0525     1.1714   -3.46  0.00054 ***
## width         0.1921     0.0441    4.36 0.000013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
## 
##     Null deviance: 213.05  on 172  degrees of freedom
## Residual deviance: 195.81  on 171  degrees of freedom
## AIC: 757.3
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.905 
##           Std. Err.:  0.161 
## 
##  2 x log-likelihood:  -751.291

From here, we notice a few things:

  • The coefficients for the model are similar to those in the Poisson log model
  • The standard error for width here is 0.044, more than twice the 0.020 in the Poisson log model. This is on account of the extra variability caused by overdispersion
  • The model reports a value of \(\theta = 0.905\), corresponding to a dispersion parameter of \(D = 1/\theta = 1.105\); the estimated variance is therefore \(\mu + D\mu^2\), exceeding the Poisson variance \(\mu\) by the quadratic term \(D\mu^2\) (a quick formal check follows this list)
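
One quick way to check this formally, using the already-loaded AER package:

## AER's dispersiontest() tests the Poisson fit against the
## alternative of overdispersion
dispersiontest(fit_log)

## The negative binomial's much lower AIC points the same way
AIC(fit_log, fit_nb)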

Poisson Rate Regression
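
Counts here are observed over varying amounts of exposure (risktime), so we model the rate \(\mu / t\) rather than the raw count. Taking logs,

\[ \log(\mu / t) = \alpha + \beta x \quad \Longleftrightarrow \quad \log \mu = \log t + \alpha + \beta x \]

so \(\log t\) enters the linear predictor with its coefficient fixed at 1, which is exactly what the offset argument below does.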

cancer <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Cancer.dat", 
                     header = TRUE)


fit <- glm(count ~ factor(histology) + factor(stage) + factor(time), 
           family = poisson, offset = log(risktime), data = cancer)
summary(fit)
## 
## Call:
## glm(formula = count ~ factor(histology) + factor(stage) + factor(time), 
##     family = poisson, data = cancer, offset = log(risktime))
## 
## Coefficients:
##                    Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)         -3.0093     0.1665  -18.07 <0.0000000000000002 ***
## factor(histology)2   0.1624     0.1220    1.33              0.1829    
## factor(histology)3   0.1075     0.1475    0.73              0.4658    
## factor(stage)2       0.4700     0.1744    2.69              0.0071 ** 
## factor(stage)3       1.3243     0.1521    8.71 <0.0000000000000002 ***
## factor(time)2       -0.1275     0.1491   -0.85              0.3926    
## factor(time)3       -0.0797     0.1635   -0.49              0.6258    
## factor(time)4        0.1189     0.1711    0.70              0.4869    
## factor(time)5       -0.6651     0.2606   -2.55              0.0107 *  
## factor(time)6       -0.3502     0.2435   -1.44              0.1504    
## factor(time)7       -0.1752     0.2499   -0.70              0.4832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 175.718  on 62  degrees of freedom
## Residual deviance:  43.923  on 52  degrees of freedom
## AIC: 251.7
## 
## Number of Fisher Scoring iterations: 5

However, consider what happens when we don’t offset:

fit_nooff <- glm(count ~ factor(histology) + factor(stage) + factor(time), 
           family = poisson, data = cancer)

summary(fit_nooff)
## 
## Call:
## glm(formula = count ~ factor(histology) + factor(stage) + factor(time), 
##     family = poisson, data = cancer)
## 
## Coefficients:
##                    Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)           2.316      0.159   14.53 < 0.0000000000000002 ***
## factor(histology)2   -0.511      0.122   -4.20    0.000027063984741 ***
## factor(histology)3   -1.019      0.145   -7.04    0.000000000001939 ***
## factor(stage)2        0.270      0.174    1.55              0.12108    
## factor(stage)3        1.329      0.148    9.00 < 0.0000000000000002 ***
## factor(time)2        -0.519      0.149   -3.49              0.00049 ***
## factor(time)3        -0.788      0.163   -4.85    0.000001244789911 ***
## factor(time)4        -0.904      0.169   -5.34    0.000000093711368 ***
## factor(time)5        -1.963      0.259   -7.58    0.000000000000035 ***
## factor(time)6        -1.800      0.241   -7.46    0.000000000000088 ***
## factor(time)7        -1.851      0.247   -7.50    0.000000000000063 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 419.708  on 62  degrees of freedom
## Residual deviance:  79.989  on 52  degrees of freedom
## AIC: 287.8
## 
## Number of Fisher Scoring iterations: 5
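
Comparing the two sets of coefficients side by side shows the problem: the time effects, modest when we model the rate, become large and strongly negative without the offset, because they now absorb the shrinking risktime in later follow-up intervals rather than measuring the death rate.

## Side-by-side coefficients: rate model (offset) vs. raw count model
round(cbind(with_offset = coef(fit), no_offset = coef(fit_nooff)), 3)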

Exercises

For these exercises, you'll want to install the following packages by copying and pasting this into your console (MASS provides the model-fitting functions, the other two provide datasets):

install.packages(c("MASS", "AER", "COUNT"))

Count Regression

This problem uses the PhDPublications dataset from the AER package. A full list of variables is given in ?PhDPublications; our outcome is the number of articles that biochemistry PhD students published in the last three years of their PhD.

library(AER)
data("PhDPublications")
phd <- PhDPublications

  1. Construct a Poisson regression model that includes covariates for the number of children and an indicator for marital status (a starter sketch follows this list). Which covariate appears to help most with productivity? Which appears to help the least?

  2. Now include in your model a covariate for the number of papers published by the PhD mentor. Based on the summary, does there appear to be a relationship between a student's published papers and mentor publications? Use the anova() function to compare the models and argue which you prefer based on the change in deviance.

  3. For each additional paper published by a mentor, by what factor can we expect the expected article count to increase for a biochemistry PhD student?

  4. Predict the number of articles published in the last three years for a PhD student who is married, has two kids, and has a mentor who published 3 articles in the previous years.
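
As a starting point for (1), assuming the variable names articles, kids, and married listed in ?PhDPublications:

## Poisson regression of article counts on number of kids and marital status
## (variable names assumed from ?PhDPublications)
fit_phd <- glm(articles ~ kids + married, family = poisson, data = phd)
summary(fit_phd)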

Rate Regression

This problem will use the fasttrakg dataset provided by the COUNT package:

library(COUNT)
data(fasttrakg)
fasttrakg$killip <- as.factor(fasttrakg$killip)

FASTRAK is the name of the Canadian National Cardiovascular Disease registry, with the present set covering the years 1996-1998. Observations in this set are grouped by covariate patterns from individual observations. Relevant variables for this problem include die and cases, giving the number of deaths and total cases for each covariate pattern, respectively, as well as anterior, an indicator for myocardial infarction in the front or rear of the heart, and killip, a factor variable that indicates the severity of the infarction on a 1-4 scale (with 4 being the most severe).

  1. Fit a standard Poisson regression on the fasttrakg dataset with the number of deaths serving as our outcome. Include anterior and killip as the covariates (a starter sketch follows this list). What happens to the coefficients as the Killip level increases? Does this mean that more severe events appear to kill more or fewer people?

  2. Fit the Poisson model again, this time including an offset term using the logarithm of the number of cases. Again examine the coefficients. How does this compare with what we found in (1)?

  3. Use dplyr to group by Killip level and find the total number of cases and deaths for each level. Based on this summary, what explains the discrepancy between what we found in (1) and what we found in (2)? With this in mind, which model appears more appropriate for assessing the association between Killip level and death?
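
As a starting point for (1) and (2), using the die, cases, anterior, and killip columns described above:

## (1) Poisson model for the number of deaths
fit_raw <- glm(die ~ anterior + killip, family = poisson, data = fasttrakg)

## (2) The same model with log(cases) as an offset, modeling the death rate
fit_rate <- glm(die ~ anterior + killip, family = poisson,
                offset = log(cases), data = fasttrakg)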

Negative Binomial

This problem uses the rwm dataset from the COUNT package, which includes German health registry information for the years 1984-1988. We will be interested in using this data to predict the number of doctor visits a patient makes over the same period.

library(COUNT)
data(rwm)

  1. Fit a standard Poisson regression including all of the covariates in the dataset (a starter sketch follows this list). Which factors seem to increase the rate of doctor visits? Which seem to decrease it?

  2. Fit a second model, this time assuming a negative binomial distribution. What is the value of \(\theta\)? What does this say about the existence of overdispersion?

  3. As we have seen before, some functions are provided as generics whose behavior changes depending on the type of object passed into them. Investigate the residuals() function for GLMs by reviewing ?residuals.glm. Use this to plot the Pearson residuals for both the Poisson and NB models. How do the residuals change?

  4. We noted in class that the negative binomial often comes about when we observe heterogeneity in our observations at a fixed set of coefficients. Typically, this is in the form of missing covariates that would further specify our individual observations. A special case of this arises when our data are a mixture of two separate distributions: a regular Poisson distribution along with a zero-inflated distribution. In short, this occurs when a subset of the observations produces zero values far more frequently than standard count models would predict.

  • Construct a bar chart in ggplot that shows the distribution of doctor visits across the dataset. What do you notice about the number of zero values?

  • Subset the dataset to remove the zeros and refit the negative binomial model. What has happened to the theta value? What does this indicate about inflated zeros and their effect on overdispersion?
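
As a starting point for (1) and (2), assuming the outcome column is docvis (verify with names(rwm)):

## Poisson and negative binomial fits with all covariates
## (docvis assumed to be the doctor-visit count; check names(rwm))
fit_pois <- glm(docvis ~ ., family = poisson, data = rwm)
fit_rwm_nb <- glm.nb(docvis ~ ., data = rwm)
summary(fit_rwm_nb)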