Start by installing the MASS
package with
install.packages("MASS")
library(ggplot2)
library(MASS)
library(xtable)
library(dplyr)
library(AER)
library(COUNT)
theme_set(theme_bw(base_size = 18))
crab <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat",
header = TRUE)
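Before modeling, it's worth a quick look at the data; sat (the number of satellites) will be our outcome and width (carapace width) our predictor:
str(crab)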
fit_log <- glm(sat ~ width, family = poisson, data = crab)
summary(fit_log)
##
## Call:
## glm(formula = sat ~ width, family = poisson, data = crab)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.305 0.542 -6.09 0.0000000011 ***
## width 0.164 0.020 8.22 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.2
##
## Number of Fisher Scoring iterations: 6
This gives us the fitted model
\[ \log \hat{\mu} = \hat{\alpha} + \hat{\beta} \times \text{width} \] or
\[ \hat{\mu} = e^{\hat{\alpha}} \left( e^{\hat{\beta}} \right)^{\text{width}} \]
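As a sanity check, we can rebuild the fitted means by hand from the coefficients and confirm they match predict() on the response scale (a quick sketch):
## Reconstruct mu-hat from the fitted coefficients
alpha <- coef(fit_log)[1]
beta <- coef(fit_log)[2]
mu_hat <- exp(alpha + beta * crab$width)
all.equal(unname(mu_hat), unname(predict(fit_log, type = "response")))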
Typically, we could fit a Poisson model with an identity link as follows, but for this particular dataset the fit fails to converge:
## This fails to converge
fit_identity <- glm(sat ~ width, family = quasi(link = identity, variance = "mu"), data = crab)
## Error: no valid set of coefficients has been found: please supply starting values
We can work around this by first fitting a standard linear model to estimate starting coefficients, then passing those as starting values to the ML algorithm:
## Get reasonable starting parameters (but with normal error)
fit_lm <- lm(sat ~ width, data = crab)
## Use these to fit poisson glm with identity
fit_identity <- glm(sat ~ width, family = quasi(link = identity, variance = "mu"), data = crab,
start = coef(fit_lm))
summary(fit_identity)
##
## Call:
## glm(formula = sat ~ width, family = quasi(link = identity, variance = "mu"),
## data = crab, start = coef(fit_lm))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.5256 1.2060 -9.56 <0.0000000000000002 ***
## width 0.5493 0.0528 10.40 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasi family taken to be 3.1721)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 557.71 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 22
Just as before, we can use the predict() function to visualize each of these fits:
y_log <- predict(fit_log, newdata = crab, type = "response")
y_id <- predict(fit_identity, newdata = crab, type = "response")
df <- data.frame(width = crab$width,
y = c(y_log, y_id),
model = rep(c("log", "id"), each = nrow(crab)))
ggplot(df, aes(width, y, color = model)) +
geom_point(data = crab, aes(width, y = sat), color = "black", alpha = 0.5) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("#B00000", "#0072B2"))
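Before moving on, note that the residual deviance (567.9 on 171 degrees of freedom) hints that the Poisson assumption of variance equal to the mean is too restrictive here. One common check, sketched below, estimates the dispersion from the Pearson residuals; the AER package loaded above also provides a formal test:
## Dispersion estimate: values well above 1 suggest overdispersion
sum(residuals(fit_log, type = "pearson")^2) / df.residual(fit_log)
## Cameron-Trivedi test of equidispersion from the AER package
dispersiontest(fit_log)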
The easiest way to fit a negative binomial model is with the glm.nb() function provided in the MASS package:
fit_nb <- glm.nb(sat ~ width, data = crab)
summary(fit_nb)
##
## Call:
## glm.nb(formula = sat ~ width, data = crab, init.theta = 0.90456808,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0525 1.1714 -3.46 0.00054 ***
## width 0.1921 0.0441 4.36 0.000013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
##
## Null deviance: 213.05 on 172 degrees of freedom
## Residual deviance: 195.81 on 171 degrees of freedom
## AIC: 757.3
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.905
## Std. Err.: 0.161
##
## 2 x log-likelihood: -751.291
From here, notice that the standard error for width is now 0.044, more than twice that in the Poisson log model (0.020). This is on account of the extra variability allowed for by overdispersion.
Offsets let a Poisson model describe rates rather than raw counts. The cancer dataset includes risktime, the time at risk associated with each count, which we enter on the log scale as an offset:
cancer <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Cancer.dat",
header = TRUE)
fit <- glm(count ~ factor(histology) + factor(stage) + factor(time),
family = poisson, offset = log(risktime), data = cancer)
summary(fit)
##
## Call:
## glm(formula = count ~ factor(histology) + factor(stage) + factor(time),
## family = poisson, data = cancer, offset = log(risktime))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0093 0.1665 -18.07 <0.0000000000000002 ***
## factor(histology)2 0.1624 0.1220 1.33 0.1829
## factor(histology)3 0.1075 0.1475 0.73 0.4658
## factor(stage)2 0.4700 0.1744 2.69 0.0071 **
## factor(stage)3 1.3243 0.1521 8.71 <0.0000000000000002 ***
## factor(time)2 -0.1275 0.1491 -0.85 0.3926
## factor(time)3 -0.0797 0.1635 -0.49 0.6258
## factor(time)4 0.1189 0.1711 0.70 0.4869
## factor(time)5 -0.6651 0.2606 -2.55 0.0107 *
## factor(time)6 -0.3502 0.2435 -1.44 0.1504
## factor(time)7 -0.1752 0.2499 -0.70 0.4832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 175.718 on 62 degrees of freedom
## Residual deviance: 43.923 on 52 degrees of freedom
## AIC: 251.7
##
## Number of Fisher Scoring iterations: 5
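Because the model includes log(risktime) as an offset, the coefficients describe log death rates per unit of risk time, and exponentiating gives rate ratios; for example, stage 3 corresponds to a death rate about exp(1.32) ≈ 3.8 times that of stage 1. A quick sketch:
## Rate ratios with (Wald) 95% intervals
exp(cbind(estimate = coef(fit), confint.default(fit)))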
However, consider what happens when we don't include the offset:
fit_nooff <- glm(count ~ factor(histology) + factor(stage) + factor(time),
family = poisson, data = cancer)
summary(fit_nooff)
##
## Call:
## glm(formula = count ~ factor(histology) + factor(stage) + factor(time),
## family = poisson, data = cancer)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.316 0.159 14.53 < 0.0000000000000002 ***
## factor(histology)2 -0.511 0.122 -4.20 0.000027063984741 ***
## factor(histology)3 -1.019 0.145 -7.04 0.000000000001939 ***
## factor(stage)2 0.270 0.174 1.55 0.12108
## factor(stage)3 1.329 0.148 9.00 < 0.0000000000000002 ***
## factor(time)2 -0.519 0.149 -3.49 0.00049 ***
## factor(time)3 -0.788 0.163 -4.85 0.000001244789911 ***
## factor(time)4 -0.904 0.169 -5.34 0.000000093711368 ***
## factor(time)5 -1.963 0.259 -7.58 0.000000000000035 ***
## factor(time)6 -1.800 0.241 -7.46 0.000000000000088 ***
## factor(time)7 -1.851 0.247 -7.50 0.000000000000063 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 419.708 on 62 degrees of freedom
## Residual deviance: 79.989 on 52 degrees of freedom
## AIC: 287.8
##
## Number of Fisher Scoring iterations: 5
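Without the offset, the histology and time coefficients change dramatically (several even flip sign), because the model now conflates each cell's death rate with how much risk time it contains. Placing the two sets of coefficients side by side makes this plain:
## Compare coefficients with and without the offset
round(cbind(with_offset = coef(fit), no_offset = coef(fit_nooff)), 3)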
For the following exercises, you’ll want to install these packages by copying and pasting this into your console (MASS provides functions; the other two provide datasets):
install.packages(c("MASS", "AER", "COUNT"))
This problem uses the PhDPublications dataset from the AER package. A full list of variables is given in ?PhDPublications; our outcome is the number of articles PhD students in biochemistry published in the last three years of their PhD.
library(AER)
data("PhDPublications")
phd <- PhDPublications
1. Construct a Poisson regression model that includes covariates for the number of children and an indicator for marital status. Which covariate appears to help most with productivity? Which appears to help the least?
2. Now include in your model a covariate for the number of papers produced by the PhD mentor. Based on the summary data, does there appear to be a relationship between published papers and mentor publications?
3. Use the anova function to compare the models and argue which you prefer based on the change in deviance.
4. For each additional paper published by a mentor, by how much can we expect the expected article count for a biochemistry PhD student to increase?
5. Predict the number of articles published in the last three years for a PhD student who is married, has two kids, and has a mentor who published 3 articles over that same period.
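If you want to check your setup, here is one possible starting point; it is only a sketch, and it assumes married is coded as a yes/no factor as documented in ?PhDPublications:
## A sketch of one possible approach (check ?PhDPublications for codings)
fit1 <- glm(articles ~ kids + married, family = poisson, data = phd)
fit2 <- update(fit1, . ~ . + mentor)
anova(fit1, fit2, test = "Chisq")
## Multiplicative change in expected count per additional mentor paper
exp(coef(fit2)["mentor"])
## Predicted articles for a married student with two kids and a mentor with 3
predict(fit2, newdata = data.frame(married = "yes", kids = 2, mentor = 3),
        type = "response")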
This problem will use the fasttrakg dataset provided by the COUNT package:
library(COUNT)
data(fasttrakg)
fasttrakg$killip <- as.factor(fasttrakg$killip)
FASTRAK is the name of the Canadian National Cardiovascular Disease registry; the present set covers the years 1996-1998. Observations in this set are grouped by covariate patterns from individual observations. Relevant variables for this problem include die and cases, giving the number of deaths and total cases for each covariate pattern, respectively, as well as anterior, an indicator for myocardial infarction in the front or rear of the heart, and killip, a factor variable that indicates the severity of the infarction on a 1-4 scale (with 4 being the most severe).
1. Fit a standard Poisson regression on the fasttrakg dataset with the number dead serving as our outcome. Include anterior and killip as the covariates. What happens to the coefficients as the Killip level increases? Does this mean that more severe events appear to kill more or fewer people?
2. Fit the Poisson model again, this time including an offset term using the logarithm of the number of cases. Again examine the coefficients. How does this compare with what we found in (1)?
3. Use dplyr to group by Killip level and find the total number of cases and deaths for each level. Based on this summary, what explains the discrepancy between what we found in (1) and what we found in (2)? With this in mind, which model appears more appropriate for assessing the association between Killip level and death?
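As a starting sketch, using the variable names described above:
## Poisson fits with and without the exposure offset
fit1 <- glm(die ~ anterior + killip, family = poisson, data = fasttrakg)
fit2 <- glm(die ~ anterior + killip, family = poisson,
            offset = log(cases), data = fasttrakg)
## Total cases and deaths by Killip level, using dplyr (loaded above)
fasttrakg %>%
  group_by(killip) %>%
  summarise(cases = sum(cases), deaths = sum(die))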
This problem uses the rwm dataset from the COUNT package. It includes German health registry information for the years 1984-1988. We will be interested in using this data to predict the number of visits to a doctor a patient may make over that same period.
library(COUNT)
data(rwm)
1. Fit a regular Poisson regression including all of the covariates in the dataset. Which factors seem to increase the rate of visiting the doctor? Which seem to decrease it?
2. Fit a second model, this time assuming a negative binomial distribution. What is the value of \(\theta\)? What does this say about the existence of overdispersion?
3. As we have seen before, some functions are provided as generics whose behavior changes depending on the type of object passed to them. Investigate the residuals() function for GLMs by reviewing ?residuals.glm. Use this to plot the Pearson residuals for both the Poisson and NB models. How do the residuals change?
We noted in class that the negative binomial often arises when we observe heterogeneity in our observations at a fixed set of covariates. Typically, this is in the form of missing covariates that would further specify our individual observations. A special case arises when our data are a mixture of two separate distributions: a regular Poisson distribution along with a zero-inflated one. In short, this occurs when a subset of the observations take a zero value far more frequently than standard count models would predict (the short simulation below illustrates this).
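Here is a small simulation sketch of that idea: mixing a Poisson(2) with a 30% "structural zero" class produces far more zeros than the Poisson alone would predict.
## Mixture of Poisson(2) counts and structural zeros
set.seed(1)
n <- 10000
z <- rbinom(n, 1, 0.3)              ## 30% fall in the always-zero class
y <- ifelse(z == 1, 0, rpois(n, 2)) ## the rest are Poisson(2)
mean(y == 0)                        ## about 0.3 + 0.7 * exp(-2), roughly 0.39
mean(rpois(n, 2) == 0)              ## plain Poisson(2): about exp(-2), roughly 0.14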
4. Construct a bar chart in ggplot that shows the distribution of doctor visits across the dataset. What do you notice about the number of zero values?
5. Subset the dataset to remove the zeros and refit the negative binomial model. What has happened to the theta value? What does this indicate about inflated zeros and their effect on overdispersion?
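A sketch of one possible approach, assuming the outcome column is docvis, the number of doctor visits (see ?rwm for the exact column names):
## Poisson and negative binomial fits on all covariates
fit_pois <- glm(docvis ~ ., family = poisson, data = rwm)
fit_nb <- glm.nb(docvis ~ ., data = rwm)
fit_nb$theta
## Pearson residuals for each model (see ?residuals.glm)
head(residuals(fit_pois, type = "pearson"))
head(residuals(fit_nb, type = "pearson"))
## Distribution of visits; note the spike at zero
ggplot(rwm, aes(docvis)) + geom_bar()
## Drop the zeros and refit
fit_nb_pos <- glm.nb(docvis ~ ., data = subset(rwm, docvis > 0))
fit_nb_pos$theta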