library(ggplot2)
library(dplyr)
This lab will be centered around extending the content of the previous lab, now allowing for linear regression with more than one covariate (independent variable).
Briefly, we recall that our model with a single covariate is of the form
\[ y = \alpha + X \beta + \epsilon \] where \(\alpha\) indicates our intercept, \(\beta\) our slope, and \(\epsilon\) a random error term with variance equal to \(\sigma^2\). A way to think of this is “all of our observations \(y\) are equal to the line \(\alpha + X\beta\), plus some error (\(\epsilon\))”. With regression, we estimate the values of our coefficients, with the estimates denoted \(\hat{\alpha}\) and \(\hat{\beta}\); these are then used to construct our “fitted values”,
\[ \hat{y} = \hat{\alpha} + X \hat{\beta} \] Ultimately, our goal is to construct an estimate of \(\hat{\beta}\), along with a measurement of uncertainty (also called variability) denoted \(s_{\beta}^2\). It is these two values that we use for hypothesis testing.
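In R, these estimates and fitted values are easy to pull out of a fitted model. Below is a minimal sketch, assuming a hypothetical data frame df with an outcome y and a single covariate x:
# Hypothetical data frame `df` with outcome y and covariate x
fit_simple <- lm(y ~ x, data = df)
coef(fit_simple)     # estimated intercept and slope (alpha-hat, beta-hat)
fitted(fit_simple)   # fitted values (y-hat)
summary(fit_simple)  # coefficient table with standard errors for hypothesis testing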
In terms of the qualities of our models themselves, there are two primary metrics from the previous lab that we should keep in mind going forward; fortunately, the math behind them is less important than their interpretation. The first, the mean squared error, or MSE, is an estimate of the variability in our model. More formally, this is an estimate of the variance term from \(\epsilon\), with \(MSE = \hat{\sigma}^2\).
The other term we should recall is \(R^2\), also called the “coefficient of determination”, which tells us as a percentage how much of the total variability in our outcome can be explained by our model. \(R^2\) values close to 1 indicate that our model explains most of the variability (or uncertainty), while values close to 0 indicate that it explains very little.
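Continuing the hypothetical fit_simple sketched above, one way to recover both of these quantities from a fitted model is:
sigma(fit_simple)^2            # MSE, the estimated residual variance
summary(fit_simple)$r.squared  # R-squared, the coefficient of determination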
Adding additional variables to our model increases the dimension. When we had only a single \(y\) and a single \(X\), we were in two dimensions, while our fitted “line”, being a line, was one dimension less.
This analogy extends when we add a second covariate, bringing our total dimensions up to three. Accordingly, our fitted “line” becomes a plane, which exists in two dimensions.
When dealing with multivariate regression, we often write our formula in the condensed form,
\[ y = X\beta + \epsilon \] which is notation borrowed from linear algebra. This expands to
\[ y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \dots + \epsilon \] where \(X_i\) denotes the \(i\)th covariate. (Note that \(\alpha\) has been replaced with \(\beta_0\) for notational consistency.)
Performing multivariate regression in R is simple enough and extends directly from what we did in the previous lab. For an outcome y and covariates x1 and x2, our linear model would simply be
fit <- lm(y ~ x1 + x2, data = df)
There is a cost associated with including multiple variables in our model, as each additional variable adds some degree of model uncertainty. As such, we want to be sure that any additional variables we do add are contributing more to the model than they are taking away.
When comparing models with multiple variables, there are primarily two methods we can use to judge their relative efficacy. The first is to compare the adjusted \(R^2\), which applies a penalty for adding more variables; models with larger adjusted \(R^2\) values are generally preferred to those with smaller. That being said, use your judgement when doing so. It would not be ideal to add several additional variables for a marginal improvement in adjusted \(R^2\).
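As a sketch, assuming two previously fitted models stored under the hypothetical names model_small and model_large, the adjusted \(R^2\) values can be pulled directly from their summaries and compared:
# Hypothetical fitted lm objects
summary(model_small)$adj.r.squared
summary(model_large)$adj.r.squared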
The second method uses the anova() function, which works well for models that are nested, i.e., models that contain all the same variables as a smaller model plus some extra. For example, our first model here has mpg as an outcome with only weight as a covariate, the second model contains weight and displacement, while the third contains weight, displacement, and horsepower.
fit1 <- lm(mpg ~ wt, mtcars)
fit2 <- lm(mpg ~ wt + disp, mtcars)
fit3 <- lm(mpg ~ wt + disp + hp, mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + disp
## Model 3: mpg ~ wt + disp + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278
## 2 29 247 1 31.6 4.54 0.042 *
## 3 28 195 1 51.7 7.42 0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the output, each line represents one of the models given as input. The RSS (residual sum of squares) gives an estimate of the unexplained error in our model (smaller is better), while the Pr(>F) column is the outcome of a hypothesis test (F-test) indicating whether there is evidence that a particular model outperforms the one immediately preceding it.
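If you want to check the RSS column by hand, it can be recovered from a model's residuals; for example, for the first model:
# Residual sum of squares for fit1 (matches the first row of the table)
sum(residuals(fit1)^2)
deviance(fit1)  # equivalent for lm objects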
The interpretation of a model with multiple covariates follows the same pattern as in simple linear regression. For example, given a fitted model
\[ \hat{y} = \hat{\beta_0} + X_1\hat{\beta_1} + X_2\hat{\beta_2} \] we would interpret the value \(\hat{\beta_2}\) as being “a unit change in \(X_2\) results in a \(\hat{\beta_2}\) change in \(\hat{y}\) assuming the value of \(X_1\) is fixed.”
Consider the iris dataset in R, which we have used once before; it records measurements of various aspects of the iris flower. In this model, we are interested in making inference on the length of the sepal, given sepal width, petal length, and petal width:
fit_iris <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris)
summary(fit_iris)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8282 -0.2199 0.0187 0.1971 0.8457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8560 0.2508 7.40 0.0000000000099 ***
## Sepal.Width 0.6508 0.0666 9.77 < 0.0000000000000002 ***
## Petal.Length 0.7091 0.0567 12.50 < 0.0000000000000002 ***
## Petal.Width -0.5565 0.1275 -4.36 0.0000241287569 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.315 on 146 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.856
## F-statistic: 296 on 3 and 146 DF, p-value: <0.0000000000000002
We could write this model algebraically as
\[ \hat{y} = 1.86 + 0.65 X_1 + 0.71 X_2 - 0.56 X_3 \] where \(\hat{y}\) is sepal length, and \(X_1\), \(X_2\), and \(X_3\) are sepal width, petal length, and petal width, respectively. We could interpret this as: with the other values of \(X\) held fixed, a one-centimeter increase in petal width corresponds to a 0.56 centimeter decrease in sepal length.
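One way to see this interpretation in action is to predict sepal length for two hypothetical flowers that are identical except for a one-centimeter difference in petal width; the difference in the predictions matches the fitted coefficient:
# Two hypothetical flowers differing only in Petal.Width
new_flowers <- data.frame(Sepal.Width = c(3, 3),
                          Petal.Length = c(4, 4),
                          Petal.Width = c(1, 2))
predict(fit_iris, newdata = new_flowers)
diff(predict(fit_iris, newdata = new_flowers))  # approximately -0.56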
Question 1: Consider the rock dataset built into R (?rock), which takes measurements on 48 rock samples from a petroleum reservoir. Create a scatter plot comparing each of the variables with the variable area and make predictions on which you think would have the strongest linear relationship with area. Confirm this by constructing a linear model.
For this portion of the lab, we are going to be using the mtcars dataset built into R.
Consider a scenario in which we are interested in relating the weight of a vehicle to its fuel economy, measured in miles per gallon. A good first step in any investigation of this sort is to plot the variables to visualize the relationship; this will help us identify potentially useful models.
ggplot(mtcars, aes(wt, mpg)) + geom_point()
From this, a linear relationship seems appropriate, and we can
construct this using lm()
fit <- lm(mpg ~ wt, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 0.0000000000000002 ***
## wt -5.344 0.559 -9.56 0.00000000013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 0.000000000129
First, we see that this model has an adjusted \(R^2\) of 0.745, indicating that quite a bit of the variability in fuel economy has been captured by considering weight alone. We also see a coefficient associated with weight of -5.344, which we interpret as saying that every 1,000 lb increase in the weight of a vehicle is associated with a decrease in fuel economy of about 5.3 miles per gallon. Note that this is also consistent with what we saw in the ggplot above.
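If we want to attach a measure of uncertainty to this estimate, one option is a 95% confidence interval for the weight coefficient:
# 95% confidence interval for the coefficient on wt
confint(fit, "wt")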
We now move on to adding additional variables to our model. In doing so, we need to concern ourselves with issues of multicollinearity, a phenomenon in which two or more of our variables are highly correlated. This can cause issues in several ways: it inflates the uncertainty of our estimates, and it makes individual coefficients difficult to interpret, since the effect of one variable gets spread across those it is correlated with.
To illustrate this, suppose we are still interested in estimating miles per gallon, but in addition to weight, we also include engine displacement and the number of carburetors in our model; both of these are clearly associated with mpg.
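A quick way to check this, and to see how strongly these candidate covariates are related to weight itself, is a correlation matrix; a minimal sketch:
# Pairwise correlations among the outcome and candidate covariates
cor(mtcars[, c("mpg", "wt", "disp", "carb")])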
Consider then our model with all of these terms included
fit_disp <- lm(mpg ~ wt + disp + carb, mtcars)
summary(fit_disp)
##
## Call:
## lm(formula = mpg ~ wt + disp + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.074 -1.839 -0.352 1.310 5.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.49063 2.01903 17.58 <0.0000000000000002 ***
## wt -2.87249 1.09765 -2.62 0.014 *
## disp -0.01697 0.00853 -1.99 0.056 .
## carb -0.79718 0.33286 -2.39 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.7 on 28 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.799
## F-statistic: 42 on 3 and 28 DF, p-value: 0.00000000017
Most immediately, we see that the coefficient associated with weight has been cut nearly in half! It’s not that the relationship between weight and fuel economy has changed, rather that the effect has been distributed across more variables.
In terms of interpretation of this new coefficient, we might be tempted to say “a unit change in weight (1000 lbs) is associated with a -2.87 change in fuel economy, everything else being fixed.” However, consider a plot showing the relationship between weight and the other variables:
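A sketch of one way to produce such a plot, here showing weight against displacement:
# Weight and displacement move together almost in lockstep
ggplot(mtcars, aes(wt, disp)) + geom_point()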
From this, we see that changing weight without changing the value of displacement is nearly impossible; every unit change in weight is associated with roughly a 100 to 200 cubic inch change in displacement. This is an example of our two covariates exhibiting multicollinearity.
Alternatively, let’s consider a variable in our model that is not associated with weight. Here, we look at “qsec”, which is the number of seconds for the vehicle to travel 1/4 mile:
ggplot(mtcars, aes(wt, qsec)) + geom_point()
Here, there is no immediately discernible relationship between the two. When we add this variable to the model, our estimate for weight remains nearly unchanged:
fit_qsec <- lm(mpg ~ wt + qsec, mtcars)
summary(fit_qsec)
##
## Call:
## lm(formula = mpg ~ wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.396 -2.143 -0.213 1.492 5.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.746 5.252 3.76 0.00077 ***
## wt -5.048 0.484 -10.43 0.000000000025 ***
## qsec 0.929 0.265 3.51 0.00150 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 29 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.814
## F-statistic: 69 on 2 and 29 DF, p-value: 0.00000000000939
Not only does this model better satisfy our assumptions, but the adjusted \(R^2\) value of fit_qsec is larger than that of fit_disp, while using one fewer variable.
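We can confirm this by pulling the adjusted \(R^2\) values directly:
summary(fit_disp)$adj.r.squared  # model with wt, disp, and carb
summary(fit_qsec)$adj.r.squared  # model with wt and qsec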
A good way to determine whether an additional variable would contribute to a model is to plot the model's residuals against the variable in question: a good candidate for inclusion will show a relationship with the residuals, i.e., it helps explain variability that the current model leaves unexplained.
# Our original model
fit <- lm(mpg ~ wt, mtcars)
par(mfrow = c(1, 2))
# Plot against displacement
plot(mtcars$disp, residuals(fit), main = "Displacement vs Residuals")
abline(h = 0, col = 'red', lty = 'dashed')
# Plot against qsec
plot(mtcars$qsec, residuals(fit), main = "qsec vs Residuals")
abline(h = 0, col = 'red', lty = 'dashed')
In the displacement plot, we see that the residuals tend to be scattered around zero for all values of displacement, whereas for the qsec plot, we see a clear linear relationship between the residuals and qsec.
Question 2: For this question, we are going to use the colleges dataset with the goal of determining which covariates are most responsible for the average 10-year salary of college graduates.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
colleges <- na.omit(colleges)
The plot below gives a quick way to visualize the relationship between each of the individual covariates that we will be using in the model. To interpret it, begin with the outcome variable, Salary10yr_median, located in the bottom right corner. In each of the plots above it, Salary10yr_median is on the x-axis, while in each of the plots to its left, it is on the y-axis.
plot(colleges[, c("Adm_Rate", "Debt_median", "Avg_Fac_Salary", "ACT_median", "Net_Tuition", "Salary10yr_median")])
In doing this, adhere to the following guidelines:
- Use the anova() function to compare models
Your answer should include your models with one, two, and three variables, as well as the residual plots you used to justify the inclusion of additional variables. You do not need to include models that you considered and then ultimately rejected.
Often, we wish to include categorical variables in our regression analysis. Consider the mpg dataset from ggplot2, where we group by drive train (four wheel drive, front, and rear) and compute the mean highway mileage in each group:
# as.data.frame because tibble rounds
mpg %>% group_by(drv) %>%
summarize(mean(hwy)) %>% as.data.frame()
## drv mean(hwy)
## 1 4 19.175
## 2 f 28.160
## 3 r 21.000
Compare this with what we find in a regression model:
lm(hwy ~ drv, mpg) %>% summary()
##
## Call:
## lm(formula = hwy ~ drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.16 -2.17 -1.00 1.96 15.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.175 0.404 47.5 <0.0000000000000002 ***
## drvf 8.986 0.567 15.8 <0.0000000000000002 ***
## drvr 1.825 0.913 2.0 0.047 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.1 on 231 degrees of freedom
## Multiple R-squared: 0.531, Adjusted R-squared: 0.527
## F-statistic: 131 on 2 and 231 DF, p-value: <0.0000000000000002
A few things stand out immediately. First, we see that the intercept is identical to the mean highway fuel economy for four-wheel drive vehicles. Next, we see that we have model estimates for front-wheel drive and rear-wheel drive, but nothing explicitly for four-wheel drive.
As you might have guessed, adding the value of the intercept to each of the coefficients recovers the group means: 19.17 + 8.99 is equal to 28.16, the mean fuel economy for front-wheel drive, while 19.17 + 1.83 is equal to 21, the mean fuel economy for rear-wheel drive.
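We can verify this arithmetic directly from the fitted coefficients; a quick sketch, storing the model under the hypothetical name fit_drv:
fit_drv <- lm(hwy ~ drv, mpg)
coef(fit_drv)["(Intercept)"] + coef(fit_drv)["drvf"]  # mean hwy for front-wheel drive
coef(fit_drv)["(Intercept)"] + coef(fit_drv)["drvr"]  # mean hwy for rear-wheel drive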
This is an example of the creation of a reference group, which gets absorbed into the intercept. We can inhibit this behavior by including -1 in our model, which removes the intercept entirely:
lm(hwy ~ -1 + drv, mpg) %>% summary()
##
## Call:
## lm(formula = hwy ~ -1 + drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.16 -2.17 -1.00 1.96 15.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## drv4 19.175 0.404 47.5 <0.0000000000000002 ***
## drvf 28.160 0.398 70.8 <0.0000000000000002 ***
## drvr 21.000 0.819 25.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.1 on 231 degrees of freedom
## Multiple R-squared: 0.972, Adjusted R-squared: 0.971
## F-statistic: 2.64e+03 on 3 and 231 DF, p-value: <0.0000000000000002
Doing so results in coefficient values that are equal to each group mean.
You’ll notice that in addition to the coefficient estimates changing between models, so too does the significance of each. To understand this, consider again the particular hypothesis each test is looking to evaluate. What is the null hypothesis for the coefficients when the intercept is included? What is the null hypothesis when it is not?
Categorical variables can be included either to control for variability between groups (as is often the case when sex is included as a covariate), or they may be variables of interest themselves. For example, in the ToothGrowth dataset, which examines the growth of odontoblasts in guinea pigs given vitamin C as either ascorbic acid or as orange juice, we might be interested in understanding both the impact of dose as well as the delivery method.
fit <- lm(len ~ supp + dose, ToothGrowth)
summary(fit)
##
## Call:
## lm(formula = len ~ supp + dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.272 1.282 7.23 0.00000000131233453 ***
## suppVC -3.700 1.094 -3.38 0.0013 **
## dose 9.764 0.877 11.14 0.00000000000000063 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared: 0.704, Adjusted R-squared: 0.693
## F-statistic: 67.7 on 2 and 57 DF, p-value: 0.000000000000000872
By their nature, categorical variables of this kind result in a vertical shift of the intercept according to the group in question:
cc <- coef(lm(len ~ supp + dose, ToothGrowth))
ggplot(ToothGrowth, aes(dose, len, color = supp)) + geom_point() +
geom_abline(intercept = cc[1], slope = cc[3], color = '#F8766D') +
geom_abline(intercept = cc[1] + cc[2], slope = cc[3], color = '#00BFC4') +
theme_bw()
Question 3: Again using the mtcars
dataset, suppose that we are constructing a model to investigate the
relationship between weight and miles per gallon, but that we also want
to include the number of carburetors as a control variable. Construct
two models, one using carburetors as it is stored by default in the
dataset (as an integer) and another treating carburetors as a factor
variable. Then answer the following questions: