library(ggplot2)
library(dplyr)
This lab will be centered around extending the content of the previous lab, now allowing for linear regression with more than one covariate (independent variable).
Briefly, we recall that our model with a single covariate is of the form
\[ y = \alpha + X \beta + \epsilon \] where \(\alpha\) indicates our intercept, \(\beta\) our slope, and \(\epsilon\) a random error term with variance equal to \(\sigma^2\). A way to think of this is “all of our observations \(y\) are equal to the line \(\alpha + X\beta\), plus some error (\(\epsilon\))”. With regression, we estimate the values of our coefficients, with the estimates denoted \(\hat{\alpha}\) and \(\hat{\beta}\); these are then used to construct our “fitted values”,
\[ \hat{y} = \hat{\alpha} + X \hat{\beta} \] Ultimately, our goal is to construct an estimate of \(\hat{\beta}\), along with a measurement of uncertainty (also called variability) denoted \(s_{\beta}^2\). It is these two values that we use for hypothesis testing.
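In R, these estimates and fitted values are easy to pull out of a fitted model. Below is a minimal sketch, assuming a hypothetical data frame df with an outcome y and a single covariate x:
# Hypothetical data frame `df` with outcome y and covariate x
fit_simple <- lm(y ~ x, data = df)
coef(fit_simple)     # estimated intercept and slope (alpha-hat, beta-hat)
fitted(fit_simple)   # fitted values (y-hat)
summary(fit_simple)  # coefficient table with standard errors for hypothesis testing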
In terms of the qualities of our models themselves, there are two primary metrics from the previous lab that we should keep in mind going forward; fortunately, the math behind them is less important than their interpretation. The first, the mean squared error, or MSE, is an estimate of the variability in our model. More formally, this is an estimate of the variance term from \(\epsilon\), with \(MSE = \hat{\sigma}^2\).
The other term we should recall is \(R^2\), also called the “coefficient of determination”, which tells us as a percentage how much of the total variability in our outcome can be explained by our model. \(R^2\) values close to 1 indicate that our model explains most of the variability (or uncertainty), while values close to 0 indicate that it explains very little.
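Continuing the hypothetical fit_simple sketched above, one way to recover both of these quantities from a fitted model is:
sigma(fit_simple)^2            # MSE, the estimated residual variance
summary(fit_simple)$r.squared  # R-squared, the coefficient of determination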
Adding additional variables to our model increases the dimension. When we had only a single \(y\) and a single \(X\), we were in two dimensions, while our fitted “line”, being a line, was one dimension less.
This analogy extends when we add a second covariate, bringing our total dimensions up to three. Accordingly, our fitted “line” becomes a plane, which exists in two dimensions.
When dealing with multivariate regression, we often write our formula in the condensed form,
\[ y = X\beta + \epsilon \] which is notation borrowed from linear algebra. This expands to
\[ y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \dots + \epsilon \] where \(X_i\) denotes the \(i\)th covariate. (Note that \(\alpha\) has been replaced with \(\beta_0\) for notational consistency.)
Performing multivariate regression in R is simple enough and extends directly from what we did in the previous lab. For an outcome y and covariates x1 and x2, our linear model would simply be
fit <- lm(y ~ x1 + x2, data = df)
There is a cost associated with including multiple variables in our model, as each additional variable adds some degree of model uncertainty. As such, we want to be sure that any additional variables we do add are contributing more to the model than they are taking away.
When comparing models with multiple variables, there are primarily two methods we can use to judge their relative efficacy. The first is to compare the adjusted \(R^2\), which applies a penalty for adding more variables; models with larger adjusted \(R^2\) values are generally preferred to those with smaller. That being said, use your judgement when doing so. It would not be ideal to add several additional variables for a marginal improvement in adjusted \(R^2\).
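As a sketch, assuming two previously fitted models stored under the hypothetical names model_small and model_large, the adjusted \(R^2\) values can be pulled directly from their summaries and compared:
# Hypothetical fitted lm objects
summary(model_small)$adj.r.squared
summary(model_large)$adj.r.squared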
The second method uses the anova() function, which works well for models that are nested, i.e., models that contain all the same variables as a smaller model plus some extra. For example, our first model here has mpg as an outcome with only weight as a covariate, the second model contains weight and displacement, while the third contains weight, displacement, and horsepower.
fit1 <- lm(mpg ~ wt, mtcars)
fit2 <- lm(mpg ~ wt + disp, mtcars)
fit3 <- lm(mpg ~ wt + disp + hp, mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + disp
## Model 3: mpg ~ wt + disp + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278
## 2 29 247 1 31.6 4.54 0.042 *
## 3 28 195 1 51.7 7.42 0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the output, each line represents one of the models given as input. The RSS (residual sum of squares) gives an estimate of the unexplained error in our model (smaller is better), while the Pr(>F) column is the outcome of a hypothesis test (F-test) indicating whether there is evidence that a particular model outperforms the one immediately preceding it.
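If you want to check the RSS column by hand, it can be recovered from a model's residuals; for example, for the first model:
# Residual sum of squares for fit1 (matches the first row of the table)
sum(residuals(fit1)^2)
deviance(fit1)  # equivalent for lm objects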
The interpretation of a model with multiple covariates follows the same pattern as in simple linear regression. For example, given a fitted model
\[ \hat{y} = \hat{\beta_0} + X_1\hat{\beta_1} + X_2\hat{\beta_2} \] we would interpret the value \(\hat{\beta_2}\) as being “a unit change in \(X_2\) results in a \(\hat{\beta_2}\) change in \(\hat{y}\) assuming the value of \(X_1\) is fixed.”
Consider the iris dataset in R, which we have used once before; it records measurements of various aspects of the iris flower. In this model, we are interested in making inference on the length of the sepal, given sepal width, petal length, and petal width:
fit_iris <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris)
summary(fit_iris)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8282 -0.2199 0.0187 0.1971 0.8457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8560 0.2508 7.40 0.0000000000099 ***
## Sepal.Width 0.6508 0.0666 9.77 < 0.0000000000000002 ***
## Petal.Length 0.7091 0.0567 12.50 < 0.0000000000000002 ***
## Petal.Width -0.5565 0.1275 -4.36 0.0000241287569 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.315 on 146 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.856
## F-statistic: 296 on 3 and 146 DF, p-value: <0.0000000000000002
We could write this model algebraically as
\[ \hat{y} = 1.86 + 0.65 X_1 + 0.71 X_2 - 0.56 X_3 \] where \(\hat{y}\) is sepal length, and \(X_1\), \(X_2\), and \(X_3\) are sepal width, petal length, and petal width, respectively. We could interpret this as: with the other values of \(X\) held fixed, a one-centimeter increase in petal width corresponds to a 0.56 centimeter decrease in sepal length.
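One way to see this interpretation in action is to predict sepal length for two hypothetical flowers that are identical except for a one-centimeter difference in petal width; the difference in the predictions matches the fitted coefficient:
# Two hypothetical flowers differing only in Petal.Width
new_flowers <- data.frame(Sepal.Width = c(3, 3),
                          Petal.Length = c(4, 4),
                          Petal.Width = c(1, 2))
predict(fit_iris, newdata = new_flowers)
diff(predict(fit_iris, newdata = new_flowers))  # approximately -0.56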
Question 1: Consider the rock dataset built into R (?rock), which takes measurements on 48 rock samples from a petroleum reservoir. Create a scatter plot comparing each of the variables with the variable area and make predictions on which you think would have the strongest linear relationship with area. Confirm this by constructing a linear model.
For this portion of the lab, we are going to be using the mtcars dataset built into R.
Consider a scenario in which we are interested in relating the weight of a vehicle to its fuel economy, measured in miles per gallon. A good first step in any investigation of this sort is to plot the variables to visualize the relationship; this will help us identify potentially useful models.
ggplot(mtcars, aes(wt, mpg)) + geom_point()
From this, a linear relationship seems appropriate, and we can
construct this using lm()
fit <- lm(mpg ~ wt, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 0.0000000000000002 ***
## wt -5.344 0.559 -9.56 0.00000000013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 0.000000000129
First, we see that this model has an adjusted \(R^2\) of 0.745, indicating that quite a bit of the variability in fuel economy has been captured by considering weight alone. We also see a coefficient associated with weight of -5.344, which we interpret as saying that every 1,000 lb increase in the weight of a vehicle is associated with a decrease in fuel economy of about 5.3 miles per gallon. Note that this is also consistent with what we saw in the ggplot above.
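If we want to attach a measure of uncertainty to this estimate, one option is a 95% confidence interval for the weight coefficient:
# 95% confidence interval for the coefficient on wt
confint(fit, "wt")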
We now move on to adding additional variables to our model. In doing so, we need to concern ourselves with issues of multicollinearity, a phenomenon in which two or more of our variables are highly correlated. This can cause issues in several ways: it inflates the uncertainty of our estimates, and it makes individual coefficients difficult to interpret, since the effect of one variable gets spread across those it is correlated with.
To illustrate this, suppose we are still interested in estimating miles per gallon, but in addition to weight, we also include engine displacement and the number of carburetors in our model; both of these are clearly associated with mpg.
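A quick way to check this, and to see how strongly these candidate covariates are related to weight itself, is a correlation matrix; a minimal sketch:
# Pairwise correlations among the outcome and candidate covariates
cor(mtcars[, c("mpg", "wt", "disp", "carb")])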
Consider then our model with all of these terms included
fit_disp <- lm(mpg ~ wt + disp + carb, mtcars)
summary(fit_disp)
##
## Call:
## lm(formula = mpg ~ wt + disp + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.074 -1.839 -0.352 1.310 5.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.49063 2.01903 17.58 <0.0000000000000002 ***
## wt -2.87249 1.09765 -2.62 0.014 *
## disp -0.01697 0.00853 -1.99 0.056 .
## carb -0.79718 0.33286 -2.39 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.7 on 28 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.799
## F-statistic: 42 on 3 and 28 DF, p-value: 0.00000000017
Most immediately, we see that the coefficient associated with weight has been cut nearly in half! It’s not that the relationship between weight and fuel economy has changed, rather that the effect has been distributed across more variables.
In terms of interpretation of this new coefficient, we might be tempted to say “a unit change in weight (1000 lbs) is associated with a -2.87 change in fuel economy, everything else being fixed.” However, consider a plot showing the relationship between weight and the other variables:
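A sketch of one way to produce such a plot, here showing weight against displacement:
# Weight and displacement move together almost in lockstep
ggplot(mtcars, aes(wt, disp)) + geom_point()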
From this, we see that changing weight without changing the value of displacement is nearly impossible; every unit change in weight is associated with roughly a 100 to 200 cubic inch change in displacement. This is an example of our two covariates exhibiting multicollinearity.
Alternatively, let’s consider a variable in our model that is not associated with weight. Here, we look at “qsec”, which is the number of seconds for the vehicle to travel 1/4 mile:
ggplot(mtcars, aes(wt, qsec)) + geom_point()
Here, there is no immediately discernible relationship between the two. When we add this variable to the model, our estimate for weight remains nearly unchanged:
fit_qsec <- lm(mpg ~ wt + qsec, mtcars)
summary(fit_qsec)
##
## Call:
## lm(formula = mpg ~ wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.396 -2.143 -0.213 1.492 5.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.746 5.252 3.76 0.00077 ***
## wt -5.048 0.484 -10.43 0.000000000025 ***
## qsec 0.929 0.265 3.51 0.00150 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 29 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.814
## F-statistic: 69 on 2 and 29 DF, p-value: 0.00000000000939
Not only does this model better satisfy our assumptions, but the adjusted \(R^2\) value of fit_qsec is larger than that of fit_disp, while using one fewer variable.
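We can confirm this by pulling the adjusted \(R^2\) values directly:
summary(fit_disp)$adj.r.squared  # model with wt, disp, and carb
summary(fit_qsec)$adj.r.squared  # model with wt and qsec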
A good way to determine whether an additional variable would contribute to a model is to plot the model's residuals against the variable in question: a good candidate for inclusion will show a relationship with the residuals, i.e., it helps explain variability that the current model leaves unexplained.
# Our original model
fit <- lm(mpg ~ wt, mtcars)
par(mfrow = c(1, 2))
# Plot against displacement
plot(mtcars$disp, residuals(fit), main = "Displacement vs Residuals")
abline(h = 0, col = 'red', lty = 'dashed')
# Plot against qsec
plot(mtcars$qsec, residuals(fit), main = "qsec vs Residuals")
abline(h = 0, col = 'red', lty = 'dashed')
In the displacement plot, we see that the residuals tend to be scattered around zero for all values of displacement, whereas for the qsec plot, we see a clear linear relationship between the residuals and qsec.
Question 2: For this question, we are going to use the colleges dataset with the goal of determining which covariates are most responsible for the average 10-year salary of college graduates.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
colleges <- na.omit(colleges)
The plot below gives a quick way to visualize the relationship between each of the individual covariates that we will be using in the model. To interpret it, begin with the outcome variable, Salary10yr_median, located in the bottom right corner. In each of the plots above it, Salary10yr_median is on the x-axis, while in each of the plots to its left, it is on the y-axis.
plot(colleges[, c("Adm_Rate", "Debt_median", "Avg_Fac_Salary", "ACT_median", "Net_Tuition", "Salary10yr_median")])
In doing this, adhere to the following guidelines:
- Use the anova() function to compare models
Your answer should include your models with one, two, and three variables, as well as the residual plots you used to justify the inclusion of additional variables. You do not need to include models that you considered and then ultimately rejected.
Often, we wish to include categorical variables in our regression analysis. Consider the mpg dataset from ggplot2, where we group by drive train (four wheel drive, front, and rear) and compute the mean highway mileage in each group:
# as.data.frame because tibble rounds
mpg %>% group_by(drv) %>%
summarize(mean(hwy)) %>% as.data.frame()
## drv mean(hwy)
## 1 4 19.175
## 2 f 28.160
## 3 r 21.000
Compare this with what we find in a regression model:
lm(hwy ~ drv, mpg) %>% summary()
##
## Call:
## lm(formula = hwy ~ drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.16 -2.17 -1.00 1.96 15.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.175 0.404 47.5 <0.0000000000000002 ***
## drvf 8.986 0.567 15.8 <0.0000000000000002 ***
## drvr 1.825 0.913 2.0 0.047 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.1 on 231 degrees of freedom
## Multiple R-squared: 0.531, Adjusted R-squared: 0.527
## F-statistic: 131 on 2 and 231 DF, p-value: <0.0000000000000002
A few things stand out immediately. First, we see that the intercept is identical to the mean highway fuel economy for four-wheel drive vehicles. Next, we see that we have model estimates for front-wheel drive and rear-wheel drive, but nothing explicitly for four-wheel drive.
As you might have guessed, adding the value of the intercept to each of the coefficients recovers the group means: 19.17 + 8.99 is equal to 28.16, the mean fuel economy for front-wheel drive, while 19.17 + 1.83 is equal to 21, the mean fuel economy for rear-wheel drive.
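We can verify this arithmetic directly from the fitted coefficients; a quick sketch, storing the model under the hypothetical name fit_drv:
fit_drv <- lm(hwy ~ drv, mpg)
coef(fit_drv)["(Intercept)"] + coef(fit_drv)["drvf"]  # mean hwy for front-wheel drive
coef(fit_drv)["(Intercept)"] + coef(fit_drv)["drvr"]  # mean hwy for rear-wheel drive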
This is an example of the creation of a reference group, which gets absorbed into the intercept. We can inhibit this behavior by including -1 in our model, which removes the intercept entirely:
lm(hwy ~ -1 + drv, mpg) %>% summary()
##
## Call:
## lm(formula = hwy ~ -1 + drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.16 -2.17 -1.00 1.96 15.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## drv4 19.175 0.404 47.5 <0.0000000000000002 ***
## drvf 28.160 0.398 70.8 <0.0000000000000002 ***
## drvr 21.000 0.819 25.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.1 on 231 degrees of freedom
## Multiple R-squared: 0.972, Adjusted R-squared: 0.971
## F-statistic: 2.64e+03 on 3 and 231 DF, p-value: <0.0000000000000002
Doing so results in coefficient values that are equal to each group mean.
You’ll notice that in addition to the coefficient estimates changing between models, so too does the significance of each. To understand this, consider again the particular hypothesis each test is looking to evaluate. What is the null hypothesis for the coefficients when the intercept is included? What is the null hypothesis when it is not?
Categorical variables can be included either to control for variability between groups (as is often the case when sex is included as a covariate), or they may be variables of interest themselves. For example, in the ToothGrowth dataset, which examines the growth of odontoblasts in guinea pigs given vitamin C as either ascorbic acid or as orange juice, we might be interested in understanding both the impact of dose as well as the delivery method.
fit <- lm(len ~ supp + dose, ToothGrowth)
summary(fit)
##
## Call:
## lm(formula = len ~ supp + dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.272 1.282 7.23 0.00000000131233453 ***
## suppVC -3.700 1.094 -3.38 0.0013 **
## dose 9.764 0.877 11.14 0.00000000000000063 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared: 0.704, Adjusted R-squared: 0.693
## F-statistic: 67.7 on 2 and 57 DF, p-value: 0.000000000000000872
By their nature, categorical variables of this kind result in a vertical shift of the intercept according to the group in question:
cc <- coef(lm(len ~ supp + dose, ToothGrowth))
ggplot(ToothGrowth, aes(dose, len, color = supp)) + geom_point() +
geom_abline(intercept = cc[1], slope = cc[3], color = '#F8766D') +
geom_abline(intercept = cc[1] + cc[2], slope = cc[3], color = '#00BFC4') +
theme_bw()
Question 3: Again using the mtcars
dataset, suppose that we are constructing a model to investigate the
relationship between weight and miles per gallon, but that we also want
to include the number of carburetors as a control variable. Construct
two models, one using carburetors as it is stored by default in the
dataset (as an integer) and another treating carburetors as a factor
variable. Then answer the following questions: