library(dplyr)
library(ggplot2)

# Pretty plots
theme_set(theme_bw())

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Re-introduction to Linear Regression in R

We were first introduced to linear regression in Lab 4. We continue that topic today, revisiting the key R functions used for linear regression and introducing some new tools.

Simple Linear Regression

Recall that simple linear regression (as opposed to multivariate linear regression) models a continuous/quantitative outcome variable with a single covariate, and is assumed to be of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the observed value of an observation and its predicted value,

\[ r_i = y_i - \hat{y}_i \] Ultimately, the estimated values of \(\beta\) are those chosen so as to minimize the residual sum of squares, \(\sum_{i=1}^n r_i^2\).
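
For simple linear regression, this minimization has a closed-form solution in terms of the sample means \(\bar{x}\) and \(\bar{y}\) (we will let R do this computation for us, but it is useful to see where the estimates come from):

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]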


For practice, we will use the mtcars dataset that comes built into R. You can use ?mtcars to get a description of each of the variables.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Similar to the ANOVA and t-test functions, the lm() function (linear model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicle's miles per gallon (mpg) using the engine displacement (disp).

## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)

## Summary function to display output
summary(fit)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.892 -2.202 -0.963  1.627  7.231 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 29.59985    1.22972   24.07 < 0.0000000000000002 ***
## disp        -0.04122    0.00471   -8.75        0.00000000094 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.25 on 30 degrees of freedom
## Multiple R-squared:  0.718,  Adjusted R-squared:  0.709 
## F-statistic: 76.5 on 1 and 30 DF,  p-value: 0.000000000938

From here, we get a lot of useful summary information. Try to identify the following:

  1. The slope estimate is \(\hat{\beta}_1 = -0.041\), indicating that for each additional cubic inch of displacement, the vehicle is expected to lose about 0.04 miles per gallon
  2. Based on the p-value reported in the summary, we have evidence that the relationship between mpg and displacement is statistically significant, leading us to reject the null hypothesis that there is no linear relationship between the two
  3. We have a multiple R-squared of 0.718, indicating that about 72% of the total variance in mpg is explained by vehicle displacement
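
Each of these quantities can also be extracted directly from the fitted model object rather than read off the printed summary; a brief sketch using base R accessor functions:

coef(fit)               # estimated coefficients (intercept and slope)
head(fitted(fit))       # fitted/predicted values, y-hat
head(resid(fit))        # residuals, r_i
sum(resid(fit)^2)       # residual sum of squares minimized by lm()
summary(fit)$r.squared  # multiple R-squared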

It’s worth observing that the effect size, the value of \(\hat{\beta}_1\), isn’t particularly large: without more context, it is difficult to tell whether this reflects a genuinely weak relationship or simply the difference in scale between mpg and engine displacement (we return to the question of scale just below). Creating a plot gives us evidence for the latter. Additionally, recall that we can use the geom_smooth() function with ggplot to illustrate how our regression line will look.

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) # se = FALSE removes the confidence band

The apparent strength of this relationship corroborates what we found in the linear model summary information above.
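
One way to address the question of scale (a sketch, not part of the lab itself) is to standardize both variables before fitting, storing the result in a copy called mtcars_std here; for simple linear regression the standardized slope equals the correlation between the two variables:

## Standardize both variables (mean 0, sd 1), then refit
mtcars_std <- mtcars %>%
  mutate(mpg = as.numeric(scale(mpg)),
         disp = as.numeric(scale(disp)))
coef(lm(mpg ~ disp, mtcars_std))  # slope is now in standard-deviation units

On this scale the slope is about -0.85 (the negative square root of the R-squared reported above), which is consistent with the strong relationship seen in the plot.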


Question 1: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt).


Question 2: Some infants are born with congenital heart defects and require surgery very early in life. One approach to performing this surgery is known as “circulatory arrest.” A downside of this procedure, however, is that it cuts off the flow of blood to the brain, possibly resulting in brain damage. An alternative procedure, “low-flow bypass” maintains circulation to the brain, but does so with an external pump that potentially causes other sorts of injuries to the brain.

To investigate the treatments, surgeons at Harvard Medical School conducted a randomized controlled trial. In the trial, 70 infants received low-flow bypass surgery and 73 received the circulatory arrest approach. The researchers looked at two outcomes: the Psychomotor Development Index (PDI), which measures physiological development, and the Mental Development Index (MDI), which measures mental development. For both indices, higher scores indicate greater levels of development.

infant <- read.delim("https://github.com/IowaBiostat/data-sets/raw/main/infant-heart/infant-heart.txt")

Multivariate Regression

We have seen previously that we can extend our linear model to include more than one predictor or independent variable. For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:

data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    9.272      1.282    7.23 0.00000000131233453 ***
## dose           9.764      0.877   11.14 0.00000000000000063 ***
## suppVC        -3.700      1.094   -3.38              0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared:  0.704,  Adjusted R-squared:  0.693 
## F-statistic: 67.7 on 2 and 57 DF,  p-value: 0.000000000000000872

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:

  1. An intercept of 9.27 (note however that this is not a meaningful term as our independent variable dose never takes on the value 0)
  2. A slope of 9.76, indicating that each additional milligram per day of vitamin C resulted in a 9.76 unit increase in odontoblast length
  3. A -3.7 associated with the indicator for ascorbic acid. As this is an indicator term, it effectively shifts the regression line vertically according to its sign. Additionally, because the indicator is for ascorbic acid, it tells us that orange juice has become the reference level (see the sketch below)
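
To see the 0/1 indicator column that R builds for supp, or to change which group serves as the reference level, we can use model.matrix() and relevel(); a quick sketch (this refit is not part of the model above, just an illustration):

## Inspect the indicator (dummy) coding R creates for supp
head(model.matrix(len ~ dose + supp, ToothGrowth))

## Refit with ascorbic acid (VC) as the reference; the OJ indicator becomes +3.7
lm(len ~ dose + relevel(supp, ref = "VC"), ToothGrowth) %>% coef()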

A plot of this regression shows the data along with the two fitted lines; one way to draw it (a sketch) is to add the model's fitted values to the data and plot them as lines, one per delivery method:
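
## Store the fitted model (called fit_tooth here)
fit_tooth <- lm(len ~ dose + supp, ToothGrowth)

## Plot the raw data along with the two parallel fitted lines
ToothGrowth %>%
  mutate(pred = fitted(fit_tooth)) %>%
  ggplot(aes(dose, len, color = supp)) +
  geom_point() +
  geom_line(aes(y = pred))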

Looking at this, we see indeed that the intercept for the red line (orange juice) is about 9.27, while the intercept for ascorbic acid is shifted about 3.7 units lower. The slope of the two lines remains the same.

Question 3: Consider the dog dataset we used in class.

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

Question 4: For this question, we will be creating a linear model to predict enrollment at primarily undergraduate institutions in the United States.

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")