library(dplyr)
library(ggplot2)
# Pretty plots
theme_set(theme_bw())
college <- read.csv("https://collinn.github.io/data/college2019.csv")
We were first introduced to linear regression in Lab 5. We continue this topic today, re-demonstrating the key functions in R that are used for linear regression, as well as some new tools for doing so.
A point of confusion in the last lab was the question: what is meant by: “create a linear regression model”
Any time we are asked to construct a model, we will need to use the
lm()
function (explained again below) to create a model and
provide us with estimates of coefficient and model diagnostics. To be
clear, creating a fitted line in ggplot does not constitute a
linear model, though it does allow us to visualize it.
Recall that simple linear regression (as opposed to multivariate linear regression) is a continuous/quantitative outcome variable with one covariate, assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between our predicted \(y\) of any observation and the true value,
\[ r_i = \hat{y}_i - y_i \] Ultimately, the predicted values for \(\beta\) are those chosen as to minimize the residual squared error, \(\sum_{i=1}^n r_i^2\). As a call-back to ANOVA, this is akin to minimizing within-group variability.
For practice, we will use the mtcars
dataset that comes
built-in to R. You can use ?mtcars
to get a description of
each of the data variables
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Similar to ANOVA and the t-test functions, the lm()
function (Linear Model) operates using
a formula syntax with an outcome and predictor variables. Here, we use
lm()
to estimate a vehicles miles per gallon
(mpg
) using the size of the engine (disp
)
## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)
## Summary function to display output
summary(fit)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.892 -2.202 -0.963 1.627 7.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.59985 1.22972 24.07 < 0.0000000000000002 ***
## disp -0.04122 0.00471 -8.75 0.00000000094 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.25 on 30 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.709
## F-statistic: 76.5 on 1 and 30 DF, p-value: 0.000000000938
From here, we get a lot of useful summary information. Try to identify the following:
It’s worth observing that the effect size, or the value of
\(\hat{\beta}\), isn’t particularly
large: without more context, it is difficult to see if this is a
consequence of the effect size being small or if it is a consequence of
differences in scale between mpg and engine displacement. We can
investigate this further by creating a plot where we find evidence for
the latter. Additionally, recall that we can also use the
geom_smooth()
function with ggplot to illustrate how our
regression line will look
ggplot(mtcars, aes(disp, mpg)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) # remove standard error
The apparent strength of this relationship corroborates what we found in the linear model summary information above.
Question 1: For this question, we will be using the
mtcars
dataset to investigate the relationship between
horsepower (hp
) and vehicle weight (wt
)
geom_smooth(method = lm)
to add a linear regression line.
How does this relationship appear?Question 2: Some infants are born with congenital heart defects and require surgery very early in life. One approach to performing this surgery is known as “circulatory arrest.” A downside of this procedure, however, is that it cuts off the flow of blood to the brain, possibly resulting in brain damage. An alternative procedure, “low-flow bypass” maintains circulation to the brain, but does so with an external pump that potentially causes other sorts of injuries to the brain.
To investigate the treatments, surgeons at Harvard Medical School conducted a randomized controlled trial. In the trial, 70 infants received low-flow bypass surgery and 73 received the circulatory arrest approach. The researchers looked at two outcomes: the Psychomotor Development Index (PDI), which measures physiological development, and the Mental Development Index (MDI), which measures mental development. For both indices, higher scores indicate greater levels of development.
infant <- read.delim("https://github.com/IowaBiostat/data-sets/raw/main/infant-heart/infant-heart.txt")
We have seen previously that we can extend our linear model to include more than one predictor or independent variables. For example, in one of our studies we consider the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.272 1.282 7.23 0.00000000131233453 ***
## dose 9.764 0.877 11.14 0.00000000000000063 ***
## suppVC -3.700 1.094 -3.38 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared: 0.704, Adjusted R-squared: 0.693
## F-statistic: 67.7 on 2 and 57 DF, p-value: 0.000000000000000872
Written as an equation, this has the form
\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:
dose
never takes on the
value 0)A plot of this regression looks like this:
Looking at this, we see indeed that the intercept for the red line is about 9.2, with the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.
Question 3: Consider the dog dataset we used in class
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
speed
as the dependent variable and include both
color
and size
as the independent variables.
Looking at the summary output, what variable(s) have become the
reference variables?