Lab 9 - Inference with Regression

library(dplyr)
library(ggplot2)

# Pretty plots
theme_set(theme_bw())

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Re-introduction to Linear Regression in R

We were first introduced to linear regression in Lab 4. We continue this topic today, re-demonstrating the key functions in R that are used for linear regression, as well as some new tools for doing so.

Simple Linear Regression

Recall that simple linear regression (as opposed to multivariate linear regression) is a continuous/quantitative outcome variable with one covariate, assumed to be of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between our predicted \(y\) of any observation and the true value,

\[ r_i = \hat{y}_i - y_i \] Ultimately, the predicted values for \(\beta\) are those chosen as to minimize the residual squared error, \(\sum_{i=1}^n r_i^2\).

For practice, we will use the mtcars dataset that comes built-in to R. You can use ?mtcars to get a description of each of the data variables

data(mtcars)
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Similar to ANOVA and the t-test functions, the lm() function (Linear Model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicles miles per gallon (mpg) using the size of the engine (disp)

## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)

## Summary function to display output
summary(fit)

## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.892 -2.202 -0.963  1.627  7.231 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 29.59985    1.22972   24.07 < 0.0000000000000002 ***
## disp        -0.04122    0.00471   -8.75        0.00000000094 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.25 on 30 degrees of freedom
## Multiple R-squared:  0.718,  Adjusted R-squared:  0.709 
## F-statistic: 76.5 on 1 and 30 DF,  p-value: 0.000000000938

From here, we get a lot of useful summary information. Try to identify the following:

The slope estimate (\(\beta\)) is \(\hat{\beta} = -0.041\), indicating that for each additional cubic inch in displacement, the vehicle is expected to have a reduction of 0.04 miles per gallon
Based on the p-value reported in the summary, we have evidence that the relationship between mpg and displacement is statistically significant, indicating a rejection of the null hypothesis that there is no linear relationship between the two
We have a multiple R-squared of 0.718, indicating that about 70% of the total variance in mpg is explained with vehicle displacement

It’s worth observing that the effect size, or the value of \(\hat{\beta}\), isn’t particularly large: without more context, it is difficult to see if this is a consequence of the effect size being small or if it is a consequence of differences in scale between mpg and engine displacement. We can investigate this further by creating a plot where we find evidence for the latter. Additionally, recall that we can also use the geom_smooth() function with ggplot to illustrate how our regression line will look

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) # remove standard error

The apparent strength of this relationship corroborates what we found in the linear model summary information above.

Question 1: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt)

Part A: What is the null hypothesis relating vehicle weight to horsepower?
Part B: Create a scatter plot showing the relationship between the outcome and predictor variables. Use geom_smooth(method = lm) to add a linear regression line. How does this relationship appear?
Part C: Construct a linear model testing your hypothesis from Part A. What do you find? Is it consistent with your plot in Part B?

Question 2: Some infants are born with congenital heart defects and require surgery very early in life. One approach to performing this surgery is known as “circulatory arrest.” A downside of this procedure, however, is that it cuts off the flow of blood to the brain, possibly resulting in brain damage. An alternative procedure, “low-flow bypass” maintains circulation to the brain, but does so with an external pump that potentially causes other sorts of injuries to the brain.

To investigate the treatments, surgeons at Harvard Medical School conducted a randomized controlled trial. In the trial, 70 infants received low-flow bypass surgery and 73 received the circulatory arrest approach. The researchers looked at two outcomes: the Psychomotor Development Index (PDI), which measures physiological development, and the Mental Development Index (MDI), which measures mental development. For both indices, higher scores indicate greater levels of development.

infant <- read.delim("https://github.com/IowaBiostat/data-sets/raw/main/infant-heart/infant-heart.txt")

Part A: Suppose we are first interested in investigating the effects of Treatment on PDI. What would be our null hypothesis in this case?
Part B: Perform a two-sample t-test to investigate your hypothesis in Part A. What do you conclude?
Part C: Now perform the same test using linear regression. From this, what would you conclude?
Part D: Again looking at the linear model summary output, find and report the value for the multiple \(R^2\). Interpret what this means
Part E: Based on your answer for Part D, does it appear that Treatment status would be a good way to predict an individual’s PDI? Does this information change the conclusions that we drew about the difference between Treatment groups in Part B and Part C?

Multivariate Regression

We have seen previously that we can extend our linear model to include more than one predictor or independent variables. For example, in one of our studies we consider the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:

data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()

## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    9.272      1.282    7.23 0.00000000131233453 ***
## dose           9.764      0.877   11.14 0.00000000000000063 ***
## suppVC        -3.700      1.094   -3.38              0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared:  0.704,  Adjusted R-squared:  0.693 
## F-statistic: 67.7 on 2 and 57 DF,  p-value: 0.000000000000000872

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:

An intercept of 9.27 (note however that this is not a meaningful term as our independent variable dose never takes on the value 0)
A slope of 9.76, indicating that every additional milligram of vitamin C resulted in 9.76 millimeter change in odontoblast length
A -3.7 associated with the indicator for ascorbic acid. As this is an indicator term, it effectively shifts the regression line vertically, according to the sign. Additionally, because this is an indicator for ascorbic acid, it tells us that orange juice has become the reference variable

A plot of this regression looks like this:

Looking at this, we see indeed that the intercept for the red line is about 9.2, with the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.

Question 3: Consider the dog dataset we used in class

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

Part A: Construct a linear model with speed as the dependent variable and include both color and size as the independent variables. Looking at the summary output, what variable(s) have become the reference variables?
Part B: We saw previously in ANOVA that there were no statistically significant differences in speed between small, medium, or large dogs. Explain why you think the summary output here indicates that each of these groups is statistically significant (hint: reference variable)
Part C: Based on this model, what speed would you predict from a medium sized white dog?

Question 4: For this question, we will be creating a linear model to predict enrollment at primary undergraduate institutions in the United States

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Part A: Create a linear model with Enrollment as our dependent variable and ACT_median and Private as our independent variables. Using the summary information, explain how each of these variables impacts predicted enrollment
Part B: Create two scatter plots investigating the relationship between ACT_median and both Debt_median and Salary10yr_median. Which appears to be most correlated with ACT_median? Which do you think would be a better variable to include in our linear model?
Part C: Create two additional linear models, one adding Debt_median to the model, the other adding Salary10yr_median. Is what you found consistent with what you chose in Part B? What metric(s) would you use in support of your choice.

Lab 9 - Inference with Regression

2024-05-03

Re-introduction to Linear Regression in R

Simple Linear Regression

Multivariate Regression