library(ggplot2)
library(dplyr)

theme_set(theme_bw())

Question 0: Type the following: I will not copy and paste text from this lab into my R Markdown document because if I do it will not be able to render.

ANOVA

This lab is intended to introduce to ANOVA with the aov() function in R that, similar to the t.test() function, allows us conduct a hypothesis test with our observed data. Also like the t.test() function, aov() utilizes a syntax that requires a formula of the form outcome ~ group where outcome is a quantitative variable and group must be categorical. For example, to compare city miles per gallon based on class of vehicle in the mpg dataset, we would simply do

## Create the model
aov_model <- aov(cty ~ class, data = mpg)

## Summary information provides the table
summary(aov_model)
##              Df Sum Sq Mean Sq F value              Pr(>F)    
## class         6   2295     383    45.1 <0.0000000000000002 ***
## Residuals   227   1925       8                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this, we can find our degrees of freedom, the value of our F statistic, and a p-value.

Practice

Question 1: This question will be using the dogs dataset from class:

## Read in dogs
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

For each of the variables, breed, color, and size, do the following:

  • Create a jitter plot showing the distribution of speed for each of the categories with an additional marker for the mean. (We have done this before in this lab)

  • Comment on the relationship between the within group and between group variability. Do the group means appear to be similar?

  • Perform an ANOVA to test whether or not there appears to be evidence of an association between the categorical variable and speed. Report the F statistic and p-value, and indicate whether you would reject or fail to reject at \(\alpha = 0.05\)

Post-Hoc Testing

As noted previously, ANOVA is specifically concerned with testing the null hypothesis of equality between means for multiple groups,

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \] Should we perform an ANOVA and reject our null hypothesis, we only know that at least two of our group means are different. Post-hoc pairwise testing (Latin for “after this” or “after the fact”) can be done to determine which of our pairwise differences are likely responsible.

Consider again our car dataset in which we wish to test for equality in average city miles per gallon between different drive trains. This is done simply with the aov() function

## Create the model
aov_model <- aov(cty ~ drv, data = mpg)

summary(aov_model)
##              Df Sum Sq Mean Sq F value              Pr(>F)    
## drv           2   1879     939    92.7 <0.0000000000000002 ***
## Residuals   231   2342      10                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we see information on the squared error from the grouping and our residuals, along with an F-statistic and a p-value. If we were testing at the \(\alpha = 0.05\) level, we would reject this test as \(p-val = 0.0053\).

To determine which pairwise classes of vehicle had a difference in city miles-per-gallon, we can use the TukeyHSD() function (Tukey honest statistical difference) on the model object we created above:

## Pass in output from aov() function
comp <- TukeyHSD(aov_model)
comp
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cty ~ drv, data = mpg)
## 
## $drv
##        diff     lwr     upr   p adj
## f-4  5.6416  4.6025  6.6807 0.00000
## r-4 -0.2501 -1.9246  1.4244 0.93389
## r-f -5.8917 -7.5615 -4.2219 0.00000

There are a few things to note here:

  1. First, we see that it gives us a point estimate of the difference in means as the first column in the output
  2. Next, we get a confidence interval for the difference for lower and upper bounds. By default, this is a 95% confidence interval, but we can change this in the TukeyHSD() function by passing in an argument for conf.level
  3. Finally, we see that the last column gives us an adjusted p-value. That is, rather than adjusting \(\alpha^* = \alpha/m\) and comparing the original p-values, it adjust the p-values that we can compare with our regular \(\alpha\). In either case, the conclusions that we should come to will be the same.

Finally, we can use TukeyHSD() with the plot function to help visualize our confidence intervals for the difference

plot(comp)

We see that the difference between rear and front and 4-wheel drive and front are the only ones that do not contain 0.

Question 2: Consider the ANOVA models you created in Questions 1 Of the ones in which there was evidence to reject the null hypothesis, perform a post-hoc test to determine between which individual groups there was a statistically significant difference

Regression

This portion of the lab has three goals:

  • Introduce you to the linear model function lm()
  • Visualizing linear regression in ggplot
  • Using model output to perform statistical inference

Simple Linear Regression

Recall that simple linear regression (as opposed to multivariate linear regression) is a continuous/quantitative outcome variable with one covariate, assumed to be of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between our predicted \(y\) of any observation and the true value,

\[ r_i = \hat{y}_i - y_i \] Ultimately, the predicted values for \(\beta\) are those chosen as to minimize the residual squared error, \(\sum_{i=1}^n r_i^2\). As a call-back to ANOVA, this is akin to minimizing within-group variability.


For practice, we will use the mtcars dataset that comes built-in to R. You can use ?mtcars to get a description of each of the data variables

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Similar to ANOVA and the t-test functions, the lm() function (Linear Model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicles miles per gallon (mpg) using the size of the engine (disp)

## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)

## Summary function to display output
summary(fit)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.892 -2.202 -0.963  1.627  7.231 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 29.59985    1.22972   24.07 < 0.0000000000000002 ***
## disp        -0.04122    0.00471   -8.75        0.00000000094 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.25 on 30 degrees of freedom
## Multiple R-squared:  0.718,  Adjusted R-squared:  0.709 
## F-statistic: 76.5 on 1 and 30 DF,  p-value: 0.000000000938

From here, we get a lot of useful summary information. Try to identify the following:

  1. The slope estimate (\(\beta\)) is \(\hat{\beta} = -0.041\), indicating that for each additional cubic inch in displacement, the vehicle is expected to have a reduction of 0.04 miles per gallon
  2. Based on the p-value reported in the summary, we have evidence that the relationship between mpg and displacement is statistically significant, indicating a rejection of the null hypothesis that there is no linear relationship between the two
  3. We have a multiple R-squared of 0.718, indicating that about 70% of the total variance in mpg is explained with vehicle displacement

It’s worth observing that the effect size, or the value of \(\hat{\beta}\), doesn’t seem particularly large: without more context, it is difficult to see if this is a consequence of the effect size being small or if it is a consequence of differences in scale between mpg and engine displacement. We can investigate this further by creating a plot where we find evidence for the latter. We can add a visualization of simple linear regression by using geom_smooth(method = lm) as a layer in a ggplot.

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) # remove standard error

The apparent strength of this relationship corroborates what we found in the linear model summary information above.

Question 3: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt)

  • Part A: What is the null hypothesis relating vehicle weight to horsepower?
  • Part B: Create a scatter plot showing the relationship between the outcome and predictor variables. Use geom_smooth(method = lm) to add a linear regression line. How does this relationship appear?
  • Part C: Construct a linear model (lm()) testing your hypothesis from Part A. What do you find? Is it consistent with your plot in Part B?

Question 4: Some infants are born with congenital heart defects and require surgery very early in life. One approach to performing this surgery is known as “circulatory arrest.” A downside of this procedure, however, is that it cuts off the flow of blood to the brain, possibly resulting in brain damage. An alternative procedure, “low-flow bypass” maintains circulation to the brain, but does so with an external pump that potentially causes other sorts of injuries to the brain.

To investigate the treatments, surgeons at Harvard Medical School conducted a randomized controlled trial. In the trial, 70 infants received low-flow bypass surgery and 73 received the circulatory arrest approach. The researchers looked at two outcomes: the Psychomotor Development Index (PDI), which measures physiological development, and the Mental Development Index (MDI), which measures mental development. For both indices, higher scores indicate greater levels of development.

infant <- read.delim("https://github.com/IowaBiostat/data-sets/raw/main/infant-heart/infant-heart.txt")
  • Part A: Suppose we are first interested in investigating the effects of Treatment on PDI. What type of statistical test would we use and what would be our null hypothesis?
  • Part B: Using t.test(), perform a t-test to investigate your hypothesis in Part A. What do you conclude?
  • Part C: Now perform the same test using linear regression. From this, what would you conclude?
  • Part D: Again looking at the linear model summary output, find and report the value for the multiple \(R^2\). Interpret what this means
  • Part E: Based on your answer for Part D, does it appear that Treatment status would be a good way to predict an individual’s PDI? Does this information change the conclusions that we drew about the difference between Treatment groups in Part B and Part C?

Multivariate Regression

We have seen previously that we can extend our linear model to include more than one predictor or independent variables. For example, in one of our studies we consider the growth of odontoblasts in guinea pigs. There were two separate variables of interest in the study: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:

## Load data
data(ToothGrowth)

## We add additional variables to the model with +
lm(len ~ dose + supp, ToothGrowth) %>% summary()
## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    9.272      1.282    7.23 0.00000000131233453 ***
## dose           9.764      0.877   11.14 0.00000000000000063 ***
## suppVC        -3.700      1.094   -3.38              0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared:  0.704,  Adjusted R-squared:  0.693 
## F-statistic: 67.7 on 2 and 57 DF,  p-value: 0.000000000000000872

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \]

That is, we find:

  1. An intercept of 9.27 (note however that this is not a meaningful term as our independent variable dose never takes on the value 0)
  2. A slope of 9.76, indicating that every additional milligram of vitamin C resulted in 9.76 millimeter change in odontoblast length
  3. A -3.7 associated with the indicator for ascorbic acid. As this is an indicator term, it effectively shifts the regression line vertically, according to the sign. Additionally, because this is an indicator for ascorbic acid, it tells us that orange juice has become the reference variable

A plot of this regression looks like this:

Looking at this, we see indeed that the intercept for the red line is about 9.2, with the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.

Question 5: Consider the dog dataset we used in class

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
  • Part A: Construct a linear model with speed as the dependent variable and include both color and size as the independent variables. Looking at the summary output, what variable(s) have become the reference variables?
  • Part B: We saw previously in ANOVA that there were no statistically significant differences in speed between small, medium, or large dogs. Explain why you think the summary output here indicates that each of these groups is statistically significant (hint: reference variable)
  • Part C: Based on this model, what speed would you predict from a medium sized white dog?