library(ggplot2)
library(dplyr)
theme_set(theme_bw())
Question 0: Type the following: I will not copy and paste text from this lab into my R Markdown document because if I do it will not be able to render.
This lab is intended to introduce you to ANOVA with the
aov() function in R which, similar to the
t.test() function, allows us to conduct a hypothesis test with
our observed data. Also like the t.test() function,
aov() uses a syntax that requires a formula of the form
outcome ~ group where outcome is a
quantitative variable and group must be categorical. For
example, to compare city miles per gallon based on class of vehicle in
the mpg dataset, we would simply do
## Create the model
aov_model <- aov(cty ~ class, data = mpg)
## Summary information provides the table
summary(aov_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 2295 383 45.1 <0.0000000000000002 ***
## Residuals 227 1925 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this, we can find our degrees of freedom, the value of our F statistic, and a p-value.
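These same pieces can also be pulled out of the table programmatically. A minimal sketch, using the built-in chickwts data as a stand-in example (an assumption here, not part of the lab):

```r
## Sketch: extracting Df, F, and p from an ANOVA table.
## chickwts is a built-in dataset (chick weight by feed type).
fit <- aov(weight ~ feed, data = chickwts)
tab <- summary(fit)[[1]]   # summary() returns a list holding the table
tab[1, "Df"]               # degrees of freedom for feed: 5
tab[1, "F value"]          # F statistic
tab[1, "Pr(>F)"]           # p-value
```

Indexing by row position (rather than row name) avoids the trailing whitespace R pads onto the table's row names.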
Question 1: This question will be using the
dogs dataset from class:
## Read in dogs
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
For each of the variables, breed, color,
and size, do the following:
Create a jitter plot showing the distribution of speed for each of the categories with an additional marker for the mean. (We have done this before in this lab)
Comment on the relationship between the within group and between group variability. Do the group means appear to be similar?
Perform an ANOVA to test whether or not there appears to be evidence of an association between the categorical variable and speed. Report the F statistic and p-value, and indicate whether you would reject or fail to reject at \(\alpha = 0.05\)
As noted previously, ANOVA is specifically concerned with testing the null hypothesis of equality between means for multiple groups,
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \] Should we perform an ANOVA and reject our null hypothesis, we only know that at least two of our group means are different. Post-hoc (Latin for “after this” or “after the fact”) pairwise testing can then be done to determine which of the pairwise differences are likely responsible.
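One simple post-hoc approach in base R is pairwise.t.test(), which runs every pairwise comparison and adjusts the p-values for multiple testing. A minimal sketch on the built-in chickwts data (an assumption here, used only for illustration):

```r
## Sketch: all pairwise t-tests between feed groups, with
## Holm-adjusted p-values to account for multiple comparisons
pairwise.t.test(chickwts$weight, chickwts$feed, p.adjust.method = "holm")
```

This lab instead uses the Tukey procedure shown next, which additionally produces family-wise confidence intervals for each difference.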
Consider again our car dataset in which we wish to test for equality
in average city miles per gallon between different drive trains. This is
done simply with the aov() function
## Create the model
aov_model <- aov(cty ~ drv, data = mpg)
summary(aov_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## drv 2 1879 939 92.7 <0.0000000000000002 ***
## Residuals 231 2342 10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we see information on the squared error from the grouping and our residuals, along with an F-statistic and a p-value. If we were testing at the \(\alpha = 0.05\) level, we would reject the null hypothesis, as the p-value is far below 0.05 (\(p < 2 \times 10^{-16}\)).
To determine which pairwise drive trains had a difference in
city miles-per-gallon, we can use the TukeyHSD() function
(Tukey’s Honest Significant Difference) on the model object
we created above:
## Pass in output from aov() function
comp <- TukeyHSD(aov_model)
comp
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ drv, data = mpg)
##
## $drv
## diff lwr upr p adj
## f-4 5.6416 4.6025 6.6807 0.00000
## r-4 -0.2501 -1.9246 1.4244 0.93389
## r-f -5.8917 -7.5615 -4.2219 0.00000
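The confidence level of these intervals can be adjusted. A small sketch, again using the built-in chickwts data as a stand-in (an assumption, for self-containment):

```r
## Sketch: requesting 99% family-wise intervals instead of the default 95%
fit    <- aov(weight ~ feed, data = chickwts)
comp99 <- TukeyHSD(fit, conf.level = 0.99)
head(comp99$feed)      # lwr/upr are wider than at the 95% level
```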
There are a few things to note here:
The reported p-values (the p adj column) have been adjusted to account for the multiple comparisons being performed.
We can change the confidence level used by the TukeyHSD() function by passing in an
argument for conf.level.
Finally, we can use the output of TukeyHSD() with the plot() function to
help visualize our confidence intervals for the differences:
plot(comp)
We see that the intervals for front vs. 4-wheel drive and for rear vs. front
drive are the only ones that do not contain 0.
Question 2: Consider the ANOVA models you created in Question 1. Of the ones in which there was evidence to reject the null hypothesis, perform a post-hoc test to determine between which individual groups there was a statistically significant difference.
This portion of the lab introduces simple linear regression with the
lm() function. Recall that simple linear regression (as opposed to multivariate linear regression) models a continuous/quantitative outcome variable with one covariate, assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the observed value of any observation and its predicted value,
\[ r_i = y_i - \hat{y}_i \] Ultimately, the estimated values for \(\beta\) are those chosen so as to minimize the sum of squared residuals, \(\sum_{i=1}^n r_i^2\). As a call-back to ANOVA, this is akin to minimizing within-group variability.
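As a quick check of these definitions, we can compute the residuals by hand and compare against R’s residuals(); a minimal sketch using the built-in mtcars data (which also appears later in this lab). Note that sign conventions vary; here residuals are computed as observed minus fitted, matching residuals().

```r
## Sketch: residuals by hand vs. residuals() from a fitted model
fit <- lm(mpg ~ disp, data = mtcars)
r <- mtcars$mpg - fitted(fit)                  # observed minus fitted
all.equal(unname(r), unname(residuals(fit)))   # TRUE
sum(r^2)                                       # the quantity least squares minimizes

## Predicted values for new X come from predict()
predict(fit, newdata = data.frame(disp = 200)) # about 21.4 mpg
```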
For practice, we will use the mtcars dataset that comes
built-in to R. You can use ?mtcars to get a description of
each of the data variables
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Similar to ANOVA and the t-test functions, the lm()
function (Linear Model) operates using
a formula syntax with an outcome and predictor variables. Here, we use
lm() to estimate a vehicle's miles per gallon
(mpg) using the size of the engine (disp):
## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)
## Summary function to display output
summary(fit)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.892 -2.202 -0.963 1.627 7.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.59985 1.22972 24.07 < 0.0000000000000002 ***
## disp -0.04122 0.00471 -8.75 0.00000000094 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.25 on 30 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.709
## F-statistic: 76.5 on 1 and 30 DF, p-value: 0.000000000938
From here, we get a lot of useful summary information. Try to identify the following: the estimated coefficients, their standard errors, the associated p-values, and the R-squared value.
It’s worth observing that the effect size, or the value of
\(\hat{\beta}\), doesn’t seem
particularly large: without more context, it is difficult to tell whether this
is a consequence of the effect size truly being small or a
consequence of differences in scale between mpg and engine displacement.
We can investigate this further by creating a plot, where we will find
evidence for the latter. We can add a visualization of simple linear
regression by using geom_smooth(method = lm) as a layer in
a ggplot.
ggplot(mtcars, aes(disp, mpg)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) # remove standard error
The apparent strength of this relationship corroborates what we found in the linear model summary information above.
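To see the scale issue concretely, we can refit the model with a standardized predictor; a sketch (scale() here is base R, and this refit is not part of the lab itself):

```r
## Sketch: the slope per cubic inch looks small, but per standard
## deviation of displacement the effect is substantial
fit   <- lm(mpg ~ disp, data = mtcars)
coef(fit)["disp"]                      # about -0.041 mpg per cubic inch
fit_z <- lm(mpg ~ scale(disp), data = mtcars)
coef(fit_z)[2]                         # about -5.1 mpg per SD of disp
```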
Question 3: For this question, we will be using the
mtcars dataset to investigate the relationship between
horsepower (hp) and vehicle weight (wt).
Part A: State a hypothesis about the relationship between horsepower and vehicle weight.
Part B: Create a scatter plot of these two variables and use
geom_smooth(method = lm) to add a linear regression line.
How does this relationship appear?
Part C: Fit a simple linear regression model (with
lm()) testing your hypothesis from Part A. What do you
find? Is it consistent with your plot in Part B?
Question 4: Some infants are born with congenital heart defects and require surgery very early in life. One approach to performing this surgery is known as “circulatory arrest.” A downside of this procedure, however, is that it cuts off the flow of blood to the brain, possibly resulting in brain damage. An alternative procedure, “low-flow bypass,” maintains circulation to the brain, but does so with an external pump that potentially causes other sorts of injuries to the brain.
To investigate the treatments, surgeons at Harvard Medical School conducted a randomized controlled trial. In the trial, 70 infants received low-flow bypass surgery and 73 received the circulatory arrest approach. The researchers looked at two outcomes: the Psychomotor Development Index (PDI), which measures physiological development, and the Mental Development Index (MDI), which measures mental development. For both indices, higher scores indicate greater levels of development.
infant <- read.delim("https://github.com/IowaBiostat/data-sets/raw/main/infant-heart/infant-heart.txt")
Part A: State a hypothesis about how the two surgical approaches compare on the development outcomes.
Part B: Using t.test(), perform a
t-test to investigate your hypothesis in Part A. What do you
conclude?
We have seen previously that we can extend our linear model to include more than one predictor or independent variable. For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. There were two separate variables of interest in the study: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
## Load data
data(ToothGrowth)
## We add additional variables to the model with +
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.272 1.282 7.23 0.00000000131233453 ***
## dose 9.764 0.877 11.14 0.00000000000000063 ***
## suppVC -3.700 1.094 -3.38 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 57 degrees of freedom
## Multiple R-squared: 0.704, Adjusted R-squared: 0.693
## F-statistic: 67.7 on 2 and 57 DF, p-value: 0.000000000000000872
Written as an equation, this has the form
\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \]
That is, we find:
An intercept of 9.27 (though note this has no direct interpretation, as dose never takes on the
value 0)
A common slope of 9.764 for dose, shared by both delivery methods
A downward shift of 3.7 in the intercept for the ascorbic acid (VC) group
A plot of this regression looks like this:
Looking at this, we see indeed that the intercept for the red line is about 9.27, while the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.
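We can verify the parallel-lines interpretation directly from the fitted model; a small sketch:

```r
## Sketch: at the same dose, the two supp groups differ by exactly
## the suppVC coefficient (-3.7)
fit <- lm(len ~ dose + supp, data = ToothGrowth)
nd  <- data.frame(dose = c(1, 1), supp = c("OJ", "VC"))
predict(fit, nd)          # second value is 3.7 lower than the first
diff(predict(fit, nd))    # matches coef(fit)["suppVC"]
```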
Question 5: Consider the dogs dataset we used in class:
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
Fit a linear model with speed as the dependent variable, including both
color and size as the independent variables.
Looking at the summary output, what variable(s) have become the
reference variables?