library(ggplot2)
library(dplyr)
theme_set(theme_bw())
This lab introduces ANOVA with the aov() function in R which, like the t.test() function, allows us to conduct a hypothesis test with our observed data. Also like t.test(), aov() uses a syntax that requires a formula of the form outcome ~ group. For example, to compare city miles per gallon across classes of vehicle in the mpg dataset, we would run:
aov(cty ~ class, data = mpg) %>% summary()
##              Df Sum Sq Mean Sq F value              Pr(>F)
## class         6   2295     383    45.1 <0.0000000000000002 ***
## Residuals   227   1925       8
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this, we can find our degrees of freedom, the value of our F statistic, and a p-value.
Don’t forget to include %>% summary(): printing the result of aov() on its own shows the sums of squares and degrees of freedom, but not the F statistic or the p-value.
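To see where the numbers in the ANOVA table come from, the following is a minimal sketch that rebuilds the F statistic by hand from the sums of squares and degrees of freedom. The names fit, tab, mean_sq, and f_by_hand are chosen here for illustration and are not part of the lab.

```r
library(ggplot2)  # provides the mpg dataset

# Fit the same model as above and pull out the ANOVA table
fit <- aov(cty ~ class, data = mpg)
tab <- summary(fit)[[1]]  # a data frame with one row per source of variation

# Each mean square is a sum of squares divided by its degrees of freedom
mean_sq <- tab[["Sum Sq"]] / tab[["Df"]]

# The F statistic is the between-group mean square over the residual mean square
f_by_hand <- mean_sq[1] / mean_sq[2]

f_by_hand            # computed by hand
tab[["F value"]][1]  # reported by summary()
```

The two printed values should agree with each other and, up to rounding, with the F value column in the table.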
This lab asks you to construct several boxplots. These are fine, but as an alternative you might also consider using a jitter plot to help visualize the individual observations. Setting a small height or width argument in geom_jitter() limits how far the points are spread, which keeps them from overlapping across categories.
ggplot(mpg, aes(cty, class)) + geom_jitter(height = 0.15, size = 2)
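The two displays can also be combined. The sketch below layers jittered points over boxplots for the same mpg data. The settings outlier.shape = NA, width = 0, and alpha = 0.5 are illustrative choices, not requirements of the lab.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, class)) +
  # Hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Jitter only vertically (within each class); leave the cty values unchanged
  geom_jitter(height = 0.15, width = 0, alpha = 0.5)

p
```

Setting width = 0 matters here because geom_jitter() otherwise adds a default amount of horizontal noise, which would shift the plotted mileage values.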
Here is a curated version of the data collected in class:
planes <- read.csv("https://collinn.github.io/data/sta209_s25_planes.csv")
Note that some minor modifications have been made; colors have been grouped and designs with too few observations have been dropped.
Using the dataset provided, answer the following questions:
Question 1: How many observations are in our dataset? What variables are in this dataset? What values do they have?
Question 2: What is our outcome variable in this data? Which variables do you think will be the most helpful in explaining observed variance? Which do you think will be the least helpful?
These next questions will help us begin exploring our dataset.
Question 3: Create boxplots comparing distance with each of the categorical variables (Design, Color, Hand, and Section). Offer a brief description of what you see in each of these plots.
Question 4: Suppose you were to run this study again from the beginning. What other factors in the design of our experiment would you consider changing or controlling for? In other words, can you identify any additional sources of variability that may influence flight distance but are missing from our data?
Here, we will practice analyzing data using the aov() function.
Question 5: When performing ANOVA, what form does our null hypothesis take? Give an example by stating the null hypothesis relating flight distance to class section.
Question 6: Perform an ANOVA analyzing the relationship between flight distance and paper color.
Question 7: Perform an ANOVA analyzing the relationship between flight distance and handedness.
Question 8: Perform an ANOVA analyzing the relationship between flight distance and design.
Question 9: Based on Questions 6-8, which variable would you use in a model to try and determine the flight distance of a particular plane? Does this match your expectations from the beginning of the lab?
As noted previously, ANOVA is specifically concerned with testing the null hypothesis of equality between means for multiple groups,
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \]
Should we perform an ANOVA and reject our null hypothesis, we know only that at least two of our group means are different. Post-hoc (Latin for “after this” or “after the fact”) pairwise testing can be done to determine which pairwise differences are likely responsible.
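To see why post-hoc procedures adjust for multiple comparisons, it helps to count how many pairwise tests k groups generate and how quickly the chance of at least one false rejection grows. The short calculation below is a sketch that assumes the tests are independent, which pairwise comparisons that share groups are not, so treat the result as a rough guide only. The values k = 4 and alpha = 0.05 are examples.

```r
k <- 4         # number of groups (example value)
alpha <- 0.05  # significance level for each individual test

# Number of distinct pairs among k groups
n_pairs <- choose(k, 2)

# Chance of at least one false rejection across all pairs if every null
# hypothesis is true, ASSUMING the tests are independent (an approximation)
familywise <- 1 - (1 - alpha)^n_pairs

n_pairs
round(familywise, 3)
```

With four groups there are 6 pairs, and under this approximation the unadjusted chance of at least one false rejection is roughly 26%, far above the nominal 5%.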
Consider again our dogs dataset, in which we wish to test for equality in average speed between dogs of different colors. This is done simply with the aov() function:
## Read in dogs
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
## This will assign the results to a variable called model
model <- aov(speed ~ color, data = dogs)
summary(model)
##              Df Sum Sq Mean Sq F value Pr(>F)
## color         3   1652     551     4.3 0.0053 **
## Residuals   396  50746     128
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we see the sums of squares for the grouping variable and for the residuals, along with an F statistic and a p-value. If we were testing at the \(\alpha = 0.05\) level, we would reject the null hypothesis, since \(p = 0.0053 < 0.05\).
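The p-value in the table is the upper-tail area of an F distribution. As a rough check, the sketch below recomputes it from the rounded sums of squares and degrees of freedom printed in the dogs output, so the result will match the table only approximately. The variable names are illustrative.

```r
# Values copied from the summary(model) output for the dogs data (rounded)
ss_color <- 1652;  df_color <- 3
ss_resid <- 50746; df_resid <- 396

# F statistic: ratio of the two mean squares
f_stat <- (ss_color / df_color) / (ss_resid / df_resid)

# p-value: probability of an F statistic at least this large under the null
p_val <- pf(f_stat, df_color, df_resid, lower.tail = FALSE)

round(f_stat, 1)
p_val < 0.05  # TRUE means we reject the null hypothesis at the 0.05 level
```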
To determine which pairs of colors differ, we can use the TukeyHSD() function (Tukey's honest significant difference) on the model object we created above:
## Pass in output from aov() function
comp <- TukeyHSD(model)
comp
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = speed ~ color, data = dogs)
##
## $color
##                 diff       lwr      upr   p adj
## brown-black   1.9612  -2.65664  6.57906 0.69237
## white-black   4.2360  -0.38182  8.85388 0.08529
## yellow-black -2.3968  -5.97373  1.18021 0.31012
## white-brown   2.2748  -3.56635  8.11599 0.74672
## yellow-brown -4.3580  -9.41657  0.70063 0.11889
## yellow-white -6.6328 -11.69139 -1.57418 0.00437
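There are a few things to note here. Each row compares one pair of colors, giving the estimated difference in means (diff), the lower and upper bounds of its confidence interval (lwr and upr), and a p-value adjusted for multiple comparisons (p adj). The family-wise confidence level, 95% here, can be changed in the TukeyHSD() function by passing in an argument for conf.level.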
From this output, we see that the only statistically significant difference is between yellow and white.
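As a sketch of how conf.level changes the result, the example below uses the built-in mpg data rather than the dogs data so that it runs without a download. The 99% level and the names fit, comp95, comp99, width95, and width99 are illustrative assumptions.

```r
library(ggplot2)  # provides the mpg dataset

fit <- aov(cty ~ class, data = mpg)

comp95 <- TukeyHSD(fit)                     # default: conf.level = 0.95
comp99 <- TukeyHSD(fit, conf.level = 0.99)  # wider, more conservative intervals

# Each result is a list with one matrix per factor; its columns are
# diff, lwr, upr, and "p adj"
width95 <- comp95$class[, "upr"] - comp95$class[, "lwr"]
width99 <- comp99$class[, "upr"] - comp99$class[, "lwr"]
all(width99 > width95)

# Keep only the pairs whose adjusted p-value falls below 0.05
comp95$class[comp95$class[, "p adj"] < 0.05, , drop = FALSE]
```

Because the comparisons are stored as an ordinary matrix, the last line is one way to pull out the significant pairs without scanning the full table by eye.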
Finally, we can plot the output from the TukeyHSD() function with a call to the base R function plot():
## Pass in output from TukeyHSD() function
plot(comp)
Note again that the only confidence interval that does not contain 0 (the difference in means under the null hypothesis of each pairwise test) is the one between yellow and white, consistent with the output we observed above.
Question 10: Consider the ANOVA models you created in Questions 6-8. For each model in which there was evidence to reject the null hypothesis, perform a post-hoc test to determine which individual groups differed by a statistically significant amount.