library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
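The gh() helper above only returns a histogram layer, so it still has to be added to a plot. A minimal sketch of how it might be used, assuming the mpg data that ships with ggplot2 rather than the lab's own data:

```r
library(ggplot2)

## Same helper as defined above
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Histogram of city mpg using 15 bins instead of the default 10
p <- ggplot(mpg, aes(x = cty)) + gh(bins = 15)
p
```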
This lab explores the paper airplane data generated in class. Part of the lab involves performing an ANOVA, which we will do with the R function aov(). This function uses the same formula syntax as the t.test() function, taking a formula of the form outcome ~ group. For example, to compare city miles per gallon across vehicle classes in the mpg dataset, we would simply do
aov(cty ~ class, data = mpg) %>% summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 2295 383 45.1 <0.0000000000000002 ***
## Residuals 227 1925 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
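From this, we can find our degrees of freedom, the value of our F statistic, and a p-value. Don’t forget to include %>% summary() in order to make our output more useful: printing the aov object on its own shows the sums of squares and degrees of freedom, but not the F statistic or the p-value.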
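If the individual numbers are needed later (for example, to report them in a sentence), they can be pulled out of the summary table rather than copied by hand. This is a sketch only; the object names fit, tab, df_vals, f_stat, and p_val are made up for illustration:

```r
library(ggplot2)  # provides the mpg dataset

fit <- aov(cty ~ class, data = mpg)

## summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]

df_vals <- tab[["Df"]]          # degrees of freedom: groups, then residuals
f_stat  <- tab[["F value"]][1]  # F statistic for class
p_val   <- tab[["Pr(>F)"]][1]   # p-value for class

df_vals
f_stat
p_val
```

The rows are indexed by position here because the row names in the table are padded with spaces.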
For this lab, it will also be helpful to use the table() function. Recall that we can create one-way or two-way tables with the following syntax:
## One way
table(mpg$class)
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
## Two way
with(mpg, table(class, drv))
## drv
## class 4 f r
## 2seater 0 0 5
## compact 12 35 0
## midsize 3 38 0
## minivan 0 11 0
## pickup 33 0 0
## subcompact 4 22 9
## suv 51 0 11
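Raw counts can be hard to compare when the groups differ in size. As an optional extension, the base R functions addmargins() and prop.table() add totals and convert counts to proportions. A short sketch on the same mpg table, with tab and props as illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

tab <- with(mpg, table(class, drv))

## Add row and column totals
addmargins(tab)

## Proportion of each drive type within each class (each row sums to 1)
props <- prop.table(tab, margin = 1)
round(props, 2)
```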
Here is a curated version of the data collected in class:
planes <- read.csv("https://collinn.github.io/data/sta209_f24_planes.csv")
Using the dataset provided, answer the following questions:
Question 1: How many observations are in our dataset? What variables are in this dataset? What values do they have?
Question 2: What is our outcome variable in this data? What variables would we use to try to explain this outcome?
Question 3: Of the predictor variables identified in Question 2, which do you think will be the most helpful in explaining observed variance? Which do you think will be the least helpful?
These next questions will help us begin exploring our dataset.
Question 4: First, create a boxplot of distance without using any categories. Do you notice anything in this dataset that may be a point of concern?
Question 5: Create boxplots comparing distance with each of the three categorical variables (Design, Paper, and Color). Offer a brief description of what you see in each of these plots.
Question 6: Create a two-way table analyzing the relationship between Design and Paper. What do you notice?
Question 7: Now create a two-way table analyzing the relationship between Design and Color. How does this differ based on what we found in Question 6?
Question 8: Based on what you found in Question 6 and Question 7, can you anticipate any problems we may have trying to analyze the relationship between Distance and either paper weight or color?
Question 9: Suppose you were to recreate this study again from the beginning. What other factors in the design of our experiment would you consider changing or controlling for? In other words, can you identify any additional sources of variability that may influence flight distance that are
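As a reminder of the plotting syntax used in the exploratory questions above, here is a minimal sketch of an ungrouped and a grouped boxplot. It uses the mpg data as a stand-in, so the variable names would need to be swapped for those in the planes data:

```r
library(ggplot2)

## Single boxplot of a numeric variable, with no grouping
p_all <- ggplot(mpg, aes(x = cty)) + geom_boxplot()

## Boxplots of the same variable, split by a categorical variable
p_grp <- ggplot(mpg, aes(x = cty, y = class)) + geom_boxplot()

p_all
p_grp
```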
Here, we will practice analyzing data using the aov() function.
Question 10: State the null hypothesis for testing flight distance across any of the group variables in our dataset.
Question 11: Perform an ANOVA analyzing the relationship between flight distance and paper weight:
Question 12: Perform an ANOVA analyzing the relationship between flight distance and paper color:
Question 13: Perform an ANOVA analyzing the relationship between flight distance and design:
Question 14: Based on Questions 11-13, which variable would you use in a model to try and determine the flight distance of a particular plane? Does this match your expectations from the beginning of the lab?
As noted previously, ANOVA is specifically concerned with testing the null hypothesis of equality between means for multiple groups,
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \]
Should we perform an ANOVA and reject our null hypothesis, we only know that at least two of our group means are different. Post-hoc pairwise testing (post hoc is Latin for “after this” or “after the fact”) can then be done to determine which of our pairwise differences are likely responsible.
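One reason post-hoc procedures adjust their p-values is that the number of pairwise comparisons grows quickly with the number of groups. The arithmetic below is a rough illustration only: it assumes the pairwise tests are independent, which is not exactly true in practice, and uses an illustrative choice of four groups.

```r
k <- 4          # number of groups (illustrative)
alpha <- 0.05   # type I error rate for each individual comparison

## Number of distinct pairs among k groups
m <- choose(k, 2)

## Approximate chance of at least one false positive across all m tests
fwer <- 1 - (1 - alpha)^m

m
round(fwer, 3)
```

With four groups there are six pairs, and under these assumptions the chance of at least one false positive is roughly 26% rather than 5%.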
Consider again our dog dataset, in which we wish to test for equality in average speed between dogs of different colors. This is done simply with the aov() function:
## Read in dogs
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
## This will assign the results to a variable called model
model <- aov(speed ~ color, dogs)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## color 3 1652 551 4.3 0.0053 **
## Residuals 396 50746 128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we see the sums of squares for the grouping variable and for the residuals, along with an F statistic and a p-value. If we were testing at the \(\alpha = 0.05\) level, we would reject the null hypothesis, since \(p = 0.0053 < 0.05\).
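To see where the numbers in an ANOVA table come from, the F statistic can be rebuilt by hand from the group means. The sketch below uses the mpg data from earlier in the lab so that it runs without a download; all object names are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

y <- mpg$cty
g <- factor(mpg$class)

n <- length(y)    # total number of observations
k <- nlevels(g)   # number of groups

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

## Between-group and within-group (residual) sums of squares
ss_group <- sum(group_sizes * (group_means - grand_mean)^2)
ss_resid <- sum((y - ave(y, g))^2)   # ave() gives each row its group mean

## Mean squares, F statistic, and p-value
ms_group <- ss_group / (k - 1)
ms_resid <- ss_resid / (n - k)
f_stat   <- ms_group / ms_resid
p_val    <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

These values should agree with the output of aov(cty ~ class, data = mpg) %>% summary() shown at the start of the lab.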
To determine which pairs of colors differ, we can use the TukeyHSD() function (Tukey's honest significant difference) on the model object we created above:
## Pass in output from aov() function
comp <- TukeyHSD(model)
comp
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = speed ~ color, data = dogs)
##
## $color
## diff lwr upr p adj
## brown-black 1.9612 -2.65664 6.57906 0.69237
## white-black 4.2360 -0.38182 8.85388 0.08529
## yellow-black -2.3968 -5.97373 1.18021 0.31012
## white-brown 2.2748 -3.56635 8.11599 0.74672
## yellow-brown -4.3580 -9.41657 0.70063 0.11889
## yellow-white -6.6328 -11.69139 -1.57418 0.00437
There are a few things to note here:
The confidence level of the intervals (95% in the output above) can be changed in the TukeyHSD() function by passing in an argument for conf.level
From this output, we see that the only statistically significant difference is between yellow and white.
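With many groups, the TukeyHSD() output can get long, so it may help to filter it. The sketch below uses the mpg data again so that it runs without a download, sets conf.level to 0.99 purely as an example, and keeps only the pairs with a small adjusted p-value. The names cars, fit, comp, and res are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

## Make the grouping variable an explicit factor
cars <- transform(as.data.frame(mpg), class = factor(class))

fit  <- aov(cty ~ class, data = cars)
comp <- TukeyHSD(fit, conf.level = 0.99)

## Results for the class term: a matrix with one row per pair of groups
res <- comp$class

## Keep only the pairs whose adjusted p-value is below 0.01
res[res[, "p adj"] < 0.01, , drop = FALSE]
```

The result is stored under the name of the grouping variable, so for the dogs model above the matching element would be comp$color.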
Finally, we can plot the output from the TukeyHSD() function with a call to the base R function plot():
## Pass in output from TukeyHSD() function
plot(comp)
Note again that the only confidence interval that does not contain 0 (the difference specified by our null hypothesis for pairwise tests) is the one between yellow and white, consistent with the output we observed above.
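The base R plot can have hard-to-read labels when there are many pairs. One possible alternative, sketched here with the mpg data and illustrative object names, is to convert the TukeyHSD() results to a data frame and draw the intervals with ggplot2:

```r
library(ggplot2)  # provides the mpg dataset

cars <- transform(as.data.frame(mpg), class = factor(class))
comp <- TukeyHSD(aov(cty ~ class, data = cars))

## Convert the matrix of results to a data frame, keeping the pair labels
res <- as.data.frame(comp$class)
res$pair <- rownames(res)

## One interval per pair; the dashed line marks a difference of zero
p <- ggplot(res, aes(x = diff, y = pair, xmin = lwr, xmax = upr)) +
  geom_pointrange() +
  geom_vline(xintercept = 0, linetype = "dashed")
p
```

Intervals that do not cross the dashed line correspond to pairs with a statistically significant difference at the chosen confidence level.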
Question 15: Consider the ANOVA models you created in Questions 11-13. For each one in which there was evidence to reject the null hypothesis, perform a post-hoc test to determine between which individual groups there was a statistically significant difference (e.g., between Green and Blue paper, if there was evidence of a difference between groups based on Color).