library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## Better histograms
gh <- function(bins = 10) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

Introduction

This lab explores the paper airplane data generated in class. Part of the lab involves performing an ANOVA, which we will do with the R function aov(). Like t.test(), aov() takes a formula of the form outcome ~ group. For example, to compare city miles per gallon across vehicle classes in the mpg dataset, we would run:

aov(cty ~ class, data = mpg) %>% summary()
##              Df Sum Sq Mean Sq F value              Pr(>F)    
## class         6   2295     383    45.1 <0.0000000000000002 ***
## Residuals   227   1925       8                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

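From this output, we can read off the two degrees of freedom (6 for class and 227 for the residuals), the value of the F statistic (45.1), and a p-value.

Don’t forget to include %>% summary(). Without it, aov() prints only the sums of squares and degrees of freedom, not the F statistic or the p-value.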
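
If we need these quantities as numbers rather than reading them off the printout, one option is to pull them out of the summary table. The following is a minimal sketch, not part of the lab itself; the variable names (fit, tab, df_group, and so on) are illustrative choices.

```r
library(ggplot2)  # provides the mpg dataset

fit <- aov(cty ~ class, data = mpg)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]

df_group <- tab[1, "Df"]        # degrees of freedom for the grouping variable
df_resid <- tab[2, "Df"]        # residual degrees of freedom
f_stat   <- tab[1, "F value"]   # F statistic
p_value  <- tab[1, "Pr(>F)"]    # p-value for the F test

c(df_group = df_group, df_resid = df_resid, F = f_stat, p = p_value)
```

The rows are indexed by position (1 for the grouping variable, 2 for the residuals) because the row names in this table are padded with spaces.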

For this lab, it will also be helpful to use the table() function. Recall that we can create one-way or two-way tables with the following syntax:

## One way
table(mpg$class)
## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62

## Two way
with(mpg, table(class, drv))
##             drv
## class         4  f  r
##   2seater     0  0  5
##   compact    12 35  0
##   midsize     3 38  0
##   minivan     0 11  0
##   pickup     33  0  0
##   subcompact  4 22  9
##   suv        51  0 11
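
When reading a two-way table, it often helps to add totals, convert counts to proportions, and check for empty cells. The sketch below shows one way to do this with base R functions on the same mpg table; the object name tab is an illustrative choice.

```r
library(ggplot2)  # provides the mpg dataset

tab <- with(mpg, table(class, drv))

# Append row and column totals to the table
addmargins(tab)

# Proportion of each drive type within each class (rows sum to 1)
round(prop.table(tab, margin = 1), 2)

# Count the class/drv combinations that never occur in the data
n_empty <- sum(tab == 0)
n_empty
```

Empty cells are worth noticing because they mean some combinations of the two variables were never observed together.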

Plane Data

Here is a curated version of the data collected in class:

planes <- read.csv("https://collinn.github.io/data/sta209_f24_planes.csv")

Introductory Questions

Using the dataset provided, answer the following questions:

Question 1: How many observations are in our dataset? What variables are in this dataset? What values do they have?

Question 2: What is our outcome variable in this data? What variables would we use to try to explain this outcome?

Question 3: Of the predictor variables identified in Question 2, which do you think will be the most helpful in explaining observed variance? Which do you think will be the least helpful?

Data Exploration

These next questions will help us begin exploring our dataset.

Question 4: First, create a boxplot of distance without using any categories. Do you notice anything in this dataset that may be a point of concern?

Question 5: Create boxplots comparing distance with each of the three categorical variables (Design, Paper, and Color). Offer a brief description of what you see in each of these plots.

Question 6: Create a two-way table analyzing the relationship between Design and Paper. What do you notice?

Question 7: Now create a two-way table analyzing the relationship between Design and Color. How does this differ from what we found in Question 6?

Question 8: Based on what you found in Question 6 and Question 7, can you anticipate any problems we may have trying to analyze the relationship between Distance and either paper weight or color?

Question 9: Suppose you were to recreate this study again from the beginning. What other factors in the design of our experiment would you consider changing or controlling for? In other words, can you identify any additional sources of variability that may influence flight distance that are

One-way ANOVA

Here, we will practice analyzing data using the aov() function.

Question 10: State the null hypothesis for testing flight distance across any of the group variables in our dataset.

Question 11: Perform an ANOVA analyzing the relationship between flight distance and paper weight:

  • What are the two degrees of freedom for this test?
  • What value does the F-statistic take?
  • If you were testing our null hypothesis at \(\alpha = 0.05\), what decision would you make?

Question 12: Perform an ANOVA analyzing the relationship between flight distance and paper color:

  • What are the two degrees of freedom for this test?
  • What value does the F-statistic take?
  • If you were testing our null hypothesis at \(\alpha = 0.05\), what decision would you make?

Question 13: Perform an ANOVA analyzing the relationship between flight distance and design:

  • What are the two degrees of freedom for this test?
  • What value does the F-statistic take?
  • If you were testing our null hypothesis at \(\alpha = 0.05\), what decision would you make?

Question 14: Based on Questions 11-13, which variable would you use in a model to try and determine the flight distance of a particular plane? Does this match your expectations from the beginning of the lab?

Post-Hoc Testing

As noted previously, ANOVA tests the null hypothesis that the means of multiple groups are all equal,

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \]

If we perform an ANOVA and reject the null hypothesis, we know only that at least two of the group means differ, not which ones. Post-hoc pairwise testing (post hoc is Latin for “after this” or “after the fact”) can then be used to determine which pairwise differences are likely responsible.
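
To get a sense of how many pairwise comparisons a post-hoc analysis involves, we can count the pairs directly: with k groups there are choose(k, 2) of them. The short sketch below uses four hypothetical group labels (A through D) purely for illustration.

```r
# Hypothetical group labels, used only for illustration
groups <- c("A", "B", "C", "D")
k <- length(groups)

# Number of distinct pairs among k groups: k * (k - 1) / 2
n_pairs <- choose(k, 2)
n_pairs

# List every pair as a single label such as "A-B"
pairs <- combn(groups, 2, FUN = paste, collapse = "-")
pairs
```

The number of pairs grows quickly with k, which is why the p-values from pairwise tests need to be adjusted for multiple comparisons.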

Consider again our dog dataset, in which we wish to test for equality in average speed between dogs of different colors. This is done with the aov() function:

## Read in dogs
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

## This will assign the results to a variable called model
model <- aov(speed ~ color, dogs)
summary(model)
##              Df Sum Sq Mean Sq F value Pr(>F)   
## color         3   1652     551     4.3 0.0053 **
## Residuals   396  50746     128                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we see the sums of squares for the grouping variable and for the residuals, along with an F-statistic and a p-value. If we were testing at the \(\alpha = 0.05\) level, we would reject the null hypothesis, since \(p = 0.0053 < 0.05\).
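
To see how the columns of this table fit together, we can rebuild the F-statistic and p-value by hand from the printed sums of squares and degrees of freedom. This is a sketch based on the rounded values shown above, so the results will match the table only approximately; the variable names are illustrative.

```r
# Values copied from the (rounded) ANOVA table above
ss_color <- 1652;  df_color <- 3
ss_resid <- 50746; df_resid <- 396

# Mean square = sum of squares / degrees of freedom
ms_color <- ss_color / df_color
ms_resid <- ss_resid / df_resid

# F statistic = ratio of the two mean squares
f_stat <- ms_color / ms_resid

# p-value = upper tail area of the F distribution with (3, 396) df
p_value <- pf(f_stat, df1 = df_color, df2 = df_resid, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```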

To determine which pairs of colors differ, we can use the TukeyHSD() function (Tukey's honest significant difference) on the model object we created above:

## Pass in output from aov() function
comp <- TukeyHSD(model)
comp
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = speed ~ color, data = dogs)
## 
## $color
##                 diff       lwr      upr   p adj
## brown-black   1.9612  -2.65664  6.57906 0.69237
## white-black   4.2360  -0.38182  8.85388 0.08529
## yellow-black -2.3968  -5.97373  1.18021 0.31012
## white-brown   2.2748  -3.56635  8.11599 0.74672
## yellow-brown -4.3580  -9.41657  0.70063 0.11889
## yellow-white -6.6328 -11.69139 -1.57418 0.00437

There are a few things to note here:

  1. First, the diff column gives us a point estimate of the difference in means for each pair.
  2. Next, the lwr and upr columns give the lower and upper bounds of a confidence interval for that difference. By default, this is a 95% confidence interval, but we can change this in the TukeyHSD() function by passing in an argument for conf.level.
  3. Finally, the last column gives us an adjusted p-value. That is, rather than asking us to lower the significance threshold (for example, a Bonferroni correction for our six comparisons would use \(\alpha^* = \alpha/6\)) and compare it against the original p-values, it adjusts the p-values so that we can compare them with our regular \(\alpha\). For a given correction method, the conclusions we come to will be the same either way.
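
The two ideas in this list can be sketched in code. The first part below changes conf.level on a stand-in model fit to the mpg data (city mileage by drive type, with drv converted to a factor explicitly). The second part uses three hypothetical raw p-values to show that lowering \(\alpha\) and inflating the p-values lead to the same decisions under a Bonferroni correction. Note that Bonferroni is used here only because it is simple to write down; TukeyHSD() computes its adjusted p-values with a different method based on the studentized range distribution.

```r
library(ggplot2)  # provides the mpg dataset

# Stand-in example: city mileage by drive type
cars <- as.data.frame(mpg)
cars$drv <- factor(cars$drv)
fit <- aov(cty ~ drv, data = cars)

comp95 <- TukeyHSD(fit)                     # default 95% family-wise level
comp99 <- TukeyHSD(fit, conf.level = 0.99)  # wider 99% intervals

width95 <- comp95$drv[, "upr"] - comp95$drv[, "lwr"]
width99 <- comp99$drv[, "upr"] - comp99$drv[, "lwr"]
all(width99 > width95)

# Hypothetical raw p-values from three pairwise tests
raw_p <- c(0.001, 0.012, 0.040)
alpha <- 0.05
m <- length(raw_p)

reject_by_alpha <- raw_p < alpha / m                                  # shrink alpha
reject_by_padj  <- p.adjust(raw_p, method = "bonferroni") < alpha     # inflate p
identical(reject_by_alpha, reject_by_padj)
```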

From this output, we see that the only statistically significant difference is between yellow and white.

Finally, we can plot the output from the TukeyHSD() function with a call to the base R function plot():

## Pass in output from TukeyHSD() function
plot(comp)

Note again that the only confidence interval that does not contain 0 (the value of the difference under the null hypothesis for each pairwise test) is the one between yellow and white, consistent with the output we observed above.
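
The agreement between the adjusted p-values and the confidence intervals can also be checked in code. The sketch below again uses a stand-in model on the mpg data rather than the dogs data, and the names res, significant, and excludes_zero are illustrative. The same pattern should apply to any TukeyHSD() result, with the list element named after the grouping variable.

```r
library(ggplot2)  # provides the mpg dataset

# Stand-in example: city mileage by drive type
cars <- as.data.frame(mpg)
cars$drv <- factor(cars$drv)
comp <- TukeyHSD(aov(cty ~ drv, data = cars))

# The comparisons are stored as a matrix named after the grouping variable
res <- comp$drv

# Pairs flagged by the adjusted p-value at alpha = 0.05
significant <- res[, "p adj"] < 0.05

# Pairs whose 95% interval lies entirely above or entirely below 0
excludes_zero <- res[, "lwr"] > 0 | res[, "upr"] < 0

# Keep only the significant rows
res[significant, , drop = FALSE]

# Both criteria should flag the same pairs
all(significant == excludes_zero)
```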

Question 15: Consider the ANOVA models you created in Questions 11-13. For each one in which there was evidence to reject the null hypothesis, perform a post-hoc test to determine which individual groups differ by a statistically significant amount (e.g., Green and Blue paper, if there was evidence of a difference between groups based on Color).