STA-290 HW 8

library(ggplot2)
library(dplyr)

# Prettier graphs
theme_set(theme_bw())

Question 1

Reconsider the anorexia data that we investigated in Homework 7:

anorexia <- read.csv("https://collinn.github.io/data/anorexia.txt")

Part A: Use the mutate function to again create a variable called Diff that records the difference between pre and post weights.
Part B: State the null hypothesis for testing the difference in pre and post weights across the groups considered in the dataset.
Part C: Perform an ANOVA for the hypothesis stated in Part B. What do you conclude?
Part D: Use post-hoc testing to determine if there are any pairwise differences between these groups. How do your findings here compare with the conclusions you had in Homework 7?
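
As a reminder of the general workflow, here is a minimal sketch of a one-way ANOVA followed by post-hoc comparisons. It deliberately uses the built-in PlantGrowth data as a stand-in rather than the anorexia data, and it assumes Tukey's HSD as the post-hoc procedure; if the course used a different method (for example, adjusted pairwise t-tests), substitute that instead.

```r
# Illustration on the built-in PlantGrowth data (NOT the homework data):
# weight is numeric, group is a factor with levels ctrl, trt1, trt2
data(PlantGrowth)

# One-way ANOVA: tests the null hypothesis that all group means are equal
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)

# Tukey's HSD (an assumed choice of post-hoc test): all pairwise
# differences in group means, with a family-wise error adjustment
tukey <- TukeyHSD(fit_aov)
tukey
```

Each row of the Tukey output corresponds to one pair of groups, so three groups give three pairwise comparisons.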

Question 2

This question will again consider the mtcars dataset built into R:

data(mtcars)

We will be investigating the relationship between the weight of a car (independent variable) and its miles per gallon (dependent variable). In addition to this, we will also be using the number of carburetors as a second independent variable.

Part A: Create a linear model predicting mpg with the covariates wt and carb. Based on the results, does it appear that the number of carburetors has a relationship with fuel economy (mpg)?
Part B: By default, carb is stored in the dataset as an integer value. Use the mutate function to create a new variable in the mtcars dataset called carb_factor, defined as carb_factor = factor(carb). This will turn the new variable into a categorical one instead of an integer.
Part C: Create a new linear model, this time predicting mpg with wt and carb_factor. What has changed this time? Specifically, what do the covariates in the new model represent, and how is this different from what we saw in Part A? (Hint: how do the estimates for carb_factor change as the number of carburetors increases?)
Part D: Based on your assessment in Part C, which of these two models do you think is more appropriate for predicting miles per gallon? In other words, does the number of carburetors appear to make more sense as a continuous variable or a categorical one?
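
To see how R treats the same variable differently depending on its type, here is a small sketch on the built-in ToothGrowth data (not mtcars), where dose takes the values 0.5, 1, and 2. The variable name dose_factor is made up for this illustration.

```r
library(dplyr)

# Illustration on the built-in ToothGrowth data (NOT the homework data)
data(ToothGrowth)

# dose_factor is a hypothetical name chosen for this example
tg <- ToothGrowth %>% mutate(dose_factor = factor(dose))

# Numeric covariate: a single slope for each one-unit increase in dose
fit_num <- lm(len ~ dose, data = tg)
coef(fit_num)

# Factor covariate: one indicator per non-reference level, each measuring
# the difference from the reference level (dose = 0.5)
fit_fac <- lm(len ~ dose_factor, data = tg)
coef(fit_fac)
```

The numeric version forces the effect of dose to follow a straight line, while the factor version lets each level have its own estimate.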

Question 3

Included below are data from 70 Hollywood films released between 2007 and 2001. Movies in this dataset include Action and Comedy films from two major studios, Fox and Paramount. The plot below illustrates the total sales over a film’s opening weekend, with a color aesthetic to indicate the number of theaters in which the film was shown: dark red corresponds to a film showing in a large number of theaters, while dark blue indicates that it was shown in relatively few theaters.

movies <- read.csv("https://collinn.github.io/data/hollywood.csv")

movies <- subset(movies, LeadStudio %in% c("Paramount", "Fox") & 
                               Genre %in% c("Action", "Comedy"))

ggplot(movies, aes(Genre, OpeningWeekend, color = TheatersOpenWeek)) + 
  geom_jitter(width = 0.1, size = 3) + 
  facet_wrap(~LeadStudio) + 
  scale_color_continuous(type = "viridis", option = "H") +
  geom_smooth(method = lm, se = FALSE) + 
  scale_y_continuous(breaks = seq(0, 140, by = 10))

Below is summary information for a linear regression model with revenue from the opening weekend (OpeningWeekend) serving as the dependent variable and with film studio (LeadStudio) and genre (Genre) serving as the independent variables.

> lm(OpeningWeekend ~ Genre + LeadStudio, movies) %>% summary()

Coefficients:
                    Estimate Std. Error t value      Pr(>|t|)    
(Intercept)            36.15       4.81    7.52 0.00000000018 ***
GenreComedy           -25.69       5.78   -4.44 0.00003427689 ***
LeadStudioParamount    14.69       5.94    2.47         0.016 *  

Residual standard error: 24.2 on 67 degrees of freedom
Multiple R-squared:  0.288, Adjusted R-squared:  0.266 
F-statistic: 13.5 on 2 and 67 DF,  p-value: 0.0000116

You will use these plots and summary data to answer the following questions:

Part A: Provide an interpretation of the intercept of this model.

Part B: Again using the summary information, find the predicted opening weekend revenue for each genre/studio combination (e.g., the predicted opening revenue for a Comedy film from Fox).
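
For reference, the sketch below shows how predictions arise in a model with only categorical covariates, using the built-in warpbreaks data rather than the movies data. Each prediction is the intercept plus whichever indicator coefficients apply to that combination; the object names fit_wb and grid are made up for this illustration.

```r
# Illustration on the built-in warpbreaks data (NOT the homework data):
# wool has levels A, B and tension has levels L, M, H
data(warpbreaks)

fit_wb <- lm(breaks ~ wool + tension, data = warpbreaks)
coef(fit_wb)

# Every combination of the two categorical covariates
grid <- expand.grid(wool = levels(warpbreaks$wool),
                    tension = levels(warpbreaks$tension))

# Predicted value for each combination; the reference combination
# (wool A, tension L) is predicted by the intercept alone
grid$predicted <- predict(fit_wb, newdata = grid)
grid
```

Comparing the rows of grid with coef(fit_wb) shows that, for example, the prediction for wool B with tension L is the intercept plus the woolB coefficient.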

Part C: We are now interested in determining if the variable for the number of theaters showing a film on opening weekend (TheatersOpenWeek) should be included in our model. We will do this by plotting the residuals of the model above against this omitted variable. Determine which of the plots below shows the correct association between the model residuals and the number of theaters on opening weekend. Include 1-2 sentences to justify your answer.

fit <- lm(OpeningWeekend ~ Genre + LeadStudio, movies)
movies$Residuals <- fit$residuals
p1 <- ggplot(movies, aes(TheatersOpenWeek, Residuals)) + 
  geom_point(size = 2) + xlab("Theaters on Opening Weekend") +
  ggtitle("Plot A")
p2 <- ggplot(movies, aes(TheatersOpenWeek, -Residuals)) + 
  geom_point(size = 2) + xlab("Theaters on Opening Weekend") +
  ylab("Residuals") + ggtitle("Plot B")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Part D: Below is the updated model for predicting revenue on opening weekend, now including the variable for the number of theaters:

> lm(OpeningWeekend ~ Genre + LeadStudio + TheatersOpenWeek, movies) %>% summary()


Coefficients:
                     Estimate Std. Error t value    Pr(>|t|)    
(Intercept)         -21.48761    9.95081   -2.16      0.0345 *  
GenreComedy         -13.69726    4.99129   -2.74      0.0078 ** 
LeadStudioParamount   8.74991    4.82823    1.81      0.0745 .  
TheatersOpenWeek      0.01804    0.00287    6.28 0.000000031 ***

Residual standard error: 19.3 on 66 degrees of freedom
Multiple R-squared:  0.554, Adjusted R-squared:  0.534 
F-statistic: 27.3 on 3 and 66 DF,  p-value: 0.0000000000133

Consider the two linear models from this problem, both with and without the variable TheatersOpenWeek. Based on the summary information in the output, which would you prefer to use to predict revenue on opening weekend? Briefly justify your answer.
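
If you want to see where the relevant quantities live in R, the sketch below compares two nested models on the built-in trees data (not the movies data). It pulls out adjusted R-squared and the residual standard error from each fit; the nested F-test via anova() is shown as one additional option and may go beyond what the course expects here. The names fit_small and fit_large are made up for this illustration.

```r
# Illustration on the built-in trees data (NOT the homework data)
data(trees)

fit_small <- lm(Volume ~ Girth, data = trees)
fit_large <- lm(Volume ~ Girth + Height, data = trees)

# Adjusted R-squared penalizes extra covariates, so it is a fairer
# comparison between models of different sizes than plain R-squared
c(small = summary(fit_small)$adj.r.squared,
  large = summary(fit_large)$adj.r.squared)

# Residual standard error: typical size of a prediction error
c(small = sigma(fit_small), large = sigma(fit_large))

# Optional: F-test for whether the added covariate improves the fit
anova(fit_small, fit_large)
```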