library(ggplot2)
library(dplyr)

# Prettier graphs
theme_set(theme_bw())

Question 1

Diarrhea is a major public health concern in many underdeveloped countries, particularly for babies, millions of whom die each year from dehydration. The following data come from a controlled, double-blind study of bismuth salicylate (the active ingredient in Pepto Bismol) as therapy for Peruvian infants with diarrhea, with 85 babies receiving bismuth salicylate and 84 receiving placebo. To control for body size, the outcome variable is the volume of stool output per kilogram of body weight (ml/kg).

diarrhea <- read.csv("https://github.com/IowaBiostat/data-sets/raw/main/diarrhea/diarrhea.txt", sep = "\t")

  1. Using ggplot, create a box plot demonstrating the distribution of outcomes for each of our two groups.

  2. Conduct a t-test against the null hypothesis that there is no difference in outcome between treatment and placebo groups.

  3. Determine a 95% confidence interval for the true difference in stool output between babies in the control and treatment groups. Based on this interval, what conclusions would you draw regarding the use of bismuth salicylate as a treatment for infant diarrhea? Explain.
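
The three parts above follow a common pattern in R. The following is a minimal sketch on simulated stand-in data, not the study data; the column names `Group` and `Stool` are assumptions and may not match the columns in diarrhea.txt.

```r
library(ggplot2)

# Simulated stand-in data for illustration only (NOT the diarrhea study data).
# The column names Group and Stool are hypothetical.
set.seed(42)
toy <- data.frame(
  Group = rep(c("Control", "Treatment"), times = c(84, 85)),
  Stool = c(rexp(84, rate = 1 / 250), rexp(85, rate = 1 / 200))
)

# Side-by-side box plots of the outcome by group
box <- ggplot(toy, aes(x = Group, y = Stool)) +
  geom_boxplot() +
  labs(x = "Group", y = "Stool output (ml/kg)")
print(box)

# Two-sample t-test; t.test() defaults to the Welch (unequal variance) version
fit <- t.test(Stool ~ Group, data = toy)
fit$p.value   # p-value for H0: no difference in group means
fit$conf.int  # 95% CI for mean(Control) - mean(Treatment)
```

With the real data, replace `toy`, `Group`, and `Stool` with the data frame and column names shown by `head(diarrhea)`.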

Question 2

The following data include the results of two interventions and a control for young female anorexia patients. Included in these data are pre and post weights for 29 individuals in Cognitive Behavioral Therapy ("CBT"), Family Treatment ("FT"), and Control ("Cont"). Although these data are paired, rather than considering the efficacy within each group, we are interested in assessing the difference in differences between groups.

anorexia <- read.csv("https://collinn.github.io/data/anorexia.txt")

  1. Use mutate to add a new variable called Diff, the difference between the post weight and pre weight observations.
  2. For each of the 3 pairwise groupings of treatments (i.e., “CBT and Control”, “CBT and FT”, and “FT and Control”):
     1. Use filter to create a subset of the original data, excluding the treatment type that is not in the pair (i.e., for “CBT and Control”, you will exclude “FT”).
     2. Perform a two-sample t-test, comparing the Diff value you created in (1) between treatment types at the \(\alpha = 0.05\) level.
     3. Record whether you would Reject or Fail to Reject the null hypothesis, along with the associated p-value.
  3. In the course of our study, we conducted three separate hypothesis tests, but decided each of them individually at level \(\alpha = 0.05\). Conduct the necessary adjustment to control the Family-Wise Error Rate at \(\alpha = 0.05\). How does this impact the conclusions you made in (2)?
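
The workflow in the steps above can be sketched as follows. This is a rough illustration on simulated data, not the anorexia data; the column names `Treat`, `Prewt`, and `Postwt` are assumptions, and the Bonferroni correction shown is only one common way to control the Family-Wise Error Rate.

```r
library(dplyr)

# Simulated toy data for illustration only (NOT the anorexia data).
# The column names Treat, Prewt, and Postwt are hypothetical.
set.seed(1)
toy <- data.frame(
  Treat = rep(c("CBT", "Cont", "FT"), each = 10),
  Prewt = rnorm(30, mean = 82, sd = 5)
)
toy$Postwt <- toy$Prewt + rnorm(30, mean = 2, sd = 6)

# Step 1: add the post-minus-pre difference
toy <- toy %>% mutate(Diff = Postwt - Prewt)

# Step 2: one two-sample t-test per pair of groups
group_pairs <- list(c("CBT", "Cont"), c("CBT", "FT"), c("FT", "Cont"))
p_raw <- sapply(group_pairs, function(pr) {
  pair_data <- toy %>% filter(Treat %in% pr)  # drops the third group
  t.test(Diff ~ Treat, data = pair_data)$p.value
})
names(p_raw) <- sapply(group_pairs, paste, collapse = " vs ")

# Step 3: Bonferroni adjustment (multiplies each p-value by 3, capped at 1)
p_adj <- p.adjust(p_raw, method = "bonferroni")

# Compare decisions at alpha = 0.05 before and after adjustment
data.frame(p_raw, p_adj, reject_raw = p_raw < 0.05, reject_adj = p_adj < 0.05)
```

Comparing adjusted p-values to 0.05 is equivalent to comparing the raw p-values to 0.05 / 3.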

Question 3

In professional basketball games during the 2009-2010 season, when Kobe Bryant of the Los Angeles Lakers shot a pair of free throws, 8 times he missed both, 152 times he made both, 33 times he made only the first shot, and 37 times he made only the second. Is it possible that the successive free throws are independent, or is there evidence to suggest a “hot streak” effect? The data are tabulated in the freethrow matrix below:

# Create freethrow data (copy and paste this into your own R session)
freethrow <- matrix(c(152,33,37,8), nrow = 2, byrow = TRUE)
rownames(freethrow) <- c("Make 1st", "Miss 1st")
colnames(freethrow) <- c("Make 2nd", "Miss 2nd")
print(freethrow)
##          Make 2nd Miss 2nd
## Make 1st      152       33
## Miss 1st       37        8

  1. What is the null hypothesis of this experiment?
  2. Conduct a \(\chi^2\) test of independence at level \(\alpha = 0.05\). What conclusions would you draw from these data?
  3. Test your hypothesis again, this time investigating a difference in proportions. Do you come to the same conclusion?
  4. Do you believe that independence of two categorical variables necessarily implies that there is no difference in proportions? Explain your answer.
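
Both tests in this question can be run directly on the 2x2 table. The following is a minimal sketch using base R's `chisq.test()` and `prop.test()`; the object names `chi` and `prop` are illustrative, and interpreting the output is left to the reader.

```r
# Counts copied from the freethrow table above
freethrow <- matrix(c(152, 33, 37, 8), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Make 1st", "Miss 1st"),
                                    c("Make 2nd", "Miss 2nd")))

# Chi-squared test of independence
# (for 2x2 tables, the Yates continuity correction is applied by default)
chi <- chisq.test(freethrow)
chi$expected  # expected counts under independence
chi$p.value

# Two-sample test of proportions: given a two-column matrix, prop.test()
# treats column 1 as successes and column 2 as failures, so this compares
# P(make 2nd | make 1st) with P(make 2nd | miss 1st)
prop <- prop.test(freethrow)
prop$estimate  # the two sample proportions
prop$p.value
```

Both calls use a continuity correction by default, so on a 2x2 table their p-values should coincide; pass `correct = FALSE` to both to see the uncorrected versions.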