This assignment involves reading in a CSV file and performing an analysis. In this first section, we will cover some material from previous lectures, as well as some of the tools we might need. We will use random data for this example.
enzyme <- data.frame(A = rnorm(12, 3, 2),
B = rnorm(12, 5, 2))
boxplot(enzyme)
head(enzyme)
## A B
## 1 3.5773 1.9581
## 2 2.1496 9.1969
## 3 4.2003 6.8108
## 4 1.3208 3.9997
## 5 1.8753 5.2256
## 6 7.0209 3.0849
By looking at the first 6 rows, we see that this dataset has two columns, labeled A and B.
In R syntax, we subset a dataset with data[row, column]. For example, to grab a particular column, we can either type the column name in quotes or input a number indicating which column we want. We can grab A and B individually by subsetting as such:
A <- enzyme[, "A"]
B <- enzyme[, "B"]
A
## [1] 3.5773 2.1496 4.2003 1.3208 1.8753 7.0209 6.4058 0.9171 3.5765 7.1262
## [11] 3.7714 2.8221
B
## [1] 1.9581 9.1969 6.8108 3.9997 5.2256 3.0849 4.6084 5.0210 6.1392 6.0966
## [11] 6.5437 3.5769
## We could also do this, but I don't recommend it
# A <- enzyme[, 1]
# B <- enzyme[, 2]
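A third option, common in practice, is the $ operator, which grabs a column by name without quotes. This is equivalent to the bracket subsetting above (a quick sketch, assuming the enzyme data frame defined earlier):

```r
## The $ operator also extracts a single column by name
A <- enzyme$A
B <- enzyme$B
## identical(enzyme$A, enzyme[, "A"]) is TRUE
```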
This is helpful if we want to compute a statistic for a specific column. For example, the mean and standard deviation of column A can be found as follows:
A <- enzyme[, "A"]
mean(A)
## [1] 3.7303
sd(A)
## [1] 2.1378
Though we do not need to do so in this homework, it’s helpful to know that we can subset rows in a similar way:
## grab second row
enzyme[2, ]
## A B
## 2 2.1496 9.1969
Having grabbed each column separately, we can use these values to perform a t-test:
## Assuming equal variances gives the two-sample t-test
t.test(A, B, var.equal = TRUE)
##
## Two Sample t-test
##
## data: A and B
## t = -1.75, df = 22, p-value = 0.094
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.18743 0.27105
## sample estimates:
## mean of x mean of y
## 3.7303 5.1885
## Assuming unequal variances gives Welch's t-test
t.test(A, B, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: A and B
## t = -1.75, df = 21.8, p-value = 0.094
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.18834 0.27197
## sample estimates:
## mean of x mean of y
## 3.7303 5.1885
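To decide between these two versions, it helps to look at the sample variances themselves; R also provides a formal F test of equal variances via var.test. A quick sketch, assuming A and B as defined above:

```r
## Sample variance of each group
var(A)
var(B)
## Formal F test of the null hypothesis that the variances are equal
var.test(A, B)
```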
If the data were paired, we would pass an additional argument to indicate this:
t.test(A, B, paired = TRUE)
##
## Paired t-test
##
## data: A and B
## t = -1.65, df = 11, p-value = 0.13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.40653 0.49015
## sample estimates:
## mean of the differences
## -1.4582
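Rather than reading numbers off the printed output, the result of t.test can be saved to an object and its pieces extracted directly; t.test returns a list (of class "htest"). For example:

```r
## Save the test result instead of just printing it
res <- t.test(A, B, paired = TRUE)
res$statistic   # the t value
res$parameter   # degrees of freedom
res$p.value     # the p-value
res$conf.int    # the confidence interval
res$estimate    # mean of the differences
```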
We saw in a previous lecture that when the value of \(\sigma^2\) was known, our confidence intervals could be constructed as
\[ \overline{X} \pm z_{\alpha/2} \times \sigma/\sqrt{n} \] In the situation when \(\sigma^2\) is unknown, we replace \(\sigma\) with the sample standard deviation \(s\)
\[
\overline{X} \pm z_{\alpha/2} \times s/\sqrt{n}
\] What we didn’t cover was how to compute the value for \(z_{\alpha/2}\). Assuming that our statistic is normally distributed and that \(\alpha = 0.05\), we can find this value with the qnorm function (q for quantile, norm for normal):
alpha <- 0.05
z <- qnorm(1 - alpha/2)
z
## [1] 1.96
Note: \(1 - \alpha/2\) in this case is 0.975, indicating that we want the 97.5% quantile. Since the normal distribution is symmetric, \(-z\) represents the 2.5% quantile. These values are indicated in the plot below:
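A plot like the one referenced can be produced with base graphics; a minimal sketch that draws the standard normal density and marks \(\pm z\) with dashed lines:

```r
## Standard normal density with the alpha/2 quantiles marked
alpha <- 0.05
z <- qnorm(1 - alpha/2)
x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l", xlab = "z", ylab = "Density")
abline(v = c(-z, z), lty = 2)
```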
Using the formula above, we can compute a confidence interval for the mean value of group A
meanA <- mean(A)
sdA <- sd(A)
n <- length(A)
z <- qnorm(1 - alpha/2)
c(meanA - z * sdA / sqrt(n), meanA + z * sdA / sqrt(n))
## [1] 2.5207 4.9398
Since our variance was estimated here, we might also consider doing the same for a t-test. The critical value is obtained the same way with the function qt (q for quantile, t for t-distribution). However, for the qt function, we also need to include an argument for the degrees of freedom, which is \(n-1\):
meanA <- mean(A)
sdA <- sd(A)
n <- length(A)
t <- qt(1 - alpha/2, df = n - 1)
c(meanA - t * sdA / sqrt(n), meanA + t * sdA / sqrt(n))
## [1] 2.3720 5.0886
Taking the ideas above together, we can perform our own paired t-test by looking at the differences directly:
## difference in two groups
diff <- A - B
## Mean, standard error, and sample size
mean_d <- mean(diff)
sd_d <- sd(diff)
n <- length(diff)
## Get appropriate t-statistic
t <- qt(1 - 0.05/2, df = n - 1)
## Confidence interval for mean difference
c(mean_d - t * sd_d / sqrt(n), mean_d + t * sd_d / sqrt(n))
## [1] -3.40653 0.49015
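Continuing in the same spirit, we can also compute the test statistic and p-value by hand using the pt function (p for probability, i.e., the CDF of the t-distribution); this should agree with the paired t.test output above. A sketch, assuming mean_d, sd_d, and n from the chunk above:

```r
## Test statistic for H0: mean difference = 0
t_stat <- mean_d / (sd_d / sqrt(n))
## Two-sided p-value from the t distribution with n - 1 df
p_val <- 2 * pt(-abs(t_stat), df = n - 1)
t_stat
p_val
```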
If our null hypothesis were that the mean difference is \(0\), we would fail to reject it, as this value is included in our final confidence interval.
For this assignment, we will walk through a similar study. You may use the code above as a reference, though you are free to use the t.test function in R as well. The course website also includes an .Rmd file for your answers. Please use that document with the Knit option to produce a pdf for your homework submission.
An investigator wishes to compare enzyme levels in two subpopulations of D. melanogaster, A and B, in a standardized experiment. He independently selects a random sample of 50 individuals from each of the two subpopulations and measures the enzyme level of each individual. The investigator plans to use a parametric approach to analyzing the data. Assume that the distributional assumptions associated with the parametric approach are indeed valid for the purposes of this exercise. The data for this problem can be found on the course website as enzyme.csv.
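Reading a CSV file into a data frame is done with read.csv. A minimal sketch, which assumes enzyme.csv has been downloaded to your working directory (check the actual column names with head, since they come from the file itself):

```r
## Read the data in; assumes enzyme.csv is in the working directory
enzyme <- read.csv("enzyme.csv")
## Inspect the first few rows and the structure of the data
head(enzyme)
str(enzyme)
```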
1. What is the name of the parametric statistical procedure the investigator should be using to test the null hypothesis that the mean difference in outcome is zero?
2. What is the value of the test statistic, and what is its p-value?
3. How many degrees of freedom are associated with this test?
4. What is the mean difference in outcome? What is its standard error?
5. What is a 95% confidence interval for the true mean difference in enzyme levels?
6. What assumptions went into this procedure?
7. Find the variance of group A and group B. What are they? Should our t-test assume equal variances?
8. What conclusions would you make for this study?
9. Assume now that this study was done using matched pairs. Using the code above (that is, without using t.test), create a difference vector between groups A and B and construct a 95% confidence interval. Be sure to use the correct degrees of freedom in finding the critical value from the qt function.
10. Modify the code you wrote in problem 9, using the qnorm function instead of qt to find the critical values. That is, instead of assuming our data follows a \(t\) distribution, assume that it follows a standard normal. Again, compute the confidence interval. How does it compare to the confidence interval from problem 9? Is it larger or smaller? Why do you think that is? (Hint: When does a \(t\) distribution approximate a normal distribution, and what impact does that have on the size of our confidence intervals?)