This assignment will involve reading in a CSV file and performing an analysis. In this first section, we will review material from previous lectures, as well as some of the tools we might need. We will use random data for this example.
enzyme <- data.frame(A = rnorm(12, 3, 2),
                     B = rnorm(12, 5, 2))
boxplot(enzyme)
head(enzyme)
## A B
## 1 1.41252 6.5011
## 2 0.32301 3.6794
## 3 4.17303 3.9509
## 4 4.25864 5.8720
## 5 6.88642 4.5636
## 6 4.14263 7.7066
By looking at the first 6 rows, we see that this dataset has two columns, labeled A and B.
In R syntax, we subset a dataset with data[row, column]. For example, to grab a particular column, we can use the $ operator, type the column name in quotes, or give a number indicating which column we want. We can grab A and B individually by subsetting as follows:
A <- enzyme$A
B <- enzyme$B
## Or this
A <- enzyme[, "A"]
B <- enzyme[, "B"]
A
## [1] 1.41252 0.32301 4.17303 4.25864 6.88642 4.14263 3.53642 6.72361 2.49473
## [10] 4.74439 6.20118 3.38441
B
## [1] 6.50112 3.67938 3.95086 5.87199 4.56364 7.70659 -0.10405 8.62977
## [9] 2.09476 1.96198 2.56961 6.57936
## We could also do this, but I don't recommend it
# A <- enzyme[, 1]
# B <- enzyme[, 2]
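The examples above only use the column position of data[row, column]. As a minimal sketch of the row position, the code below subsets rows by index and by a logical condition. The data frame `demo`, the seed, and the result names are hypothetical stand-ins so that `enzyme`, `A`, and `B` above are left untouched.

```r
## Simulated stand-in for the enzyme data (name and seed are arbitrary)
set.seed(1)
demo <- data.frame(A = rnorm(12, 3, 2),
                   B = rnorm(12, 5, 2))

## First three rows, all columns (leave the column position empty)
first_three <- demo[1:3, ]

## A single value: row 2 of column "B"
one_value <- demo[2, "B"]

## Values of B for the rows where A exceeds its own mean
b_high_a <- demo[demo$A > mean(demo$A), "B"]
```

Leaving a position empty means "keep everything" along that dimension, which is why enzyme[, "A"] returns every row of column A.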
This is helpful if we want to compute a statistic for a specific column. For example, the mean and standard deviation of column A can be found as follows:
A <- enzyme[, "A"]
mean(A)
## [1] 4.0234
sd(A)
## [1] 2.0076
Having grabbed each column separately, we can use these values to perform a t-test
## Assuming equal variances gives the pooled two-sample t-test
t.test(A, B, var.equal = TRUE)
##
## Two Sample t-test
##
## data: A and B
## t = -0.5, df = 22, p-value = 0.62
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4545 1.5005
## sample estimates:
## mean of x mean of y
## 4.0234 4.5004
## Not assuming equal variances gives Welch's t-test
t.test(A, B, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: A and B
## t = -0.5, df = 20.6, p-value = 0.62
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4623 1.5083
## sample estimates:
## mean of x mean of y
## 4.0234 4.5004
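To see what var.equal changes, here is a rough sketch that computes both test statistics by hand and checks them against t.test. The vectors `x` and `y`, the seed, and the helper names are hypothetical; the formulas are the standard pooled-variance and Welch (Satterthwaite) ones.

```r
## Simulated stand-in data (names and seed are arbitrary)
set.seed(1)
x <- rnorm(12, 3, 2)
y <- rnorm(12, 5, 2)
n_x <- length(x)
n_y <- length(y)

## Pooled version: one common variance estimate, df = n_x + n_y - 2
s2_pooled <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
t_pooled <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / n_x + 1 / n_y))

## Welch version: separate variances, approximate degrees of freedom
se_welch <- sqrt(var(x) / n_x + var(y) / n_y)
t_welch <- (mean(x) - mean(y)) / se_welch
df_welch <- se_welch^4 /
  ((var(x) / n_x)^2 / (n_x - 1) + (var(y) / n_y)^2 / (n_y - 1))

## Compare with the built-in results
all.equal(t_pooled, unname(t.test(x, y, var.equal = TRUE)$statistic))
all.equal(df_welch, unname(t.test(x, y, var.equal = FALSE)$parameter))
```

When the two groups have the same sample size, the two standard errors are algebraically identical, so the t statistics agree and only the degrees of freedom differ. This is consistent with the two outputs above.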
If the data were paired, we would pass an additional argument to indicate this
t.test(A, B, paired = TRUE)
##
## Paired t-test
##
## data: A and B
## t = -0.539, df = 11, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4247 1.4706
## sample estimates:
## mean of the differences
## -0.477
We saw in a previous lecture that when the value of \(\sigma^2\) was known, our confidence intervals could be constructed as
\[
\overline{X} \pm z_{\alpha/2} \times \sigma/\sqrt{n}
\]
In the situation when \(\sigma^2\) is unknown, we replace \(\sigma\) with the sample standard deviation \(s\)
\[
\overline{X} \pm z_{\alpha/2} \times s/\sqrt{n}
\]
What we didn’t cover was how to compute the value for \(z_{\alpha/2}\). Assuming that our statistic is normally distributed and that \(\alpha = 0.05\), we can find this value with the qnorm function (q for quantile, norm for normal)
alpha <- 0.05
z <- qnorm(1 - alpha/2)
z
## [1] 1.96
Note: \(1 - \alpha/2\) in this case is 0.975, indicating that we want the 97.5% quantile. Since the normal distribution is symmetric, \(-z\) represents the 2.5% quantile. These values are indicated in the plot below:
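A plot of this kind can be sketched with base graphics as follows. This is only an illustration, not necessarily the code used for the original figure; the axis range and line style are arbitrary choices.

```r
alpha <- 0.05
z <- qnorm(1 - alpha/2)

## Standard normal density with the two critical values marked
curve(dnorm(x), from = -4, to = 4, xlab = "z", ylab = "Density")
abline(v = c(-z, z), lty = 2)

## Area to the left of each critical value
lower_area <- pnorm(-z)  # 0.025
upper_area <- pnorm(z)   # 0.975
```

The function pnorm is the inverse of qnorm: it takes a value and returns the probability to its left, so the area between the two dashed lines is 0.95.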
Using the formula above, we can compute a confidence interval for the mean value of group A
meanA <- mean(A)
sdA <- sd(A)
n <- length(A)
z <- qnorm(1 - alpha/2)
c(meanA - z * sdA / sqrt(n), meanA + z * sdA / sqrt(n))
## [1] 2.8875 5.1593
Since our variance was estimated here, we might also consider using the t-distribution in place of the normal distribution. The critical value is obtained the same way with the function qt (q for quantile, t for t-distribution). However, the qt function also needs an argument for the degrees of freedom, which is \(n-1\)
meanA <- mean(A)
sdA <- sd(A)
n <- length(A)
t <- qt(1 - alpha/2, df = n - 1)
c(meanA - t * sdA / sqrt(n), meanA + t * sdA / sqrt(n))
## [1] 2.7479 5.2990
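The t-based interval is wider because the t critical value is larger than the normal one. As a small sketch, we can compare the two for a few degrees of freedom; the particular values in `dfs` are arbitrary and chosen only for illustration.

```r
alpha <- 0.05

## A few illustrative degrees of freedom (11 matches n = 12 above)
dfs <- c(5, 11, 30, 100, 1000)

t_crit <- qt(1 - alpha/2, df = dfs)
z_crit <- qnorm(1 - alpha/2)

## Side-by-side comparison; the gap shrinks as df grows
data.frame(df = dfs, t_crit = t_crit, z_crit = z_crit, gap = t_crit - z_crit)
```

For large samples the two critical values are nearly identical, so the choice matters most for small samples such as the 12 observations used here.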
Taking the ideas above together, we can perform our own paired t-test by looking at the difference directly
## difference in two groups
diff <- A - B
## Mean, standard deviation, and sample size
mean_d <- mean(diff)
sd_d <- sd(diff)
n <- length(diff)
## Get the appropriate critical value from the t-distribution
t <- qt(1 - 0.05/2, df = n - 1)
## Confidence interval for mean difference
c(mean_d - t * sd_d / sqrt(n), mean_d + t * sd_d / sqrt(n))
## [1] -2.4247 1.4706
If our null hypothesis is that the mean difference is \(0\), we would fail to reject it at the 5% level, since \(0\) lies inside our 95% confidence interval.
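The same ingredients also give the test statistic and p-value of the paired t-test. The sketch below uses simulated stand-in vectors `x` and `y` (hypothetical names and seed) and checks the hand computation against t.test, using pt, the t-distribution counterpart of pnorm.

```r
## Simulated stand-in data (names and seed are arbitrary)
set.seed(1)
x <- rnorm(12, 3, 2)
y <- rnorm(12, 5, 2)

## Differences, their standard error, and the test statistic
d <- x - y
n_d <- length(d)
se_d <- sd(d) / sqrt(n_d)
t_stat <- mean(d) / se_d

## Two-sided p-value: probability in both tails beyond |t_stat|
p_value <- 2 * pt(-abs(t_stat), df = n_d - 1)

## Compare with the built-in paired t-test
fit <- t.test(x, y, paired = TRUE)
all.equal(t_stat, unname(fit$statistic))
all.equal(p_value, fit$p.value)
```

Note that the standard error is the standard deviation of the differences divided by \(\sqrt{n}\), not the standard deviation itself.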
For this assignment, we will be walking through a similar study. You may use the code above as reference, though you are free to use the t.test function in R as well. The course website also includes an .Rmd file for your answer. Please use that document with the Knit option to make your homework submission a PDF.
An investigator wishes to compare enzyme levels in two subpopulations of D. melanogaster, A and B, in a standardized experiment. She independently selects a random sample of 50 individuals from each of the two subpopulations and measures the enzyme level of each individual. The investigator plans to use a parametric approach to analyzing the data. Assume that the distributional assumptions associated with the parametric approach are indeed valid for the purposes of this exercise. The data for this problem can be found on the course website as enzyme.csv. To read the data into R, you can use the following code
## Make sure enzyme.csv is in the same directory as your Rmd file
enzyme <- read.csv("enzyme.csv")
A <- enzyme$A
B <- enzyme$B
What is the name of the parametric statistical procedure the investigator should be using to test the null hypothesis that the mean difference in outcome is zero?
What is the value of the test statistic and its p-value?
How many degrees of freedom are associated with this test?
What is the mean difference in outcome? What is its standard error? For this, create a vector diff <- A-B and use the mean and sd functions to find the relevant statistics of this vector.
What is a 95% confidence interval for the true mean difference in enzyme levels?
What assumptions went into this procedure?
Find the variances of group A and group B. What are they? Should our t-test assume equal variances?
What conclusions would you make for this study?