“Summarize summarizes, mutate mutates, at the group level”
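While mutate and summarize do very similar tasks, they differ in one critical way: mutate keeps every row and adds the result as a new column, while summarize collapses each group to a single row.
For this example, we will look at a simple toy dataset consisting of four subjects, each with one value in each of two classes.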
library(dplyr)
dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1, 3, 3, 5, 5, 7, 7, 9))
dat
## subject class val
## 1 A X 1
## 2 A Y 3
## 3 B X 3
## 4 B Y 5
## 5 C X 5
## 6 C Y 7
## 7 D X 7
## 8 D Y 9
Without groups, both mutate and summarize will find the mean of the entire val column. Mutate adds this as an additional column, while summarize returns a single value (both results are data frames, however).
## Mean of all added as column
dat %>% mutate(totalMean = mean(val))
## subject class val totalMean
## 1 A X 1 5
## 2 A Y 3 5
## 3 B X 3 5
## 4 B Y 5 5
## 5 C X 5 5
## 6 C Y 7 5
## 7 D X 7 5
## 8 D Y 9 5
## Mean of all returned as a single value
dat %>% summarize(totalMean = mean(val))
## totalMean
## 1 5
With group_by(subject), mutate computes the mean within each group, in this case the subject. Looking at the resulting data frame, we see for example that A has a mean value of 2 in both of its rows, B a mean value of 4, etc. In some sense, this is redundant, as there is only one subject A and only one mean of val for it. We remove this redundancy with summarize, which returns one row per subject: a column indicating which subject we have summarized, along with its mean value.
## By subject
dat %>% group_by(subject) %>%
  mutate(subjectMean = mean(val))
## # A tibble: 8 × 4
## # Groups: subject [4]
## subject class val subjectMean
## <chr> <chr> <dbl> <dbl>
## 1 A X 1 2
## 2 A Y 3 2
## 3 B X 3 4
## 4 B Y 5 4
## 5 C X 5 6
## 6 C Y 7 6
## 7 D X 7 8
## 8 D Y 9 8
## This keeps subject and mean val
dat %>% group_by(subject) %>%
  summarize(subjectMean = mean(val))
## # A tibble: 4 × 2
## subject subjectMean
## <chr> <dbl>
## 1 A 2
## 2 B 4
## 3 C 6
## 4 D 8
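One way to see how the two results relate is to join the summarized table back onto the original rows, which should reproduce the column that mutate added. The following is a minimal sketch of that check; the names `viaMutate`, `subjectMeans`, and `viaJoin` are made up for illustration, and `left_join` is used here only as a convenient way to line the two results up.

```r
library(dplyr)

dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1, 3, 3, 5, 5, 7, 7, 9))

# Grouped mutate: the subject mean is repeated on every row of that subject
viaMutate <- dat %>%
  group_by(subject) %>%
  mutate(subjectMean = mean(val)) %>%
  ungroup()

# Grouped summarize: one row per subject
subjectMeans <- dat %>%
  group_by(subject) %>%
  summarize(subjectMean = mean(val))

# Joining the summary back onto the original rows repeats each mean per row
viaJoin <- left_join(dat, subjectMeans, by = "subject")

# TRUE when both routes give the same per-row means
all.equal(viaMutate$subjectMean, viaJoin$subjectMean)
```

In other words, the summarize result holds the same information as the mutate column, just without the repetition.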
Here, we see the same thing as above, except now we have grouped by class. As there are only two different classes in this data frame, there are only going to be two unique values for the mean. As before, mutate adds a new entry for each row, while summarize returns a data frame with only two rows, one for each class.
## By class
dat %>% group_by(class) %>%
  mutate(classMean = mean(val))
## # A tibble: 8 × 4
## # Groups: class [2]
## subject class val classMean
## <chr> <chr> <dbl> <dbl>
## 1 A X 1 4
## 2 A Y 3 6
## 3 B X 3 4
## 4 B Y 5 6
## 5 C X 5 4
## 6 C Y 7 6
## 7 D X 7 4
## 8 D Y 9 6
## By class
dat %>% group_by(class) %>%
  summarize(classMean = mean(val))
## # A tibble: 2 × 2
## class classMean
## <chr> <dbl>
## 1 X 4
## 2 Y 6
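Because mutate keeps every row, the group mean can be used directly alongside the original values. As an illustrative sketch (the `deviation` column and the `centered` name are assumptions added here, not part of the example above), each value can be expressed relative to its class mean:

```r
library(dplyr)

dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1, 3, 3, 5, 5, 7, 7, 9))

# Within each class, subtract the class mean from every value.
# classMean is created first, so it can be reused in the same mutate call.
centered <- dat %>%
  group_by(class) %>%
  mutate(classMean = mean(val),
         deviation = val - classMean) %>%
  ungroup()

centered$deviation
## [1] -3 -3 -1 -1  1  1  3  3
```

This kind of row-level calculation is not possible from the summarize result alone, since the individual val entries are no longer present there.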
Here, the combination of subject and class uniquely identifies each row; that is, each row is its own group, so each group mean is simply that row's val. Though mutate and summarize now return the same number of rows, note that summarize has dropped the val column, retaining only information on the groups and their associated means.
## By both (in this case, uniquely identifies row)
dat %>% group_by(subject, class) %>%
  mutate(subClassMean = mean(val))
## # A tibble: 8 × 4
## # Groups: subject, class [8]
## subject class val subClassMean
## <chr> <chr> <dbl> <dbl>
## 1 A X 1 1
## 2 A Y 3 3
## 3 B X 3 3
## 4 B Y 5 5
## 5 C X 5 5
## 6 C Y 7 7
## 7 D X 7 7
## 8 D Y 9 9
## By both (in this case, uniquely identifies row)
dat %>% group_by(subject, class) %>%
  summarize(subClassMean = mean(val))
## # A tibble: 8 × 3
## # Groups: subject [4]
## subject class subClassMean
## <chr> <chr> <dbl>
## 1 A X 1
## 2 A Y 3
## 3 B X 3
## 4 B Y 5
## 5 C X 5
## 6 C Y 7
## 7 D X 7
## 8 D Y 9
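Note the `Groups: subject [4]` line in the last output: after summarizing over two grouping variables, only the last one (class) is dropped, and the result is still grouped by subject. The sketch below shows one way to check this and to ask for an ungrouped result instead. It assumes dplyr 1.0.0 or later, where summarize accepts a `.groups` argument; the names `stillGrouped` and `ungrouped` are illustrative only.

```r
library(dplyr)

dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1, 3, 3, 5, 5, 7, 7, 9))

# Default behavior: the last grouping variable (class) is dropped,
# so the result remains grouped by subject
stillGrouped <- dat %>%
  group_by(subject, class) %>%
  summarize(subClassMean = mean(val))
group_vars(stillGrouped)
## [1] "subject"

# .groups = "drop" removes all grouping from the result
ungrouped <- dat %>%
  group_by(subject, class) %>%
  summarize(subClassMean = mean(val), .groups = "drop")
group_vars(ungrouped)
## character(0)
```

Leftover grouping matters if more dplyr verbs follow, since a later mutate or summarize would then operate per subject rather than on the whole table.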