Mutate and Summarize

“Summarize summarizes, mutate mutates, at the group level.”

Both mutate and summarize can compute the same summary statistic, but they differ in what they return:

Mutate retains every row and column of the original dataset and adds a new column holding the summary, repeated across the rows of each group.
Summarize drops all columns except those that define the groups and adds a column for the summary. For a single-value summary such as a mean, it returns one row per group.

For this example, we will look at a simple toy dataset consisting of four subjects (A to D), each with one value in each of two classes (X and Y).

library(dplyr)

dat <- data.frame(subject = rep(LETTERS[1:4], each = 2), 
                  class = rep(c("X", "Y"), times = 4), 
                  val = c(1,3,3,5,5,7,7,9))

dat

##   subject class val
## 1       A     X   1
## 2       A     Y   3
## 3       B     X   3
## 4       B     Y   5
## 5       C     X   5
## 6       C     Y   7
## 7       D     X   7
## 8       D     Y   9

Without groups

Without groups, both mutate and summarize compute the mean of the entire val column. Mutate adds this as an additional column, repeating it on every row, while summarize returns a single value. Both results are data frames, however.

## Mean of all added as column
dat %>% mutate(totalMean = mean(val))

##   subject class val totalMean
## 1       A     X   1         5
## 2       A     Y   3         5
## 3       B     X   3         5
## 4       B     Y   5         5
## 5       C     X   5         5
## 6       C     Y   7         5
## 7       D     X   7         5
## 8       D     Y   9         5

## Mean of all returned as a single row
dat %>% summarize(totalMean = mean(val))

##   totalMean
## 1         5
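
The claim that both results are data frames, differing only in shape, can be checked directly. The following is a minimal sketch that rebuilds the toy data so it runs on its own; the names withMutate and withSummarize are illustrative only.

```r
library(dplyr)

# Same toy data as above
dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1,3,3,5,5,7,7,9))

withMutate <- dat %>% mutate(totalMean = mean(val))
withSummarize <- dat %>% summarize(totalMean = mean(val))

## Both results are data frames
is.data.frame(withMutate)
is.data.frame(withSummarize)

## Mutate keeps the original 8 rows and adds a 4th column;
## summarize collapses everything to 1 row and 1 column
dim(withMutate)
dim(withSummarize)

## pull() extracts the summary as a plain numeric value
withSummarize %>% pull(totalMean)
```

If only the number is needed, pull() is one way to get it out of the one-row data frame that summarize returns.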

Grouping by subject

Here, mutate computes the mean within each group, in this case each subject. In the resulting data frame, subject A has a mean of 2 in both of its rows, B has a mean of 4, and so on. This is redundant, since each subject has only one mean, yet it appears once per row. Summarize removes the redundancy by returning one row per subject: a column identifying the subject and a column with that subject's mean.

## By subject
dat %>% group_by(subject) %>% 
  mutate(subjectMean = mean(val))

## # A tibble: 8 × 4
## # Groups:   subject [4]
##   subject class   val subjectMean
##   <chr>   <chr> <dbl>       <dbl>
## 1 A       X         1           2
## 2 A       Y         3           2
## 3 B       X         3           4
## 4 B       Y         5           4
## 5 C       X         5           6
## 6 C       Y         7           6
## 7 D       X         7           8
## 8 D       Y         9           8

## By subject: keeps only subject and the mean of val
dat %>% group_by(subject) %>% 
  summarize(subjectMean = mean(val))

## # A tibble: 4 × 2
##   subject subjectMean
##   <chr>         <dbl>
## 1 A                 2
## 2 B                 4
## 3 C                 6
## 4 D                 8
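
One way to see the redundancy is to collapse the mutate result to one row per subject and compare it with the summarize result. This is a sketch, assuming dplyr's distinct() is used to drop the repeated rows; viaMutate and viaSummarize are illustrative names.

```r
library(dplyr)

# Same toy data as above
dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1,3,3,5,5,7,7,9))

## Mutate, then keep one row per subject/mean combination
viaMutate <- dat %>%
  group_by(subject) %>%
  mutate(subjectMean = mean(val)) %>%
  ungroup() %>%
  distinct(subject, subjectMean)

## Summarize directly
viaSummarize <- dat %>%
  group_by(subject) %>%
  summarize(subjectMean = mean(val))

## Compare the two as plain data frames
all.equal(as.data.frame(viaMutate), as.data.frame(viaSummarize))
```

With this data the two tables should match, which is the sense in which summarize removes what mutate repeats.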

Grouping by class

Here, we see the same thing as above, except now we have grouped by class. As there are only two different classes in this data frame, there are only going to be two unique values for the mean. As before, mutate adds a new entry for each row, while summarize returns a data frame with only two rows, one for each class.

## By class
dat %>% group_by(class) %>% 
  mutate(classMean = mean(val))

## # A tibble: 8 × 4
## # Groups:   class [2]
##   subject class   val classMean
##   <chr>   <chr> <dbl>     <dbl>
## 1 A       X         1         4
## 2 A       Y         3         6
## 3 B       X         3         4
## 4 B       Y         5         6
## 5 C       X         5         4
## 6 C       Y         7         6
## 7 D       X         7         4
## 8 D       Y         9         6

## By class
dat %>% group_by(class) %>% 
  summarize(classMean = mean(val))

## # A tibble: 2 × 2
##   class classMean
##   <chr>     <dbl>
## 1 X             4
## 2 Y             6
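
The relationship also runs the other way: joining the two-row summarize result back onto the original data gives the same table as the grouped mutate. This is a sketch using dplyr's left_join(); classMeans, joined and mutated are illustrative names.

```r
library(dplyr)

# Same toy data as above
dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1,3,3,5,5,7,7,9))

## One row per class
classMeans <- dat %>%
  group_by(class) %>%
  summarize(classMean = mean(val))

## Attach each class mean to every row of that class
joined <- left_join(dat, classMeans, by = "class")

## The grouped mutate from above, with the grouping removed
mutated <- dat %>%
  group_by(class) %>%
  mutate(classMean = mean(val)) %>%
  ungroup()

## Compare the two as plain data frames
all.equal(as.data.frame(joined), as.data.frame(mutated))
```

In practice the grouped mutate is the shorter route; the join is shown only to make the connection between the two outputs explicit.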

Grouping by class and subject

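Here, the combination of subject and class uniquely identifies each row, so each row has its own associated mean, equal to its val. Though the mutate and summarize results now have the same number of rows, note that summarize has dropped the val column, retaining only the grouping columns and their associated means. Note also that summarize removes only the last grouping variable, so its result is still grouped by subject, as the Groups line in its output shows.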

## By both (in this case, uniquely identifies row)
dat %>% group_by(subject, class) %>% 
  mutate(subClassMean = mean(val))

## # A tibble: 8 × 4
## # Groups:   subject, class [8]
##   subject class   val subClassMean
##   <chr>   <chr> <dbl>        <dbl>
## 1 A       X         1            1
## 2 A       Y         3            3
## 3 B       X         3            3
## 4 B       Y         5            5
## 5 C       X         5            5
## 6 C       Y         7            7
## 7 D       X         7            7
## 8 D       Y         9            9

## By both (in this case, uniquely identifies row)
dat %>% group_by(subject, class) %>% 
  summarize(subClassMean = mean(val))

## # A tibble: 8 × 3
## # Groups:   subject [4]
##   subject class subClassMean
##   <chr>   <chr>        <dbl>
## 1 A       X                1
## 2 A       Y                3
## 3 B       X                3
## 4 B       Y                5
## 5 C       X                5
## 6 C       Y                7
## 7 D       X                7
## 8 D       Y                9
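
The leftover grouping by subject in the summarize output can be inspected with group_vars() and removed in the same call. This is a sketch that assumes dplyr 1.0.0 or later, where summarize accepts a .groups argument; stillGrouped and ungrouped are illustrative names.

```r
library(dplyr)

# Same toy data as above
dat <- data.frame(subject = rep(LETTERS[1:4], each = 2),
                  class = rep(c("X", "Y"), times = 4),
                  val = c(1,3,3,5,5,7,7,9))

## Default: only the last grouping variable (class) is dropped
stillGrouped <- dat %>%
  group_by(subject, class) %>%
  summarize(subClassMean = mean(val))

group_vars(stillGrouped)

## .groups = "drop" returns an ungrouped result
ungrouped <- dat %>%
  group_by(subject, class) %>%
  summarize(subClassMean = mean(val), .groups = "drop")

group_vars(ungrouped)
```

Dropping the groups explicitly avoids surprises when later steps, such as another mutate, would otherwise be computed per subject.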