Lab 6 – Practice with dplyr and ggplot2

library(ggplot2)
library(dplyr)

## Less uggo plots
theme_set(theme_bw())

Introduction

This lab serves primarily as a set of practice problems to hone our dplyr and ggplot skills.

dplyr Practice

Question 1

This question uses the penguins dataset.

penguins <- read.csv("https://collinn.github.io/data/penguins.csv")

Using the appropriate dplyr functions, first filter the penguins dataset to include only observations in which the variable sex takes the value “male” or “female” (a handful of observations are simply missing entries). Then, create a new variable in this dataset called size that takes the value “Small” if the penguin’s body mass is less than the median and the value “Big” if the body mass is greater than or equal to the median.
Create a boxplot illustrating the relationship between bill depth, size, and sex. Do small or large penguins tend to have deeper bills? Which group appears to have the greatest amount of variability?
The code below will take the variable size and re-code it as a factor variable. The utility here comes in the levels argument: doing this before creating a plot allows you to modify the order in which the values appear in a plot.

penguins <- mutate(penguins, size = factor(size, levels = c("Small", "Big")))

Recreate the plot and compare it to what you saw in (2.). What has changed? Note that this will work for all future categorical variables when creating ggplots.
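As a small illustration of why the levels argument matters, the sketch below uses a made-up vector (toy_size is hypothetical and not part of the penguins data). By default, factor() sorts the levels alphabetically, which is why “Big” would otherwise come before “Small” on an axis or in a legend.

```r
## Hypothetical toy vector, used only for illustration
toy_size <- c("Small", "Big", "Big", "Small")

## Default: levels are sorted alphabetically
default_levels <- levels(factor(toy_size))

## Explicit: levels appear in the order we provide
custom_levels <- levels(factor(toy_size, levels = c("Small", "Big")))

print(default_levels)
print(custom_levels)
```

ggplot uses the level order when it lays out a categorical variable, so changing the levels changes the plot without changing the data.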

Question 2

This question will use the titanic dataset.

data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

Recreate the plot below by refactoring the levels as we did in Question 1. Pay special attention to the order of values presented for Sex, Class, and Survived.

You can modify the colors by adding scale_fill_brewer(palette = "Greens") to your ggplot. See ?scale_fill_brewer() to check out other palettes, if you would like.

Question 3

This question uses the college dataset.

## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")

Here, we are interested in answering the question: “Which region has the largest outlier, relative to the other schools in that region, for median ACT score?”

First, standardize the ACT_median variable by region and then create the appropriate boxplot showing the relationship between region and standardized median ACT score (Hint: how do you find standardized values of a variable?). Which region appears to have the greatest outlier?
Next, create a summary of the college dataset that shows the maximum and minimum median ACT values for each region. Which region has the largest median ACT? Which has the smallest?
Compare your answers from (2) with what you found in (1). Does having the largest/smallest ACT values necessarily result in having outliers? Explain what is happening.

Question 4

This question uses the tips dataset.

tips <- read.csv("https://collinn.github.io/data/tips.csv")

Begin by using the appropriate dplyr functions to filter the dataset to only include bills that occurred on either Saturday or Sunday and in which the size of the party was greater than 1.
Create a new variable, tipPercent, that finds what percentage of the total bill was offered as a tip.
Construct a summary that includes the average tipping percentage of an individual by sex and smoker status. Along with this summary, also include the number of individuals making up each group (n()).
Create the appropriate plot to illustrate the relationship between total bill (x-axis) and tip percent (y-axis). What type of relationship do you see? Would Pearson’s correlation or Spearman’s correlation be more appropriate to describe this?

Extending ggplot

This section will cover some common extensions to our basic ggplots. Briefly, this will include:

Creating side-by-side plots
Modifying the plot without aesthetics, as well as introducing a handful of new modifications
Introducing geom_jitter()
Combining data summaries into additional layers.

Each section will serve to introduce a new idea or technique, each relatively simple in isolation. Our ultimate goal, however, will be the integration of data summaries into our plots.

Side-by-side plots

There are often times where we will want to create an arrangement of plots that can be presented adjacent to one another. A convenient method for doing so utilizes the gridExtra package, which can be installed by copying and pasting the code below into your console. Once the package is installed, you only ever need to load the package with the library() function, just as we do for ggplot2 and dplyr.

## Copy this and run in the console once to install package
# then delete -- DO NOT PUT IN YOUR RMARKDOWN FILE
if (!require(gridExtra)) {
  install.packages("gridExtra")
}

## Once installed,  load package like you do with others
library(gridExtra)

Typically when we create a ggplot in an R code chunk it will immediately print out the associated graphic:

ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point()

However, just as with any other object in R (linear models, data.frames, etc.), we can assign our ggplot to a variable name; doing so will stop the plot from printing out immediately.

## Note that nothing prints after this code chunk
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point()

If we do want to display the plot, it is as simple as writing out the variable name:

p1
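To make the idea of a plot as an object a little more concrete, here is a minimal sketch (the name p_saved is only for illustration). It stores a plot, inspects its class, and then displays it with an explicit print() call.

```r
library(ggplot2)

## Assigning the plot stores it; nothing is drawn yet
p_saved <- ggplot(mpg, aes(displ, cty, color = cyl)) +
  geom_point()

## The stored object can be inspected like any other R object
class(p_saved)

## Writing the name on its own line is shorthand for an explicit print()
print(p_saved)
```

The explicit print() matters inside loops and functions, where writing the variable name alone does not display anything.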

In order to arrange our plots side-by-side, we need to start by saving each of them to a new variable. We can then use the grid.arrange() function included in the gridExtra package to specify how we would like them aligned. Here, we revisit a problem from a previous lab where we compare how the color aesthetic works when a quantitative variable is coded as a factor. Placing them side-by-side provides for easy comparison:

p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point() + 
  theme(legend.position = "bottom") # move legend to bottom

p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) + 
  geom_point() + 
  theme(legend.position = "bottom")

## Print them out in a single row
grid.arrange(p1, p2, nrow = 1)

You may find that your plots are a bit scrunched. You can remedy this by using fig.width = 8 (or any other number) as a chunk option.

p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point() + 
  theme(legend.position = "bottom") # move legend to bottom

p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) + 
  geom_point() + 
  theme(legend.position = "bottom")

## This should look a bit cleaner
grid.arrange(p1, p2, nrow = 1)
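A chunk option is written in the chunk header, for example {r, fig.width = 8}, so it does not show up in the code itself. If you would rather set the size once for every chunk, one possible approach is knitr's opts_chunk$set(), typically placed in a setup chunk near the top of the file. The values below (8 and 4) are arbitrary choices for illustration.

```r
library(knitr)

## Apply the same figure size to every chunk that follows
opts_chunk$set(fig.width = 8, fig.height = 4)

## Check the value currently in effect
opts_chunk$get("fig.width")
```

An option given in an individual chunk header still overrides the global setting for that chunk.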

Question 5

Using grid.arrange(), reproduce the two plots you created in Q1 to show side-by-side how using factor() and levels can allow you to rearrange your plots.

Modification without aesthetics

The logic of ggplot is such that we can map variables in our datasets to visual aspects of the plots with the use of the aesthetics function aes(). However, it is also possible to modify plots directly by specifying values for an aesthetic outside of aes(). Consider the collection of options below:

ggplot(mpg, aes(displ, cty)) + 
  geom_point(color = 'blue', size = 6, shape = 17, alpha = 0.25)

We have

color, which we assign directly to be blue
size, which will change the size of the points represented
shape, specifying shape, a collection of which can be found here
alpha, taking values between 0 and 1, which determine the amount of transparency

There are two additional things that are important to note here:

We have defined these attributes inside of the geom_point() function, meaning that they will only apply to those points being created in that layer. If we were to add them to ggplot instead, they would apply to every single layer included in the plot (more on this soon).
Because they are not inside of aes(), they are interpreted directly, i.e., ‘blue’ maps to the color blue

This last point is crucial. If we were to add these modifications inside of aes(), ggplot would treat them as data to be mapped rather than as literal values: the string ‘blue’ becomes a categorical variable with a single value, and ggplot assigns it a color from its default palette. For example, see what happens to color when we put it in aes():

## Adds a legend for "blue" and makes it reddish
ggplot(mpg, aes(displ, cty)) + 
  geom_point(aes(color = 'blue'))
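One way to check the difference without looking at the picture is to inspect the colors ggplot actually computes for each layer. The sketch below uses ggplot_build(), which returns the data used for drawing; the names p_const and p_mapped are only for illustration.

```r
library(ggplot2)

## Constant set outside aes(): used as-is
p_const <- ggplot(mpg, aes(displ, cty)) +
  geom_point(color = "blue")

## Constant placed inside aes(): treated as a one-level variable
p_mapped <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(color = "blue"))

## ggplot_build() exposes the colors actually drawn in each layer
const_colors  <- unique(ggplot_build(p_const)$data[[1]]$colour)
mapped_colors <- unique(ggplot_build(p_mapped)$data[[1]]$colour)

print(const_colors)
print(mapped_colors)
```

The first result is the literal value we supplied, while the second is whatever color the default scale assigns to the single category.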

Keep this in mind as we move into the next section.

Jittering and multiple layers

One of my favorite plots is created with the geom_jitter() function which, among its many uses, can be used as an alternative to the boxplot for visualizing the relationship between a categorical and quantitative variable. By default, trying to create points aligning with a categorical variable will cause many of the points to overlap, as they are all aligned within their own category. The jitter function operates by applying a small amount of random variation to the x and y values, producing a nice effect:

## Using points
p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_point()

## Using jitter
p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter()

grid.arrange(p1, p2, nrow = 1)

We can modify the amount of variation with height and width arguments to the jittering function. There is no right or wrong answer, so feel free to play around with it:

p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_point()

p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter()

p3 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, color = 'dodgerblue')

grid.arrange(p1, p2, p3, nrow = 1)
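Because jittering moves points in both directions by default, the plotted y values are no longer exactly the observed ones. If that is a concern, one option is to set height = 0 so that points are only spread horizontally. The sketch below (p_h0 and built_y are illustrative names) checks that the y values are left untouched.

```r
library(ggplot2)

## Jitter horizontally only, so each cty value keeps its exact height
p_h0 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, height = 0)

## The y values used for drawing match the original data
built_y <- ggplot_build(p_h0)$data[[1]]$y
all.equal(sort(built_y), sort(as.numeric(mpg$cty)))

p_h0
```

Since the variation is random, the horizontal positions will differ slightly each time the plot is drawn.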

The plots created by geom_jitter() offer a nice alternative to those created by geom_boxplot() as, instead of summarizing the data into median, Q1, Q3, and outliers, they allow us to see directly the distribution and density of our observations. Of course, there is nothing to stop us from combining these ideas to create more interesting plots: because each geom represents a layer, we can readily add multiple layers to a plot. This allows us to nicely visualize how the geom_jitter and geom_boxplot functions compare. Note, however, that the layers are placed in order, with subsequent ones being created “on top of” the former.

## Remove outliers to make black dots go away
p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue") +
  geom_boxplot(fill = "pink", outliers = FALSE)

p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_boxplot(fill = "pink", outliers = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue")


grid.arrange(p1, p2, nrow = 1)

Notice how in this case we specified color, fill, and alpha all outside of the aes() function. As such, ggplot was able to interpret these values directly.

Adding data summaries

We will conclude our discussion of ggplot in this lab with the introduction of data summaries in our plots. To do so, let’s first consider an aspect of ggplot that we have, until now, taken for granted. Within our ggplot function, we specify both a data argument and an aes() argument, which creates a map from the variables in our data to the aesthetics in the plot. Each time we have added a layer (geom_jitter(), geom_boxplot()), we have inherited both the data and the values of the aesthetics. However, we could also specify data and aes() in each layer instead. The code below creates two identical plots:

(NOTE: When doing this, we have to specify data = mpg in the layer or else we will get an error)

## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
  geom_boxplot()

## Values *not* inherited and are defined in layer
p2 <- ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty))

grid.arrange(p1, p2, nrow = 1)

This will only make a difference if we include an additional layer. As we did in the previous section, let’s try to add points associated with geom_jitter()

## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
  geom_boxplot() + 
  geom_jitter(width = 0.15, alpha = 0.5)

## Values *not* inherited and are defined in layer
p2 <- ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, alpha = 0.5)

grid.arrange(p1, p2, nrow = 1)

Now, our second plot shows the boxplots but no jittered points: because there is no dataset or aesthetics to inherit from ggplot(), the geom_jitter() layer does not produce any points. We can rectify this by also specifying the data and aesthetics in that layer, giving us the desired plot:

## Add data and aes to geom_jitter
ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty)) +
  geom_jitter(data = mpg, aes(drv, cty), width = 0.15, alpha = 0.5)

Why would we care about this when it seems like so much extra work? Because it allows us to create layers that contain completely different datasets.

For example, consider the boxplots above. We know from earlier in the semester that distributions that are skewed have mean values that differ from the median. We can use this technique to illustrate that visually.

If we wanted to plot the mean values for cty for each value of drv, we would start by creating a data summary with dplyr:

drv_summary <- mpg %>% group_by(drv) %>% 
  summarize(meanCty = mean(cty))

print(drv_summary)

## # A tibble: 3 × 2
##   drv   meanCty
##   <chr>   <dbl>
## 1 4        14.3
## 2 f        20.0
## 3 r        14.1

The new dataset drv_summary now has a variable drv, indicating drive train, and the variable meanCty, giving the average city miles per gallon. If we wanted to include just this in a plot, we could do so as follows:

## shape = 7 will give me squares (with an X inside) instead of circles
ggplot(drv_summary, aes(drv, meanCty)) + 
  geom_point(color = 'red', size = 4, shape = 7)

Just as easily, we could include this with the box plots we created above. This time, we add an additional layer and specify data and aes() in the new layer:

ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty)) +
  geom_point(data = drv_summary, aes(drv, meanCty), 
             color = 'red', size = 4, shape = 7)

Now, it becomes immediately clear that for 4wd and fwd, the distribution is skewed right (the mean is greater than the median), while for rwd, the distribution is skewed left (the mean is less than the median).
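The same comparison can be checked numerically. As a small sketch, the summary below adds the median and group size next to the mean (drv_compare and medianCty are names introduced here for illustration).

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

## Mean and median city mpg for each drive train
drv_compare <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty),
            medianCty = median(cty),
            n = n())

print(drv_compare)
```

Comparing the meanCty and medianCty columns row by row shows the direction of the skew in each group.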

The ability to add data summaries to our plots lets us present standard visualizations to a reader while highlighting particular attributes that are not typically represented.
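As a design note, it is not strictly necessary to move everything out of ggplot() to add a summary layer. A possible alternative, sketched below, keeps the main data and aesthetics in ggplot() and overrides only what differs in the summary layer. This relies on the summary data having a column with the same name (drv) as the variable mapped to x; the names drv_means and p_mean are only for illustration.

```r
library(ggplot2)
library(dplyr)

## Summary data: one row per drive train
drv_means <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty))

## x = drv is inherited from ggplot(); only data and y change in the point layer
p_mean <- ggplot(mpg, aes(drv, cty)) +
  geom_boxplot() +
  geom_point(data = drv_means, aes(y = meanCty),
             color = "red", size = 4, shape = 7)

p_mean
```

Either style produces the same kind of plot; specifying everything in each layer is more explicit, while inheriting is shorter when most layers share the same data.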

Extra note on colored points

To create a colored point with a black outline, we need to modify the shape so that we can use both a color (outline) and fill (inside) aesthetic. Below is an example:

df <- data.frame(x = 1, y = 1)

ggplot(df, aes(x, y)) + 
  geom_point(shape = 21, color = 'black', fill = 'coral', size = 8)

This will be useful for the next questions.

Question 6

Recreate the following plot, which shows the mean and median net tuition for each of the regions in the college dataset.

college <- read.csv("https://collinn.github.io/data/college2019.csv")

Note that:

Jitter function has height = 0.15 and alpha = 0.25
The colors are aquamarine for average and magenta for median
The size of the colored points is size = 3.5

Question 7

Recreate this plot from the ecological correlation lecture, where the individual points are the schools and the colored dots get their x and y values from the regions’ average admission rate and median debt, respectively.