Note: Simply add the questions for this lab to the end of the R Markdown document you have been using for Lab 2. We will have some time in class on Monday to wrap things up in case you do not finish.

Introduction

This lab will be a continuation of our exploration of ggplot2. Whereas the first lab was oriented around creating a number of standard plots from the data, here we will focus on a number of ancillary issues, including titles and labels, legends, and themes. The bulk of this lab will be focused on the topic of scales, which manage the relationship between the data and the resulting aesthetics. We will conclude by taking a closer look at some of the arguments that can be used to augment different layers.

Titles and Axes

By default, plots made with ggplot2 do not include a title, and the labels for the axes are taken from the variable names given in aes(). This is the case, for example, in our plot of engine displacement (displ) against highway miles per gallon (hwy):

library(ggplot2)

## Prettier graphs
theme_set(theme_bw())

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

We can add a title or change the x- and y-axis labels with the functions ggtitle(), xlab(), and ylab(), respectively:

library(ggplot2)

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  ggtitle("Engine size to fuel economy") + 
  xlab("Displacement") + 
  ylab("Fuel Economy (Highway)")

Note that just like in the first lab, we can add subsequent components with +. As another note, it is common to create a new line for each layer for readability.

As is typically the case with ggplot2, there are multiple ways to accomplish the same goal. The labs() function allows us to modify multiple labels at once by passing each one as a named argument:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  labs(x = "Displacement", y = "Fuel Economy (Highway)", title = "Engine size to fuel economy")

The labs() function also takes arguments for any grouping aesthetics. For example, if we use the shape aesthetic in aes() we can then pass a shape argument to labs() to rename the legend of the plot (note: the function factor() as in factor(cyl) turns a continuous variable into a categorical one – we will learn more about this later).

The argument name is the same as what is used for creating the groups, and changing these will make corresponding changes in the legend:

## Without label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) + 
  geom_point() 

## With label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) + 
  geom_point() +
  labs(shape = "Cylinders") # Since we used shape aesthetic, we use "shape" here

Question 12: Using the mpg dataset, create a box plot with class on the x-axis and cty on the y-axis. Add a color aesthetic that accounts for year (by default, year is a continuous variable; use factor() to make it categorical). Create appropriate labels for the axes, title, and legend.

Themes

As you might imagine, there is a tremendous variety of options to modify the style of your graphic. The collection of non-data elements of your plot, including the appearance of titles, labels, legends, tick marks, and lines, makes up what is known as the theme. Elements related to the theme are modified with the theme() function; a quick look at ?theme demonstrates how comprehensive this list can be. Here, however, we consider only a small subset of these items to demonstrate how the process works. It is less important that any of these are memorized; rather, knowing that such possibilities exist should assist you when using search engines to learn how to modify your graphics.

The system for modifying themes consists of two components:

  1. The elements that are being modified (e.g., text, tick marks, legend)
  2. The element functions associated with each element that control the visual properties.

For example, elements consisting of text are modified with the element function element_text(). We can also see some of the particulars that can be modified with ?element_text. To motivate this, consider the following box plot:

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot()

Because of the width of our figure, all of the labels on the x-axis are bunched together. We can help fix this problem by rotating the text on the x-axis. That is, we are modifying the element axis.text.x (the text located on the x-axis) with the element function element_text():

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45))

Here, we see that the rotation has helped with the overlapping, but now the text is running into our plot. We can further alter the vertical adjustment with vjust:

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
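A common alternative, offered here as a sketch rather than as this lab's own recommendation, is to right-align the rotated labels with the horizontal adjustment hjust so that each label ends at its tick mark. The object name p_rotated is only for illustration:

```r
library(ggplot2)

## hjust = 1 right-aligns each rotated label against its tick mark
p_rotated <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_rotated
```

Both hjust and vjust take values between 0 and 1, so it is worth trying a few combinations to see which reads best at your figure width.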

It is highly unlikely (and completely unnecessary) that you would remember on your own that text on the x-axis is specified with axis.text.x. However, if you have a general idea of what you want to change, looking through the arguments listed in ?theme will likely turn up something matching what you want to do. This, along with diligent search engine use, makes for a potent strategy in solving most ggproblems.
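To illustrate how elements pair up with element functions, the sketch below changes three elements at once. The particular choices (bold axis titles, no minor grid lines, legend at the bottom) and the object name p_theme are ours and only illustrative:

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  theme(
    axis.title = element_text(face = "bold"),  # text elements use element_text()
    panel.grid.minor = element_blank(),        # element_blank() removes an element
    legend.position = "bottom"                 # some settings take a plain value
  )

p_theme
```

Each argument to theme() names an element, and the value on the right-hand side says how that element should be drawn.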

Question 13: For this question, use the code in the block below. To the plot that is generated, modify the following:

  1. Modify the plot.title by changing its color to red and writing it in italics.
  2. Change the size of the legend text so that it is much bigger than the default.

You will want to investigate ?theme to find the appropriate ways to do this.

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point() + 
  labs(title = "My plot")

Colors

Scales mediate the mapping from discrete and continuous variables to their respective axes, and they play the same role in the relationship between variables and color aesthetics.

Consider the last lab, for example, in which we plotted the relationship between displacement and highway miles colored by cylinder. When cyl was stored as a numeric (or integer) vector, the resulting color scale was continuous, taking all values between dark and light blue. However, once we included cyl as a factor, the color scale became discrete, offering four distinct colors to represent our groups.
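A minimal sketch that reproduces both versions of that plot is given below. The only difference between the two is whether cyl is wrapped in factor(); the object names are ours, chosen for illustration:

```r
library(ggplot2)

## cyl is stored as an integer, so color is mapped to a continuous scale
p_continuous <- ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()

## factor(cyl) is categorical, so color is mapped to a discrete scale
p_discrete <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

p_continuous
p_discrete
```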

This is an illustration of color being treated as either a continuous or discrete scale. And, analogous to the scales we used for our axes, these scales are modified with the functions scale_color_discrete() and scale_color_continuous().

Discrete color scales

The first thing to know about scale_color_discrete() is that, in practice, most people reach for scale_color_brewer() instead, which comes with a full suite of pre-built palettes for use with discrete variables (see ?scale_color_brewer). The great thing about this is that, with minimal effort, we can feel confident that our colors are going to look good:

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Spectral")

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Set2")
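To see which palette names can be passed to scale_color_brewer(), one option is to query the RColorBrewer package directly. This sketch assumes RColorBrewer is installed (it is normally pulled in alongside ggplot2); the variable qual_palettes is only an illustrative name:

```r
library(RColorBrewer)

## One row per palette: maximum number of colors, category, colorblind-friendly flag
brewer.pal.info

## Names of the qualitative palettes, which suit unordered groups such as cyl or drv
qual_palettes <- rownames(brewer.pal.info[brewer.pal.info$category == "qual", ])
qual_palettes

## Preview every palette at once
display.brewer.all()
```

The category column distinguishes qualitative ("qual"), sequential ("seq"), and diverging ("div") palettes, which can help when choosing one that matches the kind of variable being colored.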

Although I don’t recommend it, you can also specify your own colors for different values by passing a named vector to scale_color_manual():

ggplot(mpg, aes(displ, hwy, color = factor(drv))) +
  geom_point() + 
  scale_color_manual(values = c(f = "steelblue", r = "tomato", `4` = "goldenrod1"))

One useful trick for making colors really pop is to change the shape aesthetic of points made by geom_point(). Examples of different shapes are given here, but the one we are interested in specifically is shape 21. Although this shape does have a color aesthetic, it works more like a bar plot: color sets the outline, while fill sets the interior. This creates a nice contrast around the edge of the points:

ggplot(mpg, aes(displ, hwy, fill = factor(drv))) +
  geom_point(shape = 21, color = "black", size = 2) + # Fill defined in aes() 
  scale_fill_manual(values = c(f = "steelblue", r = "tomato", `4` = "goldenrod1"))


Question 14: This time we are going to use the built-in R dataset ChickWeight (?ChickWeight). Put Time on the x-axis and weight on the y-axis, and specify the color aesthetic with Diet. Add a layer with geom_smooth(). Finally, use a different color palette than the default, either a pre-built one with the brewer function or by selecting the colors manually. By the end of the study, which diet seemed to result in chicks with the greatest average weight?

Continuous color scales

There are primarily two types of continuous color scales we will concern ourselves with, and the choice between them depends on what we are trying to demonstrate:

  1. sequential scales - most useful in distinguishing high values from low values
  2. diverging scales - used to put equal emphasis on both high and low ends of the data range

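Two types of continuous color scales are readily available through scale_color_continuous(): gradient and viridis. Both are sequential as used here; a diverging scale can be built with scale_color_gradient2(), which adds a midpoint color.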

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "gradient")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis")

The gradient color type is pretty straightforward, though the colors are typically specified manually (which can be tricky to get to look nice). You can specify a high and a low color, which set the two ends of the gradient. Choosing colors on opposite sides of the color wheel will give you the best contrast.

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "gradient", high = "orange", low = "blue")
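For a diverging scale, which puts equal emphasis on both ends of the range, ggplot2 provides scale_color_gradient2(). The sketch below is one possible setup: centering the scale on the mean of cty and using a light grey midpoint are illustrative choices, not part of the original lab.

```r
library(ggplot2)

## Center the scale on the average city mileage (an illustrative choice)
cty_mid <- mean(mpg$cty)

p_diverging <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_gradient2(low = "blue", mid = "grey90", high = "orange",
                        midpoint = cty_mid)

p_diverging
```

Points near the midpoint fade toward grey, while values far below or far above it take on the low and high colors.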

A list of the colors provided in R is available here.
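The same list can also be browsed from within R using the base function colors(). A short sketch follows, where the variable name all_colors and the search pattern are only for illustration:

```r
## Names of all built-in colors
all_colors <- colors()
length(all_colors)
head(all_colors)

## Search for color names containing a pattern
grep("golden", all_colors, value = TRUE)

## Check that the names used earlier in this lab are valid
c("steelblue", "tomato", "goldenrod1") %in% all_colors
```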


Alternatively, the viridis scales constitute a set of different color maps that are designed with a few thoughts in mind:

  1. Colorful with a wide palette, making differences easy to see
  2. Perceptually uniform, so that values close together have similar colors
  3. Robust to colorblindness, and they also hold up well when printed in black and white

You can read more about viridis scales here.

A range of different viridis scales is provided in ggplot2, though they are not particularly well documented. You can select a scale by passing the additional argument option, which accepts the letters “A” through “H”. Here are a few for illustration:

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "A") + ggtitle("Magma")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "D") + ggtitle("Viridis")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "E") + ggtitle("Cividis")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "H") + ggtitle("Turbo")
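The letters can be hard to remember. As a sketch of an alternative, ggplot2 also provides scale_color_viridis_c(), whose option argument accepts the palette names as well as the letters: "magma" (A), "inferno" (B), "plasma" (C), "viridis" (D), "cividis" (E), "rocket" (F), "mako" (G), and "turbo" (H). The object names below are only for illustration:

```r
library(ggplot2)

## Letter code, as in the examples above
p_letter <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_continuous(type = "viridis", option = "A") +
  ggtitle("Magma")

## Same palette, selected by name
p_named <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_viridis_c(option = "magma") +
  ggtitle("Magma")

p_letter
p_named
```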

Question 15: For this question, we are going to use another dataset built into R, USArrests (see ?USArrests). Create a scatter plot using this data with the urban population on the x-axis and the number of assaults per 100,000 residents on the y-axis. Then, choose two sensible colors and add a color gradient corresponding to the murder rate. Looking at this plot, does it seem that high rates of murder are more likely to correspond with larger urban population or with states with high rates of assault?

## Load data
data("USArrests")