Overview of ggplot2

The ggplot2 package is built around a “grammar of graphics”: a set of independent components that can be combined to build a plot. Every ggplot2 plot has three components:

Data – the information we wish to represent visually
Aesthetics – a mapping between the data and visual properties, e.g., x and y axes, color, shape, etc.
Layers – how each of the observations in the data is rendered, e.g., scatter plot, box plot, etc.

This lab will be a first exploration into what these components are and how they relate to the construction of publication-ready graphics.

Data

For this lab, we will be using the mpg dataset that is included in the ggplot2 package. This dataset contains a subset of fuel economy data that the EPA collected for various car models in 1999 and 2008. We begin with a quick visual inspection of the variables in our dataset:

library(ggplot2) # load package first

## head() will examine the first few rows of a data.frame
head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

A description of all of the variables can be read with ?mpg.
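Beyond head() and ?mpg, a few base R calls give a quick sense of the dataset's size and variable types. The following is only a sketch of one possible inspection. The variable names mpg_dims and mpg_years are introduced here for illustration and are not part of the lab.

```r
library(ggplot2)  # provides the mpg dataset

## Number of rows (vehicles) and columns (variables)
mpg_dims <- dim(mpg)
mpg_dims

## Compact listing of each variable and its storage type
str(mpg)

## The model years covered by the data
mpg_years <- sort(unique(mpg$year))
mpg_years
```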

Aesthetics

Consider the scatterplot below constructed from the mpg dataset:

Here, each observation is represented by a point whose horizontal and vertical positions are determined by the values of two variables (displ and hwy) and whose color is determined by a third (drv). Conceivably, the size and shape of the points could be associated with variables in the data.frame as well. These attributes of a graph are called aesthetics, and they represent a mapping from the variables in the data to visual properties of the graph. This will be an important point to keep in the back of our minds as our aesthetics become increasingly detailed.

Every plot we construct will begin with a call to ggplot(), with the first argument always given as the data.frame. Without defining any aesthetics, though, all we are left with is a space that is reserved for our eventual graphic.

## An empty ggplot with no aesthetics
ggplot(mpg)

Let’s start by establishing the X and Y positions for our graph based on their respective variables. We add aesthetics with the aes() function:

## Every ggplot begins with ggplot()
ggplot(mpg, aes(x = displ, y = hwy))

The resulting plot is a simple XY plane with displ on the x-axis and hwy on the y-axis. Further, note that the ranges of the x and y axes are already determined by the ranges of values for these variables in mpg. This is what we mean when we say that the aesthetics represent a map from the data itself to the construction of our plot.

We also see here that while the aes() function creates a map from the variables to graphical aesthetics, it doesn’t actually add any of our observed data. This is done with layers, the topic of the next section.
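As a rough check of both points, we can compare the variable ranges with the axes drawn above and confirm that the plot holds no layers yet. This sketch assumes the plot object exposes its layers through `$layers`, as ggplot2 objects have historically done. The names p_empty, displ_range, hwy_range, and n_layers are ours.

```r
library(ggplot2)

## A plot with data and aesthetics but no layers yet
p_empty <- ggplot(mpg, aes(x = displ, y = hwy))

## The axis limits are derived from these ranges (plus a little default padding)
displ_range <- range(mpg$displ)
hwy_range <- range(mpg$hwy)
displ_range
hwy_range

## No layers have been added, so no observations are drawn
n_layers <- length(p_empty$layers)
n_layers
```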

Layers

We can think of layers as being the particular details of how we want our data to be portrayed on the ggplot. These can range from simple plots of the xy coordinates as in a scatter plot to more detailed summaries of the data. Notably, a ggplot can have an arbitrary number of layers added to it, which we “add” to a plot with +. For example, let’s add a layer to the previous graph that allows us to see our observed data as points. We do this with geom_point().

When adding layers to a ggplot, it is best practice to end a line with + and include only one additional layer on each subsequent line:

## Best practice is to end each line with a `+` and include only one layer per line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Compared with the plot above, note that we still have the same xy-coordinate system generated with ggplot(mpg, aes(displ, hwy)), but we have now “added” a layer of scatter points that sit on top of the initial plot. We can continue building the plot this way, adding as many layers as we wish. For example, we can add an additional layer showing a fitted line that sits “on top” of the points:

## Adding an additional layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

As we continue working with ggplot, this will be a helpful way to think about the process.
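One way to make this additive process concrete is to store the plot in a variable and add layers to it one at a time. This is a minimal sketch rather than a required workflow. The names base_plot, with_points, and with_smooth are hypothetical.

```r
library(ggplot2)

## Start from the coordinate system only
base_plot <- ggplot(mpg, aes(x = displ, y = hwy))

## Each `+` returns a new plot with one more layer
with_points <- base_plot +
  geom_point()

with_smooth <- with_points +
  geom_smooth()

## Printing a stored plot draws it
with_smooth
```

Because each step returns a new object, base_plot and with_points remain available unchanged and can be reused as starting points for other plots.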

Question 1: What happens when you change the order of the layers for a given plot? Try with the plot above and comment on the difference.

Question 2: Consider the plot below, and identify the data, the aesthetics, and the layers that are used to build the plot

## Penguin dataset
pengy <- read.csv("https://collinn.github.io/data/penguins.csv")

ggplot(pengy, aes(x = bill_length_mm, y = bill_depth_mm, 
                  size = body_mass_g, color = sex)) + 
  geom_point()

Most (but not all) layers that we might add in ggplot begin with geom_ to help indicate that they can be thought of as the actual geometric elements (lines, points, etc.) that you will see on the plot. In addition to geom_point(), there are a few others that are worth being aware of:

geom_point - scatter plots
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoother for data
geom_boxplot - box plot
geom_histogram - histograms
geom_violin - violin plot
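To give a feel for a few of these, the sketch below draws three of the listed geoms with the mpg data. The choice of variables, the bins = 20 setting, and the names hist_plot, bar_plot, and jitter_plot are our own assumptions for illustration.

```r
library(ggplot2)

## Histogram: one quantitative variable on the x-axis
hist_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

## Bar graph: one categorical variable on the x-axis
bar_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()

## Jittered scatter plot: cty and hwy are integers, so many points overlap
jitter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter()

hist_plot
bar_plot
jitter_plot
```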

Question 3: Use geom_boxplot() to create a box plot with the mpg data with the class variable on the x-axis and hwy on the y-axis.

Working with Groups

Plots are primarily generated to quickly convey interesting relationships that exist in our data. In the scatter plot we just generated, we are immediately able to see that as an engine’s displacement increases, the highway fuel economy begins to decrease. One question we may have is whether this pattern is the same across multiple types of vehicles. More generally, we are asking whether the same trend persists across different groupings of our data. We can ask questions like this by modifying the aesthetics of our plots.

A standard XY plot has two dimensions (one horizontal, one vertical), but we can add a third dimension by incorporating color into our figure. Here, we set the color to group by drive train (four-wheel drive, front-wheel drive, rear-wheel drive) using a color argument in aes() and setting it equal to the variable drv:

ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point()

Immediately we can see distinct clusters appear in our plot. In addition to adding colors to the points themselves, ggplot also generated a legend (which, as we will see, is generally associated with the aesthetics of a graph). We will address legends in more detail towards the end of this lab.

In addition to color, we can also set the shapes of the points. And so long as we use the same grouping variable for each, we can combine this aesthetic with color:

## Using shapes instead of color
ggplot(mpg, aes(displ, hwy, shape = drv)) +
  geom_point()

## We can have multiple aesthetics combined in legend
ggplot(mpg, aes(displ, hwy, color = drv, shape = drv)) +
  geom_point()

We can also combine aesthetics for different groups, adding a fourth dimension to our plot. In addition to asking about the drive train when considering the relationship between displacement and fuel economy, we could also consider groups based on the vehicles’ class:

ggplot(mpg, aes(displ, hwy, color = class, shape = drv)) +
  geom_point()

While this is technically correct, we run the risk of overloading our plot, making it difficult to interpret in any useful way. Instead of adding an additional aesthetic to our plot, we can use a technique known as faceting to split our plot across multiple groups. Here, we recreate the plot above, except that we facet on the drv variable rather than mapping it to shape. We do this with the function facet_wrap(), which takes its argument in the form ~variable (note the tilde):

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  facet_wrap(~drv)
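The same pattern works for other grouping variables. As a small sketch, the code below facets on class instead and uses the nrow argument of facet_wrap() to arrange the panels in two rows. The name facet_plot and the choice of two rows are ours.

```r
library(ggplot2)

## Facet on vehicle class, arranging the panels in two rows
facet_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class, nrow = 2)

facet_plot
```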

Question 4: When faceting, should our variable be quantitative, categorical, or does it not matter?

Question 5: Use the penguin dataset loaded above (pengy) to create a ggplot that is faceted on a variable. Describe what the plot shows in a few sentences.

Variable Types and Aesthetics

As we saw above, there are some aesthetics that, when set, appear to create discrete groupings of the data. This behavior, however, is contingent on the aesthetic mapping being towards a variable that is categorical or discrete in nature. For example, suppose we are interested again in the relationship between engine displacement and highway fuel economy, except this time we want to group it by the number of cylinders a vehicle has. We might create something like this:

ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()

What do you notice in the legend? Rather than creating a palette of discrete colors, we are left with a gradient of colors from dark to light blue. This is a consequence of the fact that, by default, the variable cyl is stored as a numerical variable rather than a categorical one. We can use the function factor() to turn a quantitative variable like cyl into a categorical one (we will discuss factor() in more detail later, but as always, you can investigate how this function behaves with ?factor).

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

This is just one example of how the rendering of an aesthetic can change based on the type of variable it is mapped to.
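We can confirm how cyl is stored, and what factor() does to it, without drawing anything. This is a quick sketch in base R. The names cyl_type and cyl_levels are introduced here for illustration.

```r
library(ggplot2)  # provides the mpg dataset

## cyl is stored as an integer, so ggplot treats it as continuous
cyl_type <- class(mpg$cyl)
cyl_type

## factor() converts it to a categorical variable with one level per distinct value
cyl_levels <- levels(factor(mpg$cyl))
cyl_levels
```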

Question 6: Consider the two plots including cyl above. With a partner, come up with at least one reason why each graph might be useful. That is, try to think of a situation when it might make sense to treat cyl as either a continuous or a categorical variable.

Question 7: Create a box plot with class as the x axis and hwy as the y. In order to compare these vehicles from when data was first collected to when it was last collected, specify the color aesthetic as year. Is this the plot that you would expect? Make whatever modifications necessary to produce the intended graph.

More on Aesthetics

So far, we have seen how ggplot() can be used to define a set of aesthetics which are subsequently used by layers to render the appropriate data summaries. For example, we see in the following plot that geom_point() gets its cues on where to draw the X and Y coordinates from the associated aesthetics:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

This is an example of layers inheriting aesthetics from the main ggplot() call. This is not strictly necessary, however; by specifying an aesthetic locally, we can create aesthetics that are defined only within a layer. For example, here we define the X and Y aesthetics within geom_point():

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

While the resulting plots may appear identical, their construction is different in a very critical way. This becomes apparent once you attempt to add additional layers to the plot.
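One way to see the difference in construction is to look at where each plot stores its aesthetic mapping. The sketch below assumes the plot object exposes `$mapping` and `$layers`, as ggplot2 objects have historically done. The names p_global, p_local, and the three *_aes variables are hypothetical.

```r
library(ggplot2)

## Aesthetics defined in ggplot(): stored on the plot itself
p_global <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

## Aesthetics defined in geom_point(): stored on that layer only
p_local <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

## Names of the aesthetics held at the plot level
global_plot_aes <- names(p_global$mapping)
local_plot_aes <- names(p_local$mapping)

## Names of the aesthetics held by the first layer of p_local
local_layer_aes <- names(p_local$layers[[1]]$mapping)

global_plot_aes
local_plot_aes
local_layer_aes
```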

Question 8: Try running the code below:

ggplot(mpg) +
  geom_point(aes(displ, hwy)) +
  geom_smooth()

Why doesn’t this work? Read the error message carefully. How would you alter this code to get the desired result?

Question 9: Look at and run the code below. If the resulting plots are identical, what difference does it make where the color aesthetic is described?

## How are these different, even though they look the same?
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(color = drv))

ggplot(mpg, aes(displ, hwy, color = drv)) + 
  geom_point()

Based on your answer above, detail what happens when the code below is run and explain what is happening in each of these plots and why they look the way that they do. Do they all seem equally useful?

## Plot 1
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(color = drv)) + 
  geom_smooth()

## Plot 2
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(aes(color = drv))

## Plot 3
ggplot(mpg, aes(displ, hwy, color = drv)) + 
  geom_point() +
  geom_smooth()

To wrap up our discussion of aesthetics for now, we briefly consider the impact of constants that are not associated with variables in our data. Whereas aes() can be used to map variables of the data itself onto our plots, there are often times where we wish to specify properties of our plot independently of the data. In general, these constants are specified within a layer, but outside of the aes() function. Here, consider the arguments color and alpha, the latter of which determines the transparency of a geom:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point(alpha = 0.25, color = "magenta")

Remember: because these constant aesthetics do not make any reference to the data, there is no need to provide a legend relating the visual aesthetic to anything the viewer might need. There is also no need to include them inside aes().
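The same idea applies to other visual properties. As a small sketch, the code below fixes the size and shape of every point in addition to its transparency. The specific values (size 3, shape 17 for a filled triangle, alpha 0.5) and the name const_plot are our own choices for illustration.

```r
library(ggplot2)

## size, shape, and alpha are fixed for every point, so they sit outside aes()
const_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 3, shape = 17, alpha = 0.5)

const_plot
```

Since none of these values depends on the data, the resulting plot has no legend.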

Question 10: Create two separate plots using the code below. With your partner, provide answers to the following questions:

Why does geom_point(color = "blue") create dots that are blue?
What is ggplot trying to do when we instead call geom_point(aes(color = "blue"))? Why has it created a legend?
Considering the last two questions, what role does aes() play in the construction of these plots?

## Plot 1
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(color = "blue")

## Plot 2
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(color = "blue"))

TBD

Scales, themes, and colors (oh my!)

Lab #2 - Plotting with ggplot2