ggplot2
The ggplot2 package is intended to follow a “grammar of
graphics” made of independent components that can be added together to
build a plot. Every ggplot2 plot has three components: data, aesthetics, and layers.
This lab will be a first exploration into what these components are and how they relate to the construction of publication-ready graphics.
For this lab, we will be using the mpg dataset that is
included in the ggplot2 package. This dataset contains a
subset of fuel economy data that the EPA collected for various car
models in 1999 and 2008. We begin with a quick visual inspection of the
variables in our dataset:
library(ggplot2) # load package first
## head() will examine the first few rows of a data.frame
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
A description of all of the variables can be read with
?mpg.
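Beyond head(), a few base R functions give a quick overview of the dataset. The lines below are only a sketch of one way to look around; which variables to summarize is an arbitrary choice made here for illustration.

```r
library(ggplot2)

dim(mpg)          # number of rows and columns
names(mpg)        # variable names
summary(mpg$hwy)  # numeric summary of highway miles per gallon
table(mpg$drv)    # counts for each type of drive train
```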
Consider the scatterplot below constructed from the mpg
dataset:
Here, each observation is represented by a point, the
vertical and horizontal positions being determined by
the value of two variables (displ and hwy),
and a color determined by a third (drv). Conceivably, the
size and shape of the points could be associated with variables in the
data.frame as well. These attributes of a graph are called
aesthetics and represent a mapping from the
variables in the data to visual properties in a graph. This will be an
important point to keep in the back of our minds as our aesthetics
become increasingly detailed.
Every plot we construct will begin with a call to
ggplot(), with the first argument always given as the
data.frame. Without defining any aesthetics, though, all we are left
with is a space that is reserved for our eventual graphic.
## An empty ggplot with no aesthetics
ggplot(mpg)
Let’s start by establishing the X and Y positions for our graph based
on their respective variables. We add aesthetics with the
aes() function:
## Every ggplot begins with ggplot()
ggplot(mpg, aes(x = displ, y = hwy))
The resulting plot is a simple XY plane with displ set
as the x-axis and hwy set as the y-axis. Further, note that
the ranges of the x and y axes are already determined by the ranges
of values for those variables in mpg. This is what we mean
when we say that the aesthetics represent a map from the data itself to
the construction of our plot.
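As a quick check on this idea, the sketch below compares the mapped variables' ranges with what appears on the axes. It uses only aes() and base R's range(); the object names m and p are illustrative and not part of the lab.

```r
library(ggplot2)

# The mapping is just a description of which variable goes where
m <- aes(x = displ, y = hwy)
names(m)  # the aesthetics being mapped

# The axis limits on the empty plot should roughly match these ranges
# (ggplot2 pads the axes slightly by default)
range(mpg$displ)
range(mpg$hwy)

# Store the plot in an object; printing it draws the empty XY plane
p <- ggplot(mpg, m)
p
```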
We also see here that while the aes() function creates a
map from the variables to graphical aesthetics, it doesn’t actually add
any of our observed data. This is done with layers, the topic
of the next section.
We can think of layers as being the particular details of how we want
our data to be portrayed on the ggplot. These can range from simple
plots of the xy coordinates as in a scatter plot to more detailed
summaries of the data. Notably, a ggplot can have an arbitrary number of
layers added to it, which we “add” to a plot with +. For
example, let’s add a layer to the previous graph that allows us to see
our observed data as points. We do this with
geom_point().
When adding layers to a ggplot, it is best practice to end a line
with + and only include one additional layer on each
subsequent line:
## Best practice is to end each line with a `+` and include only one layer per line
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
Compared with the plot above, note that we still have the same
xy-coordinate system generated with
ggplot(mpg, aes(displ, hwy)), but we have now “added” a
layer of scatter points that sit on top of the initial plot. We can
continue like this constructively, adding as many layers as we wish. For
example, we can add an additional layer showing a fitted line that sits
“on top” of the points:
## Adding an additional layer
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
As we continue working with ggplot, this will be a helpful way to think about the process.
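Because layers are added with +, a plot can also be stored in an object and built up one step at a time. The sketch below illustrates this; the names p, p_points, and p_both are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Start with the data and aesthetics only
p <- ggplot(mpg, aes(x = displ, y = hwy))

# Add one layer and store the result
p_points <- p +
  geom_point()

# Add a second layer to the stored plot
p_both <- p_points +
  geom_smooth()

p_both  # printing the object draws the plot
```

Each stored object is a complete plot in its own right, so p_points can still be printed on its own after p_both has been created.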
Question 1: What happens when you change the order of the layers for a given plot? Try with the plot above and comment on the difference.
Question 2: Consider the plot below, and identify the data, the aesthetics, and the layers that are used to build the plot:
## Penguin dataset
pengy <- read.csv("https://collinn.github.io/data/penguins.csv")
ggplot(pengy, aes(x = bill_length_mm, y = bill_depth_mm,
size = body_mass_g, color = sex)) +
geom_point()
Most (but not all) layers that we might add in ggplot begin with
geom_ to help indicate that they can be thought of as the
actual geometric elements (lines, points, etc.) that you will see on
the plot. In addition to geom_point(), there are a few
others that are worth being aware of:
geom_point - scatter plots
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoother for data
geom_boxplot - box plot
geom_histogram - histograms
geom_violin - violin plot
Question 3: Use geom_boxplot() to
create a box plot with the mpg data with the
class variable on the x-axis and hwy on the
y-axis.
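Not every geom needs both an x and a y aesthetic. As a rough illustration using two of the geoms listed above, the sketch below draws a histogram and a bar graph, each from a single variable. The object names and the binwidth value are arbitrary choices for this example.

```r
library(ggplot2)

# Histogram of a single quantitative variable;
# binwidth = 2 is an arbitrary choice for illustration
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)
p_hist

# Bar graph of counts for a single categorical variable
p_bar <- ggplot(mpg, aes(x = drv)) +
  geom_bar()
p_bar
```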
Plots are primarily generated to quickly convey interesting relationships that exist in our data. In the scatter plot of displ and hwy that we generated earlier, we can immediately see that as an engine’s displacement increases, highway fuel economy tends to decrease. One question we may have is whether this pattern is the same across multiple types of vehicles. More generally, we are asking whether the same trend persists across different groupings of our data. We can explore questions like this by modifying the aesthetics of our plots.
A standard XY plot has two dimensions (one horizontal, one vertical),
but we can add a third dimension that incorporates color into our
figure. Here, we set the color to group by drive train (four-wheel
drive, front-wheel drive, rear-wheel drive) using a color argument in
aes() and setting it equal to the variable
drv:
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point()
Immediately we can see distinct clusters appear in our plot. In addition to adding colors to the points themselves, ggplot also generated a legend (which, as we will see, is generally associated with the aesthetics of a graph). We will address legends in more detail towards the end of this lab.
In addition to color, we can also set the shapes of the points. And so long as we use the same grouping variable for each, we can combine this aesthetic with color:
## Using shapes instead of color
ggplot(mpg, aes(displ, hwy, shape = drv)) +
geom_point()
## We can have multiple aesthetics combined in legend
ggplot(mpg, aes(displ, hwy, color = drv, shape = drv)) +
geom_point()
We can also combine aesthetics for different groups, adding a fourth dimension to our plot. In addition to asking about the drive train when considering the relationship between displacement and fuel economy, we could also consider groups based on the vehicles’ class:
ggplot(mpg, aes(displ, hwy, color = class, shape = drv)) +
geom_point()
While this is technically correct, we run the risk of overloading our
plot, making it difficult to interpret in any useful way. Instead of
adding an additional aesthetic to our plot, we can instead use a
technique known as faceting to split our plot across multiple
groups. Here, we recreate the plot above, except that we facet on the
drv variable rather than mapping it to shape. We do this with
the function facet_wrap(), which takes its argument in the
form ~variable (note the tilde):
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point() +
facet_wrap(~drv)
Question 4: When faceting, should our variable be quantitative, categorical, or does it not matter?
Question 5: Use the penguin dataset below to create a ggplot that is faceted on a variable. Describe what the plot shows in a few sentences.
As we saw above, there are some aesthetics that, when set, appear to create discrete groupings of the data. This behavior, however, is contingent on the aesthetic mapping being towards a variable that is categorical or discrete in nature. For example, suppose we are interested again in the relationship between engine displacement and highway fuel economy, except this time we want to group it by the number of cylinders a vehicle has. We might create something like this:
ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point()
What do you notice in the legend? Rather than creating a palette of
discrete colors, we are left with a gradient of colors from dark to
light blue. This is a consequence of the fact that, by default, the
variable cyl is stored as a numerical variable rather than
a categorical one. We can use the function factor() to turn
a quantitative variable like cyl into a categorical one (we
will discuss factor() in more detail later, but as always,
you can investigate how this function behaves with
?factor):
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point()
This is just one example of how the rendering of an aesthetic can change based on the type of variable it is mapped to.
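One way to see what ggplot is working with is to inspect the variable directly. The sketch below uses only base R functions to compare cyl as stored with its factor() version; the name cyl_f is hypothetical and used only here.

```r
library(ggplot2)

class(mpg$cyl)            # how cyl is stored in the data
unique(mpg$cyl)           # the distinct values it takes

cyl_f <- factor(mpg$cyl)  # categorical version of the same variable
class(cyl_f)
levels(cyl_f)             # one level per distinct value
```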
Question 6: Consider the two plots including
cyl above. With a partner, come up with at least one reason
why each graph might be useful. That is, try to think of a situation
when it might make sense to treat cyl as either a
continuous or categorical variable.
Question 7: Create a box plot with
class as the x axis and hwy as the y. In order to compare
these vehicles from when data was first collected to when it was last
collected, specify the color aesthetic as year. Is this the
plot that you would expect? Make whatever modifications necessary to
produce the intended graph.
So far, we have seen how ggplot() can be used to define
a set of aesthetics which are subsequently used by layers to render the
appropriate data summaries. For example, we see in the following plot
that geom_point() gets its cues on where to draw the X and Y
coordinates from the associated aesthetics:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
This is an example of layers inheriting aesthetics from the
main ggplot() call. This is not strictly necessary,
however; through the local specification of an aesthetic, we
can create aesthetics that are defined within the layer. For example,
here we define the X and Y aesthetics within
geom_point():
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy))
While the resulting plots may appear identical, their construction is different in a very critical way. This becomes apparent once you attempt to add additional layers to the plot.
Question 8: Try running the code below:
ggplot(mpg) +
geom_point(aes(displ, hwy)) +
geom_smooth()
Why doesn’t this work? Read the error message carefully. How would you alter this code to get the desired result?
Question 9: Look at and run the code below. If the resulting plots are identical, what difference does it make where the color aesthetic is described?
# How are these different? Even though they look the same
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point()
Based on your answer above, detail what happens when the code below is run and explain what is happening in each of these plots and why they look the way that they do. Do they all seem equally useful?
## Plot 1
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
## Plot 2
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(aes(color = drv))
## Plot 3
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth()
To wrap up our discussion of aesthetics for now, we briefly consider
the impact of constants that are not associated with
variables in our data. Whereas aes() can be used to map
variables of the data itself onto our plots, there are often times when
we wish to make specifications of our plot independently of the
data. In general, these constants are specified within a layer,
but outside of the aes() function. Here, consider
the arguments color and alpha, the latter of which
determines the transparency of a geom:
ggplot(mpg, aes(displ, hwy)) +
geom_point(alpha = 0.25, color = "magenta")
Remember: because these constant aesthetics do not make any reference
to the data, there is no need to provide a legend relating the visual
aesthetic to anything the viewer might need. There is also no need to
include them inside aes().
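Mapped aesthetics and constants can also appear in the same layer. The sketch below is one illustration: color is mapped to drv inside aes(), while alpha and size are set as constants outside of it. The particular values 0.5 and 3 are arbitrary choices for this example.

```r
library(ggplot2)

# color is mapped to a variable (inside aes), so it gets a legend;
# alpha and size are constants (outside aes), so they do not
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv), alpha = 0.5, size = 3)
p
```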
Question 10: Create two separate plots using the code below. With your partner, provide answers to the following questions:
Why does geom_point(color = "blue") create dots that are blue?
Now consider geom_point(aes(color = "blue")). Why has it created a legend?
What role does aes() play in the construction of these plots?
## Plot 1
ggplot(mpg, aes(displ, hwy)) +
geom_point(color = "blue")
## Plot 2
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = "blue"))
Scales, themes, and colors (oh my!)