ggplot2
Each time you open a new R Markdown file, you are greeted with a “setup” block that looks like this:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Within this, we are able to pass additional arguments that modify how the rest of the document will render. A full list is included here. For this lab, helpful options to add will include centering the plots as well as changing their size to keep our resulting documents from getting too large. This can be done like this:
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', fig.height = 4, fig.width = 4)
A few additional ones that are often useful are included here:
- `echo = FALSE`: This will not print out the code written in the block. It is useful for setup (e.g., reading in data, defining functions) that is not relevant for the final document.
- `include = FALSE`: This runs the code but keeps both the code and its output out of the document. This includes results, plots, messages, etc.
- `warning = FALSE`: This will hide warnings from printing out, which is often helpful.
- `message = FALSE`: Just like above, this one will hide messages.
- `eval = FALSE`: This will stop R from running the code in the block, but it will still print it. I use this, for example, when creating code blocks I want you to run without having the document run them itself, particularly in cases where the code would break or cause an error.

While options set in the setup block will affect all of the subsequent blocks, these can be overridden in each individual chunk. This is done simply enough by adding a comma after the initial `r`, followed by the intended arguments:
```{r, fig.align = 'center', echo = FALSE}
```
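To confirm which defaults a document is actually using, the values can be read back with `knitr::opts_chunk$get()`. This is a minimal sketch, assuming the knitr package is installed; the option values are simply the ones suggested for this lab.

```r
library(knitr)

# Set document-wide defaults, as in the setup block
opts_chunk$set(echo = TRUE, fig.align = "center", fig.height = 4, fig.width = 4)

# Read a single default back
opts_chunk$get("fig.align")

# Read several defaults at once (returned as a named list)
opts_chunk$get(c("fig.height", "fig.width"))
```

Options given in an individual chunk header apply to that chunk only and leave these defaults unchanged.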
The `ggplot2` package is intended to follow a “grammar of graphics” made of independent components that can be added together to build a plot. Every `ggplot2` plot has three components: the data, the aesthetics, and the layers.
This lab will be a first exploration into what these components are and how they relate to the construction of publication-ready graphics.
For this lab, we will be using the `mpg` dataset that is included in the `ggplot2` package. This dataset contains a subset of fuel economy data that the EPA collected for various car models in 1999 and 2008. We begin with a quick visual inspection of the variables in our dataset:
library(ggplot2) # load package first
## mpg ships with ggplot2 as a tibble, so we convert it to a plain data.frame first
mpg <- as.data.frame(mpg)
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
A description of all of the variables can be read with `?mpg`.
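Beyond `head()`, a few base R summaries give a quick sense of the size and coverage of the data. This is only a sketch of one way to look; the choice of `year` and `drv` as variables to tabulate is an illustration, not a required step.

```r
library(ggplot2)

# Number of rows (vehicles) and columns (variables)
dim(mpg)

# The model years represented in the data
sort(unique(mpg$year))

# How many vehicles fall under each drive train
table(mpg$drv)
```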
The aesthetics of a `ggplot2` plot (which I will henceforth simply call a “ggplot”) establish the most general properties relating our collected data to the visual representation we wish to create. Specifically, the aesthetics create a map from the variables in the data to visual properties of the graph. This will be an important point to keep in the back of our minds as our aesthetics become increasingly detailed.
For now, suppose we wish to consider the relationship between engine displacement and highway fuel economy in the `mpg` dataset. The first order of business would be to establish these variables as the x and y axes.
## Every ggplot begins with ggplot()
ggplot(mpg, aes(x = displ, y = hwy))
The resulting plot is a simple XY plane with `displ` set as the x-axis and `hwy` set as the y-axis. Further, note that the ranges of the x and y axes are already determined by the ranges of values for these variables in `mpg`. This is what we mean when we say that the aesthetics represent a map from the data itself to the construction of our plot.
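One way to see this mapping at work is to compare the axis limits of the empty plot with the ranges of the two variables. The sketch below uses only base R's `range()`. By default `ggplot2` pads continuous axes slightly, so the drawn limits extend a little beyond these values. The object name `p` is chosen here for illustration.

```r
library(ggplot2)

# Ranges of the mapped variables; the axes of the empty plot span these values
range(mpg$displ)
range(mpg$hwy)

# The same empty plot, stored so it can be printed or extended later
p <- ggplot(mpg, aes(x = displ, y = hwy))
p
```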
While `aes()` (short for aesthetics) can take a number of interesting values, for now we will leave it as is and move on to creating the first layer for our plot.
We can think of layers as being the particular details of how we want our data to be portrayed on the ggplot. These can range from simple plots of the xy coordinates, as in a scatter plot, to more detailed summaries of the data. Notably, a ggplot can have an arbitrary number of layers added to it, which we “add” to a plot with `+`. We will utilize this later in the lab as we work to create more interesting plots. For now, we will begin by adding a layer that generates points with the function `geom_point()`.
When adding layers to a ggplot, it is best practice to end a line with `+` and only include one additional layer on each subsequent line:
ggplot(mpg, aes(displ, hwy)) +
geom_point()
Compared with the plot above, note that we still have the same xy-coordinate system generated with `ggplot(mpg, aes(displ, hwy))`, but we have now “added” a layer of scatter points that sit on top of the initial plot. As we continue working with ggplot, this will be a helpful way to think about the process.
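The scatter plot code writes `aes(displ, hwy)` where the earlier call wrote `aes(x = displ, y = hwy)`. The first two arguments of `aes()` are `x` and `y`, so the two forms are equivalent. A small sketch that checks this with `layer_data()`, a `ggplot2` helper that returns the data a layer actually draws (the object names `p_named` and `p_positional` are made up for illustration):

```r
library(ggplot2)

p_named      <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p_positional <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# layer_data() returns one row per point, with the mapped x and y positions
d_named      <- layer_data(p_named, 1)
d_positional <- layer_data(p_positional, 1)

head(d_named[, c("x", "y")])

# Both spellings place every point at the same position
identical(d_named$x, d_positional$x)
identical(d_named$y, d_positional$y)
```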
Question 1 Identify the three components that were needed to produce the scatter plot above, namely the data, the aesthetics, and the layers.
Most (but not all) layers that we might add in ggplot begin with `geom_` to help indicate that they can be thought of as the actual, geometric elements (lines, points, etc.) that you will see on the plot. In addition to `geom_point()`, there are a few others that are worth being aware of:

- `geom_point`: scatter plots
- `geom_jitter`: scatter plot with “jittered” coordinates
- `geom_bar`: bar graph for categorical data
- `geom_smooth`: smoother for data
- `geom_boxplot`: box plot
- `geom_histogram`: histograms
- `geom_violin`: violin plot

Question 2 Use `geom_boxplot()` to create a box plot with the `mpg` data with the `class` variable on the x-axis and `hwy` on the y-axis.
Plots are primarily generated to quickly convey interesting relationships that exist in our data. In the scatter plot we just generated, we are immediately able to see that a larger engine displacement is generally associated with reduced highway fuel economy. One question we may have is whether or not this pattern is the same across multiple types of vehicles. Or, more generally, we are asking if this same trend persists across different groupings of our data. We can ask questions like this by modifying the aesthetics of our plots.
A standard XY plot has two dimensions (one horizontal, one vertical), but we can add a third dimension that incorporates color into our figure. Here, we set the color to group by drive train (four wheel drive, front wheel drive, rear wheel drive) using a color argument in `aes()` and setting it equal to the variable `drv`:
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point()
Immediately we can see distinct clusters appear in our plot. In addition to adding colors to the points themselves, ggplot also generated a legend (which as we will see is generally associated with the aesthetics of a graph). We will address legends more in the next lab.
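To make the idea of a mapping concrete, the sketch below counts the distinct colours that were assigned to the points. It assumes the default colour scale and uses `layer_data()`, a `ggplot2` helper that returns the data a layer draws; the object names are chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point()

# layer_data() exposes the colour chosen for every point
point_data <- layer_data(p, 1)

# One colour per drive train level
length(unique(point_data$colour))
length(unique(mpg$drv))

# The number of points per colour matches the number of vehicles per drive train
table(point_data$colour)
table(mpg$drv)
```

Each value of `drv` is translated into exactly one colour, and the legend records that translation.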
In addition to color, we can also set the shapes of the points. And so long as we use the same grouping variable for each, we can combine this aesthetic with color:
## Using shapes instead of color
ggplot(mpg, aes(displ, hwy, shape = drv)) +
geom_point()
## We can have multiple aesthetics combined in legend
ggplot(mpg, aes(displ, hwy, color = drv, shape = drv)) +
geom_point()
We can also combine aesthetics for different groups, adding a fourth dimension to our data. In addition to asking about the drive train when considering the relationship between displacement and fuel economy, we could also consider groups based on the vehicles’ `class`:
ggplot(mpg, aes(displ, hwy, color = class, shape = drv)) +
geom_point()
While this is technically correct, we run the risk of overloading our plot, making it difficult to interpret in any useful way. Instead of adding an additional aesthetic to our plot, we can use a technique known as faceting to split our plot across multiple groups. Here, we recreate the plot above, except that we facet on the `drv` variable rather than mapping it to shape. We do this with the function `facet_wrap()`, which takes its argument in the form `~variable` (note the tilde):
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point() +
facet_wrap(~drv)
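Faceting assigns every observation to a panel. As a rough check, the sketch below counts the panels in the faceted plot through the `PANEL` column returned by `layer_data()`. It also uses `vars(drv)`, which `facet_wrap()` accepts as an alternative to the formula `~drv`. The object names are illustrative.

```r
library(ggplot2)

p_formula <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  facet_wrap(~drv)

# Equivalent specification using vars() instead of a formula
p_vars <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  facet_wrap(vars(drv))

# One panel per value of drv
n_panels_formula <- length(unique(layer_data(p_formula, 1)$PANEL))
n_panels_vars    <- length(unique(layer_data(p_vars, 1)$PANEL))
n_panels_formula
n_panels_vars
```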
Question 3 Replicate the same box plot from Question 2, this time faceting on the variable `drv`.
Once the main ggplot has been created with `ggplot()`, any number of layers can subsequently be added to the graphic. A common layer to add to an underlying graphic is that of a data summary, often a statistic, that further summarizes trends in the data. Here, consider the scatter plot displaying the relationship between `displ` and `hwy`, but now with an additional layer that adds a smoothing function to the data with the layer `geom_smooth()`:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
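With no arguments, `geom_smooth()` chooses a smoothing method on its own and reports the choice in a message when the plot is drawn. If a particular fit is wanted, it can be requested explicitly. A sketch, assuming a straight-line fit is of interest, using the `method` and `se` arguments of `geom_smooth()` (the name `p_lm` is made up for illustration):

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # linear fit, no confidence band
p_lm

# The smoother is the second layer; its data are the fitted values along x
smooth_data <- layer_data(p_lm, 2)
head(smooth_data[, c("x", "y")])
```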
Note that neither `geom_point()` nor `geom_smooth()` requires additional information about the plot: aesthetics that are set in `ggplot()` are automatically inherited by subsequent layers. If an aesthetic is only set within a single layer, it will only apply to that layer.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
We will look at the relationship between aesthetics and layers more in the next section.
Layers can also be added that provide a statistical summary of the data. For example, consider a box plot illustrating the relationship between vehicle drive train and displacement. Note that the components of the box plot represent the median, the quartiles, and whiskers that by default extend to the most extreme observations within 1.5 times the IQR of the box edges:
ggplot(mpg, aes(displ, drv)) +
geom_boxplot()
Suppose that we wish to add an indication of the mean value of displacement as well. We can plot statistical summaries of our data with the function `stat_summary()`. Here, we specify the function we wish to compute (`fun = "mean"`) and indicate the geom we wish it to plot (`geom = "point"`). We will also change the color and size to make it easier to see:
ggplot(mpg, aes(displ, drv)) +
geom_boxplot() +
stat_summary(geom = "point", fun = "mean", color = "red", size = 2)
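The red points mark the mean displacement within each drive train. As a cross-check, the same values can be computed with base R's `tapply()` and compared with what the summary layer draws. This is a sketch; `p` and `group_means` are names chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, drv)) +
  geom_boxplot() +
  stat_summary(geom = "point", fun = "mean", color = "red", size = 2)

# Mean displacement for each drive train, computed outside of ggplot2
group_means <- tapply(mpg$displ, mpg$drv, mean)
group_means

# Horizontal positions of the red points (the summary is the second layer)
summary_x <- layer_data(p, 2)$x
summary_x
```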
Question 4 Rather than `geom_boxplot()`, add `geom_jitter(height = 0.15)` to create a plot with `displ` on the x-axis and `drv` on the y-axis. Add summaries for both mean (blue) and median (red).
Question 5 Notice that in this section, we made several calls to layers that included arguments specifying visual aspects of the data (`stat_summary(..., color = 'red', size = 2)` and `geom_jitter(height = 0.15)`) without using the `aes()` function. Why did this still work? What impact did it have on the plots that were created? (Hint: Think about the specific purpose of `aes()` and consider how that differs from the examples we just gave.)
As was briefly mentioned earlier, aesthetics that are set in `ggplot()` are automatically inherited by subsequent layers, while those set within a single layer will only apply to that layer. For a plot with only a single layer, the relevant aesthetics can be set in either place:
ggplot(mpg, aes(displ, hwy)) +
geom_point()
ggplot(mpg) +
geom_point(aes(displ, hwy))
While the resulting plots may appear identical, their construction is different in a very critical way. This becomes apparent once you attempt to add additional layers to the plot.
Question 6 Try running the code below:
ggplot(mpg) +
geom_point(aes(displ, hwy)) +
geom_smooth()
Why doesn’t this work? How would you alter this code to get the desired result?
Question 7 Look at and run the code below. If the resulting plots are identical, what difference does it make where the color aesthetic is described?
# How are these different? Even though they look the same
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point()
Based on your answer above, detail what happens when the code below is run and explain what is happening in each of these plots and why they look the way that they do.
## Plot 1
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
## Plot 2
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth()
As you have likely noticed, there are several aesthetics that, when set, appear to create discrete groupings of the data. For example, think back to the earlier question in which setting the color aesthetic in different layers altered how `geom_smooth()` produced its summary. While this is sometimes a consequence of the specific aesthetics set (we will see more in the next lab), it is often a consequence of the type of variable that is being used.
Consider, for example, looking again at our plot of displacement against highway fuel economy, except this time specifying color by the number of cylinders a vehicle has:
ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point()
Why do you think this looks the way that it does?
Question 8 What class of variable is `cyl` in the `mpg` dataset? Now look at the legend. What effect do you think the class has on how the legend appears?
Question 9 Now run the following code:
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point()
How does this change the legend and the colors on the plot? Which of these do you think is more appropriate for this graph?
We see that for some aesthetics, depending on the variable type, `ggplot2` will map the variable either to a continuous scale, as was the case when `cyl` was used as a numeric variable, or to a discrete scale, once it was changed to a factor. As we can see, variables on a discrete scale are used to infer an underlying “grouping” structure, which can have downstream effects on other layers that we add:
ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point() +
geom_smooth()
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() +
geom_smooth()
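The grouping that a discrete variable induces can be inspected directly. In the sketch below, `layer_data()` returns a `group` column for the point layer; `ggplot2` uses -1 there to mean that no grouping was inferred. Only the point layer is included so that the check does not depend on the smoother, and the object names are illustrative.

```r
library(ggplot2)

p_numeric <- ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()
p_factor <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

# Numeric cyl: a continuous scale, so every point falls in the same (non-)group
groups_numeric <- unique(layer_data(p_numeric, 1)$group)
groups_numeric

# Factor cyl: a discrete scale, so there is one group per distinct cylinder count
groups_factor <- sort(unique(layer_data(p_factor, 1)$group))
groups_factor
```

A layer such as `geom_smooth()` computes its summary once per group, which is why the two plots above differ once a smoother is added.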
Question 10 Create a box plot with `class` as the x-axis and `hwy` as the y-axis. In order to compare these vehicles from when data was first collected to when it was last collected, specify the color aesthetic as `year`. Is this the plot that you would expect? Make whatever modifications are necessary to produce the intended graph.
To wrap up our discussion of aesthetics for now, we briefly consider the impact of constants. Whereas `aes()` can be used to map properties of the data itself onto our plots, there are often times when we wish to specify properties of our plot independently of the data. In fact, we saw this already in a previous question on `stat_summary()`, where we were able to specify both size and color without any reference to the data. In general, these constants are specified within a layer, but outside of the `aes()` function. Here, consider the argument `alpha`, which determines the transparency of a geom:
ggplot(mpg, aes(displ, hwy)) +
geom_point(alpha = 0.25)
Remember: because these constant aesthetics do not make any reference to the data, there is nothing for a legend to explain, and so none is drawn for them.
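Because a constant is applied to the whole layer, every point receives the same value. A short sketch that confirms this for the transparency example, again using `layer_data()` (the name `p_const` is chosen for illustration):

```r
library(ggplot2)

p_const <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.25)

# Every row of the point layer carries the same alpha value
point_alpha <- unique(layer_data(p_const, 1)$alpha)
point_alpha
```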
Question 11 Create two separate plots using the code below. What are all of the visual differences between these two plots? Why doesn’t the first one have a legend, and why do you think the second one isn’t blue? What role does the `aes()` function (or lack of) play in these plots?
ggplot(mpg, aes(displ, hwy)) +
geom_point(color = "blue")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = "blue"))