ggplot2 continued
This lab will be a continuation of our exploration of ggplot2. Whereas the first lab was oriented around creating
a number of standard plots from the data, here we will focus on a number
of ancillary issues, including titles and labels, legends, and themes.
The bulk of this lab will be focused on the topic of scales,
which manage the relationship between the data and the resulting
aesthetics. We will conclude by taking a closer look at some of the
arguments that can be used to augment different layers.
By default, plots made with ggplot2 do not include a title, and the
labels for the axes are taken from the variable names given in aes().
This is the case, for example, in our plot of engine displacement and
highway miles per gallon:
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
geom_point()
We can add titles or change the x and y axis labels with the
functions ggtitle(), xlab(), and ylab(), respectively:
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
ggtitle("Engine size to fuel economy") +
xlab("Displacement") +
ylab("Fuel Economy (Highway)")
As is typically the case with ggplot2, there are multiple ways to
accomplish the same goal. The labs() function allows us to modify
multiple labels at once by naming each one as an argument:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
labs(x = "Displacement", y = "Fuel Economy (Highway)", title = "Engine size to fuel economy")
The labs() function also takes arguments for any grouping aesthetics.
The argument name is the same as the aesthetic used for creating the
groups, and changing it will make the corresponding change in the
legend:
## Without label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) +
geom_point()
## With label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) +
geom_point() +
labs(shape = "Cylinders") # Since we used shape aesthetic, we use "shape" here
Question 12 Using the mpg dataset, create a boxplot with class on the
x-axis and cty on the y-axis. Add a color aesthetic that accounts for
year (don’t forget to address the fact that, by default, year is a
continuous variable). Create appropriate labels for the axes, title,
and legend.
As you might imagine, there is a tremendous variety of options for
modifying the style of your graphic. The collection of non-data related
elements of your plot, including the appearance of titles, labels,
legends, tick marks, and lines, makes up what is known as the theme.
Elements related to the theme are modified with the theme() function; a
quick look at ?theme demonstrates how comprehensive this list can be.
Here, however, we consider only a small subset of these items to
demonstrate how the process works. It is less important that any of
these are memorized; rather, knowing that such possibilities exist
should assist you when using search engines to learn how to modify your
graphics.
The system for modifying themes consists of two components: the element
to be modified (for example, axis.text.x) and an element function that
describes how it should be modified. For example, elements consisting of
text are modified with the element function element_text(). We can also
see some of the particulars that can be modified with ?element_text.
To motivate this, consider a plot that we constructed in the previous
lab:
ggplot(mpg, aes(class, hwy)) +
geom_boxplot()
Because of the width of our figure, all of the labels on the x-axis
are bunched together. We can help fix this problem by rotating the axis
text on the x-axis. That is, we are modifying the element axis.text.x
(that is, text that is located on the x-axis) with the element function
element_text():
ggplot(mpg, aes(class, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45))
Here, we see that the rotation has helped with the overlapping, but
now the text is running into our plot. We can further alter the
vertical adjustment with vjust:
ggplot(mpg, aes(class, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
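Rotated labels are centered on their tick marks by default, which is part of why they drift into the plot. A common companion to vjust, offered here as a sketch rather than the only fix, is the horizontal adjustment hjust: setting hjust = 1 right-aligns each rotated label so that it ends at its tick mark. The particular hjust and vjust values below are illustrative and may need tuning for your figure size.

```r
library(ggplot2)

# Rotate the x-axis labels and right-align them so the end of each
# label sits at its tick mark (hjust/vjust values are illustrative)
p <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
p
```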
Note: in a real-world setting, it is highly unlikely that you would
recall that text on the x-axis is specified with axis.text.x. However,
if you have a general idea of what you want to change, looking through
the arguments listed in ?theme will likely turn up something matching
what you are looking for. This, combined with diligent search engine
use, is a potent strategy for solving most ggproblems.
Question 13 For this question, use the code in the block below. To the plot that is generated, modify the following:
plot.title by changing its color to red and writing it in italics.
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point()
There are a number of pre-built themes that come included in ggplot2,
serving as templates from which you can further tailor your graphics.
Here, for example, we consider a black and white theme, generated by
adding theme_bw() to our constructed plot:
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() +
labs(x = "Displacement", y = "Highway", title = "Snappy title", color = "Cylinders") +
theme_bw() # adds a black and white theme
Other pre-built themes:
theme_bw()
theme_linedraw(), theme_light() and theme_dark()
theme_minimal()
theme_classic()
theme_void()
You can judge the differences in these themes below:
Any theme can be further customized using theme().
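To compare the pre-built themes on your own machine, one option is to build a single base plot and add each theme to it in turn. The sketch below is one way of doing so; the object names base_plot, themes, and theme_plots are our own and not part of ggplot2.

```r
library(ggplot2)

# Base plot shared by every theme
base_plot <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders")

# Named list of the pre-built theme functions listed above
themes <- list(
  bw = theme_bw, linedraw = theme_linedraw, light = theme_light,
  dark = theme_dark, minimal = theme_minimal, classic = theme_classic,
  void = theme_void
)

# One plot per theme, titled with the name of the theme function
theme_plots <- lapply(names(themes), function(nm) {
  base_plot + themes[[nm]]() + ggtitle(paste0("theme_", nm, "()"))
})
names(theme_plots) <- names(themes)

theme_plots$minimal  # print any one of them by name
```

Printing the elements of theme_plots one at a time makes the differences in background, grid lines, and borders easy to see.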
Question 14 Create a ggplot that includes either a color or shape
aesthetic, with appropriately labeled axes, legend, and title. Add any
of the pre-built themes shown above. Then, using the theme() function,
further modify the plot so that the legend position is on the bottom
(Hint: ?theme)
From our discussion in the previous lab, we know that aesthetics are
responsible for creating a map from the data to the visual aspects of
the plot generated. The specific details of how this mapping occurs are
contained within the concept of scales. Scales, for example, are
responsible for determining the length of the x-axis or the specific
colors and shapes generated by an aesthetic. Here we are going to limit
our discussion to the axes and colors, but the general principles hold
for all of the aesthetics generated by ggplot2.
A major concept that will be critical to keep in mind during this
section is the distinction between continuous and discrete values.
Continuous values are those that exist on a numeric spectrum (for the
purposes of ggplot2, this includes variables stored as integers), while
discrete values are those that take on a limited (and generally small)
number of unique values. In the mpg dataset that we have been using so
far, the highway fuel economy hwy would be an example of a continuous
variable, while the class of vehicle, class, would be an example of a
discrete variable.
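In practice, ggplot2 decides between a continuous and a discrete scale by looking at how a variable is stored rather than at how many values it takes. A quick check of the two variables just mentioned, as a minimal sketch:

```r
library(ggplot2)

# How each column is stored
class(mpg$hwy)    # a numeric type, so ggplot2 gives it a continuous scale
class(mpg$class)  # character, so ggplot2 gives it a discrete scale

# Number of distinct values in each column
length(unique(mpg$hwy))
length(unique(mpg$class))
```

This is also why a numeric variable with only a few distinct values, such as cyl or year, is treated as continuous until it is wrapped in factor().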
Whenever an aesthetic is added to a ggplot, an associated scale is created behind the scenes. For example, as both the vehicle displacement and highway fuel economy are continuous variables, scales for both the x and y axes are made continuous. We can see that when we explicitly add these scales to the plot, nothing changes:
## Scales created behind the scenes
ggplot(mpg, aes(displ, hwy)) +
geom_point()
## That is the exact same as this
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous() + # creates continuous x axis
scale_y_continuous() # creates continuous y axis
The same thing occurs when one of the variables is discrete
## Scales created behind the scenes
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot()
## One of these is now discrete
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot() +
scale_x_discrete() +
scale_y_continuous()
If we were to try and add a scale that did not match the variable type, we would get an error
## x is discrete but we try to include continuous, resulting in error
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot() +
scale_x_continuous() +
scale_y_continuous()
## Error: Discrete value supplied to continuous scale
Again, because these scale terms are added automatically behind the scenes, we never have to worry about including them specifically unless we wish to change something about them. For now, we will only concern ourselves with breaks, labels, and transformations.
Question 15 Using the code block below, explicitly add the scales for the x and y axes that would otherwise show up by default
ggplot(mpg, aes(displ, drv)) +
geom_boxplot()
Breaks and labels refer to the tick marks on the x and y axes. Breaks refer to the actual location on the axes we wish to have marks, while labels refer to the labels of the breaks. Each of these takes a vector argument, and if both are provided, they must be the same length:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(breaks = c(2, 4, 4.5, 7), labels = c("2", "4", "4 & 1/2", "7"))
Breaks (and their labels) that fall outside the limits of the scale are not drawn, and by default those limits are set by the range of the data. So, for example, displacement ranges from 1.6 to 7, and any breaks outside of this range will be ignored.
range(mpg$displ)
## [1] 1.6 7.0
## Because 0 and 8 are not in range, they are ignored
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(breaks = c(0, 2, 4, 4.5, 7, 8), labels = c("0", "2", "4", "4 & 1/2", "7", "8"))
If we did want to include breaks outside of our range, we can do so by
adding a limits argument to our scale function that gives new minimum
and maximum values. This is often useful if we wish to include zero in
our plot, even if zero is not within the range of the data.
## Increase range of x axis to include 0 and 10
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(breaks = c(0, 2, 4, 4.5, 7, 8),
labels = c("0", "2", "4", "4 & 1/2", "7", "8"),
limits = c(0, 10))
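One caveat, added here as an aside: widening the limits is harmless, but scale limits that are narrower than the data cause the out-of-range observations to be discarded (ggplot2 warns about removed rows when drawing), whereas coord_cartesian() only zooms the view. The sketch below contrasts the two; the limits c(2, 6) are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Scale limits narrower than the data: points outside 2-6 are dropped
p_scale <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(2, 6))

# Coordinate limits: same window, but every observation is kept
p_zoom <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(2, 6))

# Count x values that became missing once the scale was applied
sum(is.na(layer_data(p_scale)$x))
sum(is.na(layer_data(p_zoom)$x))
```

For a scatter plot the two look similar, but for layers that summarize the data, such as boxplots or smoothers, dropping observations changes what is drawn.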
When the values are discrete, rather than continuous, the breaks
cannot be adjusted, as each tick corresponds to a different group. We
can, however, change the labels of these groups with a named vector.
The names of the vector must correspond to the names of the groups. So,
for example, knowing that the drv variable has categories “r”, “f”, and
“4”, we can make adjustments as such:
ggplot(mpg, aes(hwy, drv)) +
geom_boxplot() +
scale_y_discrete(labels = c(r = "Rear", f = "Forward", `4` = "4WD"))
A few things to note from this last plot:
Because drv is on the y-axis, we need to be sure to use the y scale.
Question 16 Write code to recreate the plot below as closely as possible. In particular, consider themes, breaks, and labels.
This last section on our axes scales involves transformations and is
generally only associated with continuous variables. These are done
with the trans argument provided in the scale function. For example, if
we wish to plot the relationship between displacement and fuel economy
in descending order, we could do this by reversing the relevant axis:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(trans = "reverse")
Other transformations help us identify trends that are on disproportionate scales. For example, consider this contrived dataset where each observation is 10 times larger than the one before it. This makes it difficult to see any meaningful relationship earlier in the plot. By adjusting the scale for y to be on a \(\log_{10}\) scale, we are able to better see what is going on:
df <- data.frame(x = 1:10, # 1 - 10
y = 10^(1:10)) # 10^1 - 10^10
df
## x y
## 1 1 1e+01
## 2 2 1e+02
## 3 3 1e+03
## 4 4 1e+04
## 5 5 1e+05
## 6 6 1e+06
## 7 7 1e+07
## 8 8 1e+08
## 9 9 1e+09
## 10 10 1e+10
## Very difficult to see this relationship for smaller values
ggplot(df, aes(x, y)) +
geom_point()
## On a more appropriate scale, we see the relationship is linear
ggplot(df, aes(x, y)) +
geom_point() + ylab("log 10 scale") +
scale_y_continuous(trans = "log10")
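ggplot2 also ships shorthand scale functions for the most common transformations, which can be easier to read than the trans argument. The sketch below repeats the two plots above using scale_x_reverse() and scale_y_log10(). As far as we are aware, recent releases of ggplot2 (3.5.0 onward) prefer the argument name transform over trans, so you may see a deprecation message with the longer form; treat that version detail as an assumption to check against your installed version.

```r
library(ggplot2)

df <- data.frame(x = 1:10, y = 10^(1:10))

# Shorthand for scale_x_continuous(trans = "reverse")
p_rev <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_reverse()

# Shorthand for scale_y_continuous(trans = "log10")
p_log <- ggplot(df, aes(x, y)) +
  geom_point() +
  scale_y_log10()

# The log10 transformation is what puts the points on a straight line
log10(df$y)
```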
Question 17 The variable hwy in the mpg dataset gives us fuel economy,
which is measured in miles per gallon; however, another common metric
is fuel consumption, which is measured in gallons per mile. Use the
transformation "reciprocal" to create a scatter plot showing the
relationship between displacement size and fuel consumption. (Note: for
now, don’t worry about the fact that the labels and breaks on the
y-axis have disappeared; this appears to be a mistake in R.)
As mentioned above and explored in the previous section, ggplot2 manages the relationship between the data and aesthetics through the use of scales. We also saw that the scales used for the axes differed depending on whether the associated variables were continuous or discrete. As we will now see, the relationship between data and the color aesthetic is no different.
Consider the last lab, for example, in which we plotted the
relationship between displacement and highway miles colored by cylinder.
When cyl was stored as a numeric (or integer) vector, the resulting
color scale was continuous, taking all values between dark and light
blue. However, once we included color as a factor, the color scale
became discrete, offering four distinct colors to represent our groups:
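The two plots being described can be regenerated with the following sketch, which differs only in whether cyl is wrapped in factor():

```r
library(ggplot2)

# cyl is stored as a number, so color uses a continuous gradient
ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()

# Wrapping cyl in factor() makes it discrete: one color per level
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

levels(factor(mpg$cyl))  # the groups that each receive their own color
```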
This is an illustration of color being treated as either a continuous
or discrete scale. And, analogous to the scales we used for our axes,
these scales are modified with the functions scale_color_continuous()
and scale_color_discrete().
There are primarily two types of continuous color scales we will concern ourselves with, and this will depend upon what we are trying to demonstrate. Generally speaking, there are two possible options:
Roughly corresponding to these two options are two types of color scales readily available for ggplot: viridis and gradient:
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "viridis")
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "gradient")
The viridis scales constitute a set of different color maps that are designed with a few thoughts in mind:
You can read more about viridis scales here.
A range of different viridis scales are provided in ggplot2, though
they are not particularly well documented. You can select a different
scale by passing the additional argument option, which accepts the
letters “A” through “H”. Here are a few for illustration:
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "viridis", option = "A") + ggtitle("Magma")
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "viridis", option = "D") + ggtitle("Viridis")
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "viridis", option = "E") + ggtitle("Cividis")
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "viridis", option = "H") + ggtitle("Turbo")
The gradient color type, on the other hand, gives you a bit more
control. Here, you can specify a high and low value, indicating the two
colors between which the gradient should run. Choosing colors that are
on opposite sides of the color wheel will give you the best contrast.
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
scale_color_continuous(type = "gradient", high = "orange", low = "blue")
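Both continuous color types also have dedicated scale functions, which you are likely to come across when searching for help. The sketch below rebuilds one viridis plot and the gradient plot using scale_color_viridis_c() and scale_color_gradient(); the object names base, p_viridis, and p_gradient are our own.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point()

# Viridis family; option accepts a letter ("A") or a name ("magma")
p_viridis <- base + scale_color_viridis_c(option = "magma")

# Two-color gradient running from the low color to the high color
p_gradient <- base + scale_color_gradient(low = "blue", high = "orange")

p_viridis
p_gradient
```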
A list of the colors provided in R is available here.
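The color names that R recognizes can also be listed from within R itself using the base function colors(), which is handy for checking a name before using it in a plot:

```r
# Names of the built-in colors (base R, no packages needed)
head(colors())
length(colors())

# Check that a name is valid before using it in a plot
c("orange", "steelblue", "goldenrod1") %in% colors()

# Search the names by pattern, e.g. every name containing "blue"
grep("blue", colors(), value = TRUE)
```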
Question 18 For this question, we are going to use another dataset
built into R, USArrests (see ?USArrests). Create a scatter plot using
this data with the urban population on the x-axis and the number of
assaults per 100,000 residents on the y-axis. Then, choose two sensible
colors and add a color gradient corresponding to the murder rate.
Looking at this plot, does it seem that high rates of murder are more
likely to correspond with larger urban populations or with states with
high rates of assault?
While there is an associated scale_color_discrete() function for use
with discrete variables, we will instead use a similar function,
scale_color_brewer(), which comes with a full suite of pre-built
palettes for use with discrete variables. These can be found in the
documentation at ?scale_color_brewer.
The great thing about this is that with minimal effort, we can feel
confident that our colors are going to look good:
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() + scale_color_brewer(palette = "Spectral")
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() + scale_color_brewer(palette = "Set2")
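The palette names accepted by scale_color_brewer() come from the RColorBrewer package. Assuming that package is installed (it is normally pulled in alongside ggplot2, but that is an assumption about your setup), the sketch below lists the available palettes by type and draws them all for comparison.

```r
# Assumption: RColorBrewer is installed (it normally arrives with ggplot2)
pal_info <- RColorBrewer::brewer.pal.info

# Palette names grouped by type: "div", "qual", or "seq"
split(rownames(pal_info), pal_info$category)

# Draw every palette in a single figure
RColorBrewer::display.brewer.all()
```

Qualitative palettes such as "Set2" are generally the natural choice for unordered groups, while sequential and diverging palettes suit ordered categories.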
Although I don’t recommend it, you can also specify your own colors
for different values by passing a named vector to scale_color_manual():
ggplot(mpg, aes(displ, hwy, color = factor(drv))) +
geom_point() +
scale_color_manual(values = c(f = "steelblue", r = "tomato", `4` = "goldenrod1"))
Question 19 This time we are going to use the built-in R dataset
ChickWeight (?ChickWeight). Put Time on the x-axis and weight on the
y-axis, and specify the color aesthetic with Diet. In addition to
adding scatter points, add a layer with geom_smooth(). Finally, use a
different color palette than the default, either a pre-built one with
the brewer function or by selecting the colors manually. By the end of
the study, which diet seemed to result in chicks with the greatest
average weight?
We conclude our lab on ggplot2 with a discussion of a new layer and
associated geom: bar plots and fill. Let’s begin by subsetting our data
to only include those vehicles whose manufacturer is Chevrolet, Dodge,
or Ford. We can check inclusion with the %in% operator, which we will
investigate in more detail later. We then create this bar plot with a
call to geom_bar().
mpg2 <- subset(mpg, manufacturer %in% c("chevrolet", "dodge", "ford"))
ggplot(mpg2, aes(manufacturer)) +
geom_bar()
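The heights of these bars are simply the number of rows for each manufacturer. As a quick check, and as a sketch of what geom_bar() does behind the scenes, we can compare a frequency table against the counts that ggplot2 computes for the layer (p_bar is our own name for the plot object):

```r
library(ggplot2)

mpg2 <- subset(mpg, manufacturer %in% c("chevrolet", "dodge", "ford"))

# %in% returns TRUE for each element found in the right-hand vector
c("ford", "honda") %in% c("chevrolet", "dodge", "ford")

# The same frequencies geom_bar() draws, as a table
table(mpg2$manufacturer)

# The counts ggplot2 computed for the bars
p_bar <- ggplot(mpg2, aes(manufacturer)) +
  geom_bar()
layer_data(p_bar)$count
```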
Along the x-axis, we see the three manufacturers included in our
dataset, and along the y-axis, the frequency with which each of them
appears in our dataset. Suppose we wish to further identify how many of
each type of drive train is included from each manufacturer. We can use
the color aesthetic, but the result is likely not what we are
anticipating, as color only changes the outline of the bars. Instead,
we need to introduce a new aesthetic, fill:
ggplot(mpg2, aes(manufacturer, color = drv)) +
geom_bar()
ggplot(mpg2, aes(manufacturer, fill = drv)) +
geom_bar()
Note that fill is also the aesthetic we would use to fill in our box
plots if we chose to do so.
By default, geom_bar() provides a count of the number of observations
within each group. Once we have specified a (discrete) grouping
variable, we have a few additional options. The default here, again, is
to simply leave the bars stacked with the total frequency provided on
the y-axis. We can modify this with the argument position. Up first, we
consider setting position = "fill", which forces the height of each bar
to sum to 1:
ggplot(mpg2, aes(manufacturer, fill = drv)) +
geom_bar(position = "fill")
Although this gives us no information on the differences between manufacturers, it tells us a great deal about the composition of drive trains within each manufacturer. For example, we see that just a little over 50% of the Chevrolets in our dataset have rear wheel drive, while rear wheel drive makes up just under 50% of Fords, and none of the Dodges.
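The proportions shown in the filled bar plot can be verified numerically. The sketch below uses the base R functions table() and prop.table() to compute the share of each drive train within each manufacturer; the names counts and props are our own.

```r
library(ggplot2)

mpg2 <- subset(mpg, manufacturer %in% c("chevrolet", "dodge", "ford"))

# Counts of each drive train within each manufacturer
counts <- table(mpg2$manufacturer, mpg2$drv)

# Convert to proportions within each row (manufacturer)
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Each row of props sums to 1, matching the heights of the segments in the corresponding bar.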
Another useful position for our bar plots is "dodge", which splits the
different groups and plots their frequencies side-by-side.
ggplot(mpg2, aes(manufacturer, fill = drv)) +
geom_bar(position = "dodge")
What is interesting about this (and perhaps a little off-putting) is
that this preserves the total width for each manufacturer. This results
in the individual drive train bars for Chevrolet all being narrower
than the drive train bars for Dodge and Ford. If we wish instead for
the individual bars to be of equal width, we need to pass a position
function to position, much as we passed an element function when
modifying text. In this case, the function is either
position_dodge(preserve = "single") or
position_dodge2(preserve = "single"); the latter also adds a tiny bit
of space between the bars.
ggplot(mpg2, aes(manufacturer, fill = drv)) +
geom_bar(position = position_dodge(preserve = "single"))
ggplot(mpg2, aes(manufacturer, fill = drv)) +
geom_bar(position = position_dodge2(preserve = "single"))
Question 20 For this question, we are going to use the built-in
dataset in R, ToothGrowth (?ToothGrowth), which measures the length of
odontoblasts (cells responsible for tooth growth) in 60 guinea pigs in
response to the administration of supplemental vitamin C. Three
different doses of the vitamin were given via two different delivery
methods, either orange juice or ascorbic acid. Begin by creating a box
plot with the length of odontoblasts on the x-axis and the dose on the
y-axis. Use the fill aesthetic to indicate supplement type, and use
scale_fill_brewer() to select a different palette for the colors. What
can we learn from this plot? Is a higher dose of vitamin C associated
with increased tooth growth? Did either of the supplements appear any
better than the other?
(Taken from Professor Miller)
Question 21 The code below will load a data set containing 970
Hollywood films released between 2007 and 2011, then reduce these data
to only include variables that could be known prior to a film’s opening
weekend. The data are then simplified further to only include the four
largest studios (Warner Bros, Fox, Paramount, and Universal) in the
three most common genres (action, comedy, drama). You will use the
resulting data (i.e., movies_subset) for this question.
movies = read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
movies_subset = subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") &
Genre %in% c("Action", "Comedy", "Drama"),
select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
"TheatersOpenWeek","Year","OpeningWeekend"))
Your goal in this question is to create a graphic that effectively
differentiates movies with high revenue on opening weekend from those
with low revenue on opening weekend (the variable OpeningWeekend
records this revenue in millions of US dollars). To practice the topics
introduced in this lab, your graphic should include at least 3 of the
following 5 components:
Finally, using the graph you created, write 2-3 sentences explaining the trends you found (i.e., what attributes seem to predict a film having low or high opening weekend revenue).