This lab will be a continuation of ggplot2. Whereas the first lab introduced some of the basics of ggplot, the goals of this lab will be twofold: to become acquainted with some of the thematic elements of a ggplot and to investigate how scaling mediates the process between observed variable and aesthetic mapping. We’ll conclude with some special considerations for bar plots.

Titles and Axes

The most basic ggplot2 plots do not include a title, and the axis labels are taken from the variable names given in aes(). This is the case, for example, in our plot of engine displacement against highway miles per gallon:

library(ggplot2)

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

We can add a title or change the x- and y-axis labels with the functions ggtitle(), xlab(), and ylab(), respectively:

library(ggplot2)

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  ggtitle("Engine size to fuel economy") + 
  xlab("Displacement") + 
  ylab("Fuel Economy (Highway)")

As is typically the case with ggplot2, there are multiple ways to accomplish the same goal. The labs() function allows us to modify multiple labels at once by naming each one as an argument:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  labs(x = "Displacement", y = "Fuel Economy (Highway)", title = "Engine size to fuel economy")

The labs() function also takes arguments for any grouping aesthetics. The argument name is the same as what is used for creating the groups, and changing these will make corresponding changes in the legend:

## Without label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) + 
  geom_point() 

## With label
ggplot(mpg, aes(displ, hwy, shape = factor(cyl))) + 
  geom_point() +
  labs(shape = "Cylinders") # Since we used shape aesthetic, we use "shape" here

Question 11: Using the mpg dataset, create a boxplot with class on the x-axis and cty on the y-axis. Add a color aesthetic that accounts for year. Create appropriate labels for the axes, title, and legend. (Note: you should use factor() to turn year into a categorical value)

Themes

The collection of non-data elements of your plot, including the appearance of titles, labels, legends, tick marks, and lines, makes up what is known as the theme. Elements related to the theme are modified with the theme() function; a quick look at ?theme demonstrates how comprehensive this list can be. Here, however, we consider only a small subset of these items to demonstrate how the process works.

The system for modifying themes consists of two components:

  1. The elements that are being modified (e.g., text, tick marks, legend)
  2. The element functions associated with each element that control the visual properties.

For example, elements consisting of text are modified with the element function element_text(). See ?element_text() for a list of the qualities that can be modified. To motivate an example, consider a plot from the previous lab:

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot()

Because of the width of our figure, all of the labels on the x-axis are bunched together. To remedy this, we need to (1) identify which element we want to modify and (2) determine which properties we want to change. ?theme contains a list of all of the elements.

In this case, as we are looking at text on the x axis, our element is called axis.text.x. The first attribute we’ll try modifying is the angle to see if rotating the text will solve the issue. We do this using the element_text() function:

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45))

Here, we see that the rotation has helped with the overlapping, but now the text is running into our plot. We can further alter the vertical adjustment with vjust:

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

In general, this strategy of searching for thematic elements in theme() and modifying attributes is a useful one. Also worth knowing is element_blank(), which will remove an element from the plot altogether.
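
As a brief illustration of element_blank(), the sketch below removes the minor grid lines and the x-axis tick marks from the boxplot above. The two elements chosen here (panel.grid.minor and axis.ticks.x) and the name p_blank are only examples; any element listed in ?theme can be treated the same way.

```r
library(ggplot2)

# Boxplot from above, with two theme elements removed entirely
p_blank <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5),
        panel.grid.minor = element_blank(), # no minor grid lines
        axis.ticks.x = element_blank())     # no tick marks on the x-axis
p_blank
```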

Question 12: For this question, use the code in the block below. To the plot that is generated, modify the following:

  1. Modify the plot.title by changing its color to red and writing it in italics.
  2. Change the size of the title and the text in the legend (two separate things) so that they are much bigger than the default.

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point() 

Pre-built themes

While we are able to modify specific elements of the theme within the theme() function, there are a number of pre-built templates to get us started. We “add” them to our plots just as we did before. Here, we consider a black and white theme, generated by adding theme_bw():

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point() + 
  labs(x = "Displacement", y = "Highway", title = "Snappy title", color = "Cylinders") +
  theme_bw() # adds a black and white theme

Commonly used pre-built themes:

  • theme_bw()
  • theme_linedraw(), theme_light() and theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

You can judge the differences in these themes below:

Any theme can be further customized using theme(), though note that if you add a theme template after making changes with theme(), those changes will likely be overwritten. Adding theme() after the template, on the other hand, will modify the template.
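
The effect of ordering can be checked directly. In this sketch, aspect.ratio = 1 (a square plotting panel) is an arbitrary setting chosen only because it is easy to inspect, and base_plot, p_kept, and p_lost are illustrative names.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# theme() after the template: the template is modified
p_kept <- base_plot + theme_bw() + theme(aspect.ratio = 1)

# Template after theme(): the earlier setting is replaced by the template
p_lost <- base_plot + theme(aspect.ratio = 1) + theme_bw()

# Inspect the stored setting in each plot
p_kept$theme$aspect.ratio # still set
p_lost$theme$aspect.ratio # no longer set
```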

Question 13: Create a ggplot that includes either a color or shape aesthetic, with appropriately labeled axes, legend, and title. Add any of the pre-built themes shown above. Then, using the theme function, further modify the plot so that the legend position is on the bottom (Hint: ?theme)

Scales

From our discussion in the previous lab, we know that aesthetics are responsible for mapping the data to the visual aspects of the plot. The specific details of how this mapping occurs are contained within the concept of scales. Scales, for example, are responsible for determining the length of the x-axis or the specific colors and shapes generated by an aesthetic. Here we are going to limit our discussion to the axes and colors, but the general principles hold for all of the aesthetics generated by ggplot2.

A major concept that will be critical to keep in mind during this section is the distinction between continuous and discrete values. Continuous values are those that exist on a spectrum without gaps (for our purposes, this includes integers), while discrete values are those that take on a limited (and generally small) number of unique values. In the mpg dataset that we have been using so far, the highway fuel economy, hwy, would be an example of a continuous variable, while the class of vehicle, class, would be an example of a discrete variable.

Modifying the axes

Whenever an aesthetic is added to a ggplot, an associated scale is created behind the scenes. For example, as both the vehicle displacement and highway fuel economy are continuous variables, scales for both the x and y axes are made continuous. We can see that when we explicitly add these scales to the plot, nothing changes:

## Scales created behind the scenes
ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

## That is the exact same as this
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  scale_x_continuous() + # creates continuous x axis 
  scale_y_continuous() # creates continuous y axis

The same thing occurs when one of the variables is discrete

## Scales created behind the scenes
ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()

## One of these is now discrete
ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot() + 
  scale_x_discrete() + 
  scale_y_continuous()

If we were to try and add a scale that did not match the variable type, we would get an error

## x is discrete but we try to include continuous, resulting in error
ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot() + 
  scale_x_continuous() + 
  scale_y_continuous()
## Error in `scale_x_continuous()`:
## ! Discrete values supplied to continuous scale.
## ℹ Example values: "f", "f", "f", "f", and "f"

Again, because these scale terms are added automatically behind the scenes, we never have to worry about including them specifically unless we wish to change something about them. For now, we will only concern ourselves with breaks, labels, and transformations.

Breaks and labels

Breaks and labels refer to the tick marks on the x and y axes. Breaks refer to the actual location on the axes we wish to have marks, while labels refer to the labels of the breaks. Each of these takes a vector argument, and if both are provided, they must be the same length:

ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  scale_x_continuous(breaks = c(2, 4, 4.5, 7), labels = c("2", "4", "4 & 1/2", "7"))
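
Since breaks and labels must have the same length, one way to keep them in sync is to build the labels from the breaks. The sketch below is one such approach; the names x_breaks and x_labels and the " L" suffix (displacement in mpg is recorded in liters) are illustrative choices.

```r
library(ggplot2)

x_breaks <- seq(2, 7, by = 1)      # tick locations: 2, 3, ..., 7
x_labels <- paste0(x_breaks, " L") # one label per break

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(breaks = x_breaks, labels = x_labels)
```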

By default, the scale only covers the range of the data, so breaks and labels falling outside of that range are not drawn. So, for example, displacement ranges from 1.6 to 7, and any breaks outside of this range will be ignored.

range(mpg$displ)
## [1] 1.6 7.0
## Because 0 and 8 are not in range, they are ignored
ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  scale_x_continuous(breaks = c(0, 2, 4, 4.5, 7, 8), labels = c("0", "2", "4", "4 & 1/2", "7", "8"))

If we do want to include breaks outside of our range, we can do so by adding a limits argument to our scale function, which takes new minimum and maximum values. This is often useful if we wish to include zero in our plot, even if zero is not within the range of the data.

## Increase range of x axis to include 0 and 10
ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  scale_x_continuous(breaks = c(0, 2, 4, 4.5, 7, 8), 
                     labels = c("0", "2", "4", "4 & 1/2", "7", "8"), 
                     limits = c(0, 10))

In this way, we see that modifying scale_x_continuous() allows us to extend the scale beyond the aesthetic mapping implied by the data.
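
The reverse also holds, with a caveat: limits narrower than the data discard the observations that fall outside of them (ggplot2 reports the removed rows in a warning when the plot is drawn). As a sketch, the limits c(2, 5) below are arbitrary and the object names are illustrative; coord_cartesian() is shown as an alternative that zooms in on a region without dropping any data.

```r
library(ggplot2)

# Number of vehicles whose displacement lies outside the interval [2, 5]
n_outside <- sum(mpg$displ < 2 | mpg$displ > 5)
n_outside

# Scale limits: points outside [2, 5] are removed from the plot
p_limits <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(2, 5))

# Coordinate limits: same window, but all of the data is retained
p_zoom <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(2, 5))
```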

When the values are discrete, rather than continuous, the breaks cannot be adjusted as each tick corresponds to a different group. We can, however, change the labels of these groups with a named vector. The names of the vector must correspond to the names of the group. So, for example, knowing that the drv variable has categories “r”, “f”, and “4”, we can make adjustments as such:

ggplot(mpg, aes(hwy, drv)) +
  geom_boxplot() +
  scale_y_discrete(labels = c(r = "Rear", f = "Forward", `4` = "4WD"))

A few things to note from this last plot:

  1. Because drv is discrete and on the y-axis, we need to be sure to use the y scale for discrete variables
  2. Names in a named vector usually consist of letters and numbers. When special characters are used (such as hyphens), or when the name is just a number (as in the case above, where the name was just “4”), we must enclose the name in backticks `` so that R knows what to do with it
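
To see how the backticks behave, it can help to build the named vector on its own and inspect its names. In this sketch, drv_labels and drv_quoted are illustrative names; quoting the name ("4" = ...) is an equivalent alternative to backticks.

```r
library(ggplot2)

# Backticks and quotes both produce a name of "4"
drv_labels <- c(r = "Rear", f = "Forward", `4` = "4WD")
drv_quoted <- c(r = "Rear", f = "Forward", "4" = "4WD")

names(drv_labels)                 # the names ggplot2 matches against drv
identical(drv_labels, drv_quoted) # both spellings give the same vector

# Every category in drv should have a matching name
all(unique(mpg$drv) %in% names(drv_labels))
```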

Question 14: Write code to recreate the plot below as closely as possible. In particular, consider themes, breaks, and labels.

Transformations

This last section on axis scales involves transformations, which are generally only associated with continuous variables. These are applied with the trans argument of the scale function (in ggplot2 3.5.0 and later this argument is named transform, and trans is deprecated). For example, if we wish to plot the relationship between displacement and fuel economy with displacement in descending order, we can do this by reversing the relevant axis:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  scale_x_continuous(trans = "reverse")

Other transformations help us identify trends in data that span very different orders of magnitude. For example, consider this contrived dataset in which each observation is 10 times larger than the last. This makes it difficult to see any meaningful relationship earlier in the plot. By putting the y-axis on a \(\log_{10}\) scale, we are better able to see what is going on:

df <- data.frame(x = 1:10, # 1 - 10
                 y = 10^(1:10)) # 10^1 - 10^10
df
##     x           y
## 1   1          10
## 2   2         100
## 3   3        1000
## 4   4       10000
## 5   5      100000
## 6   6     1000000
## 7   7    10000000
## 8   8   100000000
## 9   9  1000000000
## 10 10 10000000000
## Very difficult to see this relationship for smaller values
ggplot(df, aes(x, y)) +
  geom_point()

## On a more appropriate scale, we see the relationship is linear
ggplot(df, aes(x, y)) +
  geom_point() + ylab("log 10 scale") +
  scale_y_continuous(trans = "log10")
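
A quick way to convince ourselves that the transformed relationship is linear is to apply the transformation by hand. The sketch below rebuilds the contrived data under a separate, illustrative name (df_log); scale_y_log10() is a shorthand for the same log transformation.

```r
library(ggplot2)

df_log <- data.frame(x = 1:10, y = 10^(1:10))

# log10(y) recovers x, so the points fall on a straight line
log10(df_log$y)

# Shorthand for a log10-transformed y axis
p_log <- ggplot(df_log, aes(x, y)) +
  geom_point() +
  scale_y_log10()
p_log
```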

Colors

Scales mediate the mapping from variables to color aesthetics in the same way that they mediate the mapping from discrete and continuous variables to their respective axes.

Consider the last lab, for example, in which we plotted the relationship between displacement and highway miles colored by cylinder. When cyl was stored as a numeric (or integer) vector, the resulting color scale was continuous, taking all values between dark and light blue. However, once we wrapped cyl in factor(), the color scale became discrete, offering four distinct colors to represent our groups.

This illustrates color being treated as either a continuous or a discrete scale. And, analogous to the scales we used for our axes, these scales are modified with the functions scale_color_discrete() and scale_color_continuous().
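
As a reminder, a minimal version of the two plots from the previous lab can be sketched as follows (the object names are illustrative). The only difference between them is whether cyl is passed as a number or wrapped in factor().

```r
library(ggplot2)

class(mpg$cyl)          # stored as a number, so the color scale is continuous
levels(factor(mpg$cyl)) # as a factor, there is one level per group

# Continuous color scale
p_cont <- ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()

# Discrete color scale
p_disc <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()
```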

Discrete color scales

The first thing to know about scale_color_discrete() is that, in practice, scale_color_brewer() is often used in its place. It comes with a full suite of pre-built palettes for use with discrete variables (see ?scale_color_brewer). The great thing about this is that, with minimal effort, we can feel confident that our colors are going to look good:

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() + scale_color_brewer(palette = "Spectral")

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() + scale_color_brewer(palette = "Set2")

Although I don’t recommend it, you can also specify your own colors for different values by passing a named vector to scale_color_manual():

ggplot(mpg, aes(displ, hwy, color = factor(drv))) +
  geom_point() + 
  scale_color_manual(values = c(f = "steelblue", r = "tomato", `4` = "goldenrod1"))

One useful trick for making colors really pop is to change the shape aesthetic of points made by geom_point(). Examples of different shapes are given here, but the one we are interested in specifically is shape 21. Although this shape does have a color aesthetic, it works more like a bar plot – the color sets the outline, while fill sets the fill. This creates a nice contrast around the edge of the points

ggplot(mpg, aes(displ, hwy, fill = factor(drv))) +
  geom_point(shape = 21, color = "black", size = 2) + # Fill defined in aes() 
  scale_fill_manual(values = c(f = "steelblue", r = "tomato", `4` = "goldenrod1"))

Question 15: This time we are going to use the built-in R dataset ChickWeight (?ChickWeight). Put Time on the x-axis and weight on the y-axis, and specify the color aesthetic with Diet. Add a layer with geom_smooth(). Finally, use a different color palette than the default, either a pre-built one with the brewer function or by selecting the colors manually. By the end of the study, which diet seemed to result in chicks with the greatest average weight?

Continuous color scales

The type of continuous color scale we choose depends on what we are trying to demonstrate. Generally speaking, there are two options:

  1. sequential scales - most useful in distinguishing high values from low values
  2. diverging scales - used to put equal emphasis on both high and low ends of the data range

Two types of continuous color scale are readily available through scale_color_continuous(): gradient and viridis. Both are sequential by default; diverging scales are available separately through scale_color_gradient2().

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "gradient")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis")

The gradient color type is pretty straightforward, though the colors are typically specified manually (which can be tricky to get looking nice). You can specify a high and a low color, which set the two ends of the gradient. Choosing colors that are on opposite sides of the color wheel will give you the best contrast.

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "gradient", high = "orange", low = "blue")

A list of the colors provided in R is available here.
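
The high and low arguments above produce a two-color sequential gradient. For a diverging scale, one option is scale_color_gradient2(), which adds a mid color anchored at a midpoint. This is only a sketch: using the median of cty as the midpoint, and these three colors, are assumptions made for illustration.

```r
library(ggplot2)

# Center the diverging scale on the median city fuel economy
mid_cty <- median(mpg$cty)

p_div <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_gradient2(low = "blue", mid = "grey90", high = "orange",
                        midpoint = mid_cty)
p_div
```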


Alternatively, the viridis scales constitute a set of different color maps that are designed with a few thoughts in mind:

  1. Colorful with a wide palette, making differences easy to see
  2. Perceptually uniform, so that values close together have similar colors
  3. Robust to colorblindness, and they also hold up well when printed in black and white

You can read more about viridis scales here.

A range of different viridis scales is provided in ggplot2, though they are not particularly well documented. You can select among them by passing the additional argument option, which takes the values “A” through “H”. Here are a few for illustration:

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "A") + ggtitle("Magma")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "D") + ggtitle("Viridis")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "E") + ggtitle("Cividis")

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "H") + ggtitle("Turbo")
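
The viridis scales can also be requested directly with scale_color_viridis_c() (the _c suffix is for continuous data), which accepts the palette name in place of its letter. A sketch, using the "magma" palette and an illustrative object name:

```r
library(ggplot2)

p_magma <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_viridis_c(option = "magma") + # same palette as option = "A"
  ggtitle("Magma")
p_magma
```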

Question 16: For this question, we are going to use another dataset built into R, USArrests (see ?USArrests). Create a scatter plot using this data with the urban population on the x-axis and the number of assaults per 100,000 residents on the y-axis. Then, choose two sensible colors and add a color gradient corresponding to the murder rate. Looking at this plot, does it seem that high rates of murder are more likely to correspond with larger urban populations or with high rates of assault?

## Load data
data("USArrests")

Bar Plot Considerations

Positioning Barplots

We conclude our lab on ggplot2 with a discussion of a new geom and an associated aesthetic: bar plots and fill. Let’s begin by subsetting our data to only include those vehicles whose manufacturer is Chevrolet, Dodge, or Ford. We can check inclusion with the %in% operator. We then create this bar plot with a call to geom_bar().

library(dplyr)
mpg2 <- filter(mpg, manufacturer %in% c("chevrolet", "dodge", "ford"))

ggplot(mpg2, aes(manufacturer)) + 
  geom_bar()

Along the x-axis, we see the three manufacturers included in our dataset, and along the y-axis, the frequency with which each of them appears. Suppose we wish to further identify how many of each type of drive train is included from each manufacturer. We can use the color aesthetic, but the result is likely not what we are anticipating, as it only colors the outline of each bar. Instead, we need to introduce a new aesthetic, fill:

ggplot(mpg2, aes(manufacturer, color = drv)) + 
  geom_bar()

ggplot(mpg2, aes(manufacturer, fill = drv)) + 
  geom_bar()

Note that fill is also the aesthetic we would use to fill in our box plots if we chose to do so.

By default, geom_bar() provides a count of the number of observations within each group. Once we have specified a (discrete) grouping variable, we have a few additional options. The default here, again, is to simply leave the bars stacked with the total frequency provided on the y-axis. We can modify this with the argument position. Up first, we consider setting position = "fill", which forces the height of each bar to sum to 1:

ggplot(mpg2, aes(manufacturer, fill = drv)) + 
  geom_bar(position = "fill")

Although this gives us no information on the differences between manufacturers, it tells us a great deal about the composition of drive trains within each manufacturer. For example, we see that just a little over 50% of the Chevrolets in our dataset have rear wheel drive, while rear wheel drive makes up just under 50% of Fords and none of the Dodges.
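
The proportions read off the plot can be verified numerically. The sketch below uses dplyr to count each manufacturer and drive train combination and convert the counts into within-manufacturer proportions; drv_props is an illustrative name.

```r
library(ggplot2)
library(dplyr)

mpg2 <- filter(mpg, manufacturer %in% c("chevrolet", "dodge", "ford"))

drv_props <- count(mpg2, manufacturer, drv) %>% # n = vehicles per combination
  group_by(manufacturer) %>%
  mutate(prop = n / sum(n)) %>%                 # proportions sum to 1 per manufacturer
  ungroup()
drv_props
```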

Another useful position for our bar plots is "dodge", which splits the different groups and plots their frequency side-by-side.

ggplot(mpg2, aes(manufacturer, fill = drv)) + 
  geom_bar(position = "dodge")

What is interesting about this (and perhaps a little off-putting) is that this preserves the total width for each manufacturer. This results in the individual drive train bars for Chevrolet all being narrower than those for Dodge and Ford. If we wish instead for every bar to have equal width, we need to pass a position function to the position argument rather than a string. In this case, the function is position_dodge(preserve = "single") or position_dodge2(preserve = "single"); the latter also adds a tiny bit of space between the bars.

ggplot(mpg2, aes(manufacturer, fill = drv)) + 
  geom_bar(position = position_dodge(preserve = "single"))

ggplot(mpg2, aes(manufacturer, fill = drv)) + 
  geom_bar(position = position_dodge2(preserve = "single"))

Bar plotting summarized data

Ok, we actually conclude our section with a special consideration: categorical data that has already been summarized. Consider the following constructed dataset consisting of 100 observations, each assigned to either group \(A\) or group \(B\):

## Create simulated data with seed
set.seed(69)
df <- data.frame(subject = 1:100, 
                 group = sample(c("A", "B"), size = 100, 
                                replace = TRUE, 
                                prob = c(0.3, 0.7)))
head(df)
##   subject group
## 1       1     B
## 2       2     A
## 3       3     B
## 4       4     A
## 5       5     B
## 6       6     A

In this data.frame, each row is an observation. Creating a bar plot with this is straightforward:

ggplot(df, aes(group)) + geom_bar()

If we read carefully through ?geom_bar and its stat argument, we see that this works by counting the observations in each group and plotting the totals. But what if our data has already been counted?

library(dplyr)
df_summary <- group_by(df, group) %>% 
  summarize(N = n())

df_summary
## # A tibble: 2 × 2
##   group     N
##   <chr> <int>
## 1 A        29
## 2 B        71

What we need is a way to communicate that our groups should be taken from group and our totals from N. However, this will not work by default, as geom_bar() counts rows itself and therefore expects only one of the x or y aesthetics.

ggplot(df_summary, aes(x = group, y = N)) + geom_bar()
## Error in `geom_bar()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `setup_params()`:
## ! `stat_count()` must only have an x or y aesthetic.

By default, the statistic associated with geom_bar() is count (that is, count the number of observations). By including a y aesthetic and changing the stat argument to stat = "identity", we are able to directly instruct ggplot as to the appropriate y value.

ggplot(df_summary, aes(x = group, y = N)) + geom_bar(stat = "identity")
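
ggplot2 also provides geom_col(), which uses the identity statistic by default and so expects both an x and a y aesthetic. The sketch below enters the summarized counts from the table above by hand (so that it runs on its own, under the illustrative name df_counts) and checks that the two approaches produce bars of the same height.

```r
library(ggplot2)

# Counts copied from the df_summary output shown earlier
df_counts <- data.frame(group = c("A", "B"), N = c(29, 71))

p_bar <- ggplot(df_counts, aes(x = group, y = N)) + geom_bar(stat = "identity")
p_col <- ggplot(df_counts, aes(x = group, y = N)) + geom_col()

# Compare the bar heights computed for each layer
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```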

Question 17: Below is the majors dataset, containing demographic information on a number of college majors.

majors <- read.csv("https://collinn.github.io/data/majors.csv")

Subset the dataset to only include those in the business category. Then create a bar plot showing the number of individuals with each major, given by the variable Workforce. Make whatever modifications to the plot and axes are needed to make it presentable.

Practice Problems

Question 18: The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data for this question.

movies <- read.csv("https://collinn.github.io/data/hollywood.csv")
movies <- filter(movies, Genre %in% c("Action", "Comedy", "Drama"),
                 LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal")) %>% 
  select("Movie", "LeadStudio", "Story", "Genre","Budget",
         "TheatersOpenWeek","Year","OpeningWeekend")

With your group, create 2-3 publication ready graphics that effectively differentiate movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable OpeningWeekend records this revenue in millions of US dollars). Write a few sentences comparing the plots you created and any trends that you found.

Question 19: The data frame diamonds, like mpg, is included in the ggplot2 package. This data records the attributes of several thousand diamonds sold by a wholesale online retailer. Your goal is to recreate the graph shown below as closely as possible. A few hints:

  • Pay attention to scales, theme, and labels
  • Find what transformations are available for continuous x axis
  • The argument alpha = 0.3 is used in one of the layers to give each point 30% opacity
  • Default colors are used

library(ggplot2)
data("diamonds")
## Generating code
ggplot(diamonds, aes(carat,  price, color = color)) +
  geom_point(alpha = 0.3) + theme_bw() + scale_x_continuous(trans = "log2") +
  labs(x = "Carat", y = "Sale Price ($)", color = "Color Grade", title = "Diamond Sales") + theme(legend.position = "bottom")