Copy and paste the code below into a code chunk at the top of your document:

library(ggplot2)
library(dplyr)

# Make graphs less ugly for ggplot2
theme_set(theme_bw())

# Get mpg dataset from ggplot2 package and modify variables
mpg <- as.data.frame(mpg)
mpg <- mutate(mpg, cyl = factor(cyl), 
              year = factor(year))

# Load majors data
majors <- read.csv("https://remiller1450.github.io/data/majors.csv")

# College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

The majors dataset contains salary data for various college majors, based on the results of the 2022 American Community Survey. The data were originally obtained here.
The college dataset was used in the course notes.
The mpg dataset is described in more detail below.

Grammar of Graphics

The ggplot2 package is intended to follow the “grammar of graphics”, a coherent philosophy of independent components that can be added together to build a plot. Every ggplot is made up of three components:

Data – the information we wish to visually represent
Aesthetics – A mapping between the data and visual properties, e.g., the x and y axes, color, etc.
Layers – How the observations in our data are rendered, e.g., scatter plot, box plot, etc.

This lab is an introduction to these components and how they relate to the construction of high-quality, publication-ready graphics.

ggplot(majors, aes(x = Per_Male, y = Bach_Med_Income, color = Category)) +
  geom_point()

head(majors[, c(1, 2, 4, 22)])

##                     Major Category Per_Male Bach_Med_Income
## 1       Computer Science   Science     75.1          101600
## 2            Engineering   Science     81.3           90490
## 3      Civil Engineering   Science     77.7           94370
## 4 Electrical Engineering   Science     84.5          106600
## 5 Mechanical Engineering   Science     88.1           99180
## 6              Mathmatics  Science     57.0           76350

Data

For this lab, one of the datasets we will be using is the mpg dataset included within the ggplot2 package. This dataset contains a subset of fuel economy data that the EPA collected for various car models in 1999 and 2008. We begin with a quick visual inspection of the variables in our dataset:

head(mpg)

##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

A description of all of the variables can be read with ?mpg.

Aesthetics

The aesthetics of a ggplot2 plot (which we simply call a “ggplot”) establish the most general properties relating our collected data to the visual representation we wish to create. Specifically, the aesthetics create a map from the variables in the data to visual properties of the graph. This will be an important point to keep in the back of our minds as our aesthetics become increasingly detailed.

For example, suppose we wish to specify that we want the cty (city mpg) to be represented on the x-axis of our plot. The general syntax is as follows:

ggplot(mpg, aes(x = cty))

To include a variable on the y-axis, we would also specify this in the aesthetics. Here, we create a blank plot that has city mileage on the x-axis and highway mileage on the y-axis:

ggplot(mpg, aes(x = cty, y = hwy))

Every ggplot will begin this way: we have the name of the function (ggplot()), the data that is being used (mpg), and a set of variables that we wish to include in the aesthetics (aes(x = cty, y = hwy)). Recall that the names of the variables in a dataset can be found in the Environment Panel. Of course, not every plot will require both axes – a histogram, for example, only needs an x-axis.
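As a minimal sketch of the one-axis case, the code below maps only cty to the x-axis and lets a histogram layer (layers are introduced in the next section) compute the counts shown on the y-axis. The object name p_hist and the choice of 20 bins are illustrative assumptions, not part of the lab.

```r
library(ggplot2)

# Only the x aesthetic is mapped; geom_histogram() computes
# the bar heights (counts) for the y-axis on its own
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(bins = 20)

# Printing the object draws the plot
p_hist
```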

We will see additional interesting components that we can add to the aesthetics shortly.

Layers

The actual visual components that we wish to include in the plot are given in layers, which we add to the plot with +. The layers in ggplot are often referred to as “geoms” (for geometry) and are added after the call to ggplot(). For example, here we add the “point” geom to create a scatter plot:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

Note that this is the same plot that we had created above, except now we have added a layer of points.

Most (but not all) layers that we might add in ggplot begin with geom_ to indicate that they can be thought of as the actual geometric elements (lines, points, etc.) that you will see on the plot. In addition to geom_point(), there are a few others that are worth being aware of:

geom_point - scatter plot
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoothed trend line for data
geom_boxplot - box plot
geom_histogram - histogram
geom_violin - violin plot
Additional Aesthetics

While the x and y axes are the most common aesthetics, other visual properties, such as color, can also be mapped to variables in the data. Here, we consider the majors dataset as an example, plotting the percentage of men in a field against the median income for those holding a bachelor’s degree:

ggplot(majors, aes(Per_Male, Bach_Med_Income)) + 
  geom_point()

Consider the additional information we see when we include color in the aesthetics:

## Adding color = Category
ggplot(majors, aes(Per_Male, Bach_Med_Income, color = Category)) + 
  geom_point()

This is an example of a multivariate plot; it allows us to succinctly describe the relationship between 3 or more variables.

Question 1: Looking at the multivariate plot above, what additional information is gleaned by including Category as one of the variables?

This lab will be divided into three sections: creating univariate plots, bivariate plots, and then exploring multivariate plots. At the end of the lab is a (mostly) comprehensive collection of examples. These examples include different arguments that we can add to a ggplot to change, for example, how the bar charts are organized or to modify the number of bins in a histogram.

Problem Sets

Part 1 - Univariate Plots

Question 2: The class variable in the mpg dataset details the “type” of car for each observation (e.g., pickup, SUV). Create the appropriate graph to show the distribution of this variable and comment on which class appears to be the most frequent.

Question 3: The cty variable in the mpg dataset details the average fuel economy for the vehicle when driven in the city. Create the appropriate graph to represent this variable. Does this variable appear to be symmetric or skewed?

Part 2 - Bivariate Plots

Question 4: Using the mpg dataset, we are interested in determining whether front-wheel drive, rear-wheel drive, or four-wheel drive vehicles have better highway fuel economy. Create the appropriate plot to show the distribution of highway fuel economy for each of these types of vehicles. Which appears to have the best fuel economy?

Question 5: In the majors dataset, the variable Per_Masters describes the percentage of the workforce within a particular field whose highest degree is a master’s degree. Create a graph to assess whether or not there appears to be a relationship between the percentage of individuals who hold a master’s degree and the percent that is unemployed. What do you find?

Question 6: Using the mpg dataset, create a plot to find which vehicle class produces the largest number of vehicles with 6 cylinder (cyl) engines. Which produces the largest proportion of 6 cylinder engines?

Part 3 - Multivariate Plots

Question 7: First, using the college dataset, create a plot demonstrating the relationship between the region of the schools and their admission rates. Then, use either color/fill or faceting to add the Private variable to the plot. Now create the same plot again, this time adding color if you previously faceted, or faceting if you previously added color. Between these two plots, which appears the most useful in describing the relationship between these three variables?

Question 8: Below is code that reads in the tips data used in the Examples section of this lab. First identify the four variables from the tips data that are used to make this plot, and then write code to reproduce it. (This plot uses faceting, which has examples at the end of the lab.)

tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")

Question 9: The code below will load a dataset containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data for this question.

## Read in data
movies <- read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")

## Simplify data
movies <- subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") & 
                               Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
                                  "TheatersOpenWeek","Year","OpeningWeekend"))

Your goal in this question is to create a graphic that effectively differentiates movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable OpeningWeekend records this revenue in millions of US dollars).

Your plot should include at least three variables from the dataset, either by including them in the axes, through color or fill, or through faceting. Finally, using the graph you have created, write 2-3 sentences explaining what trends you found and what attributes seem to predict a film having low or high opening weekend revenue.

Examples

These examples will be done with the tips dataset, loaded in below:

## Read in the "Tips" data
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")

Histograms

We can change the number of bins with the bins argument. Feel free to play around with different numbers until you get what you want.

ggplot(tips, aes(x = Tip)) + geom_histogram(bins = 15)

I think the default colors here are incredibly ugly, so I usually add lines and a slightly less offensive color to histograms in one variable

ggplot(tips, aes(x = Tip)) + geom_histogram(bins = 15, color = 'black', fill = 'gray')

Boxplots

ggplot(tips, aes(x = Tip)) + geom_boxplot()

We can add groups by passing an additional categorical variable to aes():

ggplot(tips, aes(Tip, y = Day)) + geom_boxplot()

We can add more groups by including a color or fill aesthetic:

ggplot(tips, aes(Tip, y = Day, fill = Smoker)) + geom_boxplot()

ggplot(tips, aes(Tip, y = Day, color = Sex)) + geom_boxplot()

Scatter plots

A basic scatter plot requires only two quantitative variables

ggplot(tips, aes(TotBill, Tip)) + geom_point()

We can add color to this plot to represent a third variable: note that categorical variables will create a collection of “discrete” colors, while adding a quantitative variable will give us color on a scale

# Categorical variable added
ggplot(tips, aes(TotBill, Tip, color = Time)) + geom_point()

# Quantitative variable added
ggplot(tips, aes(TotBill, Tip, color = Size)) + geom_point()

Bar plots

The simplest bar plot shows the distribution of a single categorical variable:

ggplot(tips, aes(Time)) + geom_bar()

Stacked

When adding additional categorical variables, the default behavior is to give a stacked bar chart

ggplot(tips, aes(Time, fill = Sex)) + geom_bar()

Note that for bar charts our second aesthetic will almost always be fill =. This tells ggplot to “fill in” with the color of the group. Try doing color = Sex to compare

Dodge

We can change to a dodged bar chart by passing an argument to geom_bar()

ggplot(tips, aes(Time, fill = Sex)) + geom_bar(position = "dodge")

If some combination of the two grouping variables has zero observations, the default behavior for dodge looks weird. Here’s an example from the mpg dataset:

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "dodge")

Compare the width of the bars for 4 cylinder, 5 cylinder, and 8 cylinder; the default behavior is to fill in the missing groups by making the bars wider. It’s a bit more typing to avoid this behavior, but it can be done

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = position_dodge(preserve = "single"))

Filled charts (conditional)

Finally, we can get the filled/conditional charts by using position = "fill"

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "fill")
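To see what the filled chart is displaying, the conditional proportions can also be computed directly. The following is a rough sketch using dplyr, which is loaded at the top of the lab; the names cyl_drv_props and prop are made up for this illustration.

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

# Count each cyl/drv combination, then convert the counts
# to proportions within each value of cyl
cyl_drv_props <- mpg %>%
  count(cyl, drv) %>%
  group_by(cyl) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

cyl_drv_props
```

Within each value of cyl the proportions add up to 1, which is why every bar in the filled chart has the same height.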

Faceting

“Faceting” is a process whereby we split a plot up by a grouping or categorical variable, allowing us to see two (or more) side-by-side plots. This can be a handy way to view the relationship between two variables separately, conditioned on a third.

ggplot(tips, aes(TotBill, Tip)) + 
  geom_point() + 
  facet_wrap(~Sex)

ggplot(tips, aes(TotBill, Tip)) +
  geom_point() + 
  facet_wrap(~Day)

The syntax is facet_wrap(~var_name) and can be added with + at the end of the plot
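When the faceting variable has many levels, the panels wrap onto several rows. As a hedged sketch, the example below facets the mpg data by class and uses the nrow argument of facet_wrap() to request two rows of panels; the choice of dataset, variable, and the name p_facets are assumptions for illustration only.

```r
library(ggplot2)

# One panel per vehicle class, arranged in two rows
p_facets <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~class, nrow = 2)

p_facets
```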

Extra Details

Here are some extra examples of things you may be interested in modifying in your plot.

Labels and title

We can add labels and a title with the labs() function (short for labels). The arguments to change labels are shown below

ggplot(tips, aes(TotBill, Tip)) + 
  geom_point() +
  labs(x = "Total Bill", y = "Tip Amount", title = "My neat plot")
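The labs() function also accepts the names of other aesthetics, which sets the legend title for that aesthetic. The sketch below shows this with the mpg data; the label text and the name p_labels are placeholders chosen for the example.

```r
library(ggplot2)

# color = inside labs() renames the legend for the color aesthetic
p_labels <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  labs(x = "City MPG", y = "Highway MPG", color = "Drive train",
       title = "Fuel economy by drive train")

p_labels
```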

Colors

This is (mostly) just for fun. If we use either a color or fill aesthetic in our plots, we can change the colors

# Original
ggplot(tips, aes(TotBill, Tip, color = Day)) + 
  geom_point()

# Using scale_color_brewer
ggplot(tips, aes(TotBill, Tip, color = Day)) + 
  geom_point() + 
  scale_color_brewer(palette = "Set2")

The syntax is nearly identical for fill using scale_fill_brewer()

ggplot(mpg, aes(cyl, fill = drv)) + 
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Accent")

A list of additional palettes can be found with ?scale_fill_brewer or ?scale_color_brewer
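The palette names can also be listed from within R. This sketch assumes the RColorBrewer package is installed (it is normally installed alongside ggplot2 as a dependency); palette_info and qual_palettes are names introduced only for this example.

```r
# Table describing every ColorBrewer palette; the row names are the
# values that can be passed to the palette argument
palette_info <- RColorBrewer::brewer.pal.info
rownames(palette_info)

# Only the qualitative palettes, which suit unordered categories
qual_palettes <- rownames(palette_info)[palette_info$category == "qual"]
qual_palettes
```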

Introduction to ggplot2