ggplot2
Copy and paste the code below into a code chunk at the top of your document:
library(ggplot2)
library(dplyr)
# Make graphs less ugly for ggplot2
theme_set(theme_bw())
# Get mpg dataset from ggplot2 package and modify variables
mpg <- as.data.frame(mpg)
mpg <- mutate(mpg, cyl = factor(cyl),
              year = factor(year))
# Load majors data
majors <- read.csv("https://remiller1450.github.io/data/majors.csv")
# College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
The majors dataset contains salary data for various college majors based on the results of the 2022 American Community Survey. The data were originally obtained here
The college dataset was used in the course notes
The mpg dataset is described in more detail below
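The ggplot2 package is intended to follow the “grammar of graphics”, a coherent philosophy of independent components that can be added together to build a plot. Every ggplot is made up of three components: the data, the aesthetics (which variables are mapped to which visual properties), and the layers (the geometric elements drawn on the plot).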
This lab will be an introduction to these components and how they relate to the construction of high-quality, publication-ready graphics.
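As a rough sketch of how components are added together, the snippet below builds a plot in two steps using the `mpg` data that ships with ggplot2. The object names `base` and `with_points` are made up for illustration, and reading a plot's `layers` element is just one convenient way to check what has been added.

```r
library(ggplot2)

# Data plus aesthetic mapping only: this would draw an empty set of axes
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Adding a layer with `+` returns a new plot; `base` itself is unchanged
with_points <- base + geom_point()

length(base$layers)         # number of layers before adding the geom
length(with_points$layers)  # number of layers after adding the geom
```

Storing the partial plot in an object is optional, but it makes it easy to try several different layers on top of the same data and aesthetics.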
ggplot(majors, aes(x = Per_Male, y = Bach_Med_Income, color = Category)) +
geom_point()
head(majors[, c(1, 2, 4, 22)])
## Major Category Per_Male Bach_Med_Income
## 1 Computer Science Science 75.1 101600
## 2 Engineering Science 81.3 90490
## 3 Civil Engineering Science 77.7 94370
## 4 Electrical Engineering Science 84.5 106600
## 5 Mechanical Engineering Science 88.1 99180
## 6 Mathmatics Science 57.0 76350
For this lab, one of the datasets we will be using is the mpg dataset included within the ggplot2 package. This dataset contains a subset of fuel economy data that the EPA collected for various car models in 1999 and 2008. We begin with a quick visual inspection of the variables in our dataset:
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
A description of all of the variables can be read with ?mpg.
The aesthetics of a ggplot2 plot (which we simply call a “ggplot”) establish the most general properties relating our collected data to the visual representation we wish to create. Specifically, the aesthetics create a map from the variables in the data to visual properties of the graph. This will be an important point to keep in the back of our minds as our aesthetics become increasingly detailed.
For example, suppose we want the cty variable (city mpg) to be represented on the x-axis of our plot. The general syntax is as follows:
ggplot(mpg, aes(x = cty))
To include a variable on the y-axis, we would also specify this in the aesthetics. Here, we create a blank plot that has city mileage on the x-axis and highway mileage on the y-axis:
ggplot(mpg, aes(x = cty, y = hwy))
Every ggplot will begin this way: we have the name of the function (ggplot()), the data that is being used (mpg), and a set of variables that we wish to include in the aesthetics (aes(x = cty, y = hwy)). Recall that the names of the variables in a dataset can be found in the Environment Panel. Of course, not every plot will require both axes; a histogram, for example, only needs an x-axis.
We will see additional interesting components that we can add to the aesthetics shortly.
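One way to check what an aesthetic mapping contains, sketched here with the `mpg` data that ships with ggplot2, is to store the plot in an object and look at its `mapping` element. The object names `p` and `p_color` are only placeholders.

```r
library(ggplot2)

# Map city mileage to the x-axis and highway mileage to the y-axis
p <- ggplot(mpg, aes(x = cty, y = hwy))

# Names of the aesthetics that have been mapped so far
names(p$mapping)

# Adding another aes() to an existing plot extends its mapping
# (ggplot2 stores `color` under its British spelling, `colour`)
p_color <- p + aes(color = drv)
names(p_color$mapping)
```

Nothing is drawn by a mapping on its own; it only records which variable goes with which visual property.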
The actual visual components that we wish to include in the plot are given in layers, which we add to the plot with +. The layers in ggplot are often referred to as “geoms” (for geometry) and are added following the ggplot() call. For example, here we add the “point” geom to create a scatterplot:
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
Note that this is the same plot that we had created above, except now we have added a layer of points.
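Because layers are simply added with `+`, more than one geom can share the same data and aesthetics. The sketch below stacks a fitted line on top of the points; the choice of `method = "lm"` (a straight-line fit) is an assumption made for this example, not something the lab prescribes.

```r
library(ggplot2)

# Points first, then a straight-line smoother drawn on top of them
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

length(p$layers)  # one entry per geom added
print(p)
```

Layers are drawn in the order they are added, so the line appears on top of the points.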
Most (but not all) layers that we might add in ggplot begin with geom_ to help indicate that they can be thought of as the actual, geometric elements (lines, points, etc.) that you will see on the plot. In addition to geom_point(), there are a few others that are worth being aware of:
geom_point - scatter plot
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoother for data
geom_boxplot - box plot
geom_histogram - histogram
geom_violin - violin plot
While the x and y axes are the most common aesthetics, we can take further advantage of the mapping between data and visual properties. Here, we consider the majors dataset as an example, plotting the relationship between the percentage male in a field and the median income for a bachelor’s degree:
ggplot(majors, aes(Per_Male, Bach_Med_Income)) +
geom_point()
Consider the additional information we see when we include color in the aesthetics:
## Adding color = Category
ggplot(majors, aes(Per_Male, Bach_Med_Income, color = Category)) +
geom_point()
This is an example of a multivariate plot; it allows us to succinctly describe the relationship between 3 or more variables.
Question 1: Looking at the multivariate plot above, what additional information is gleaned by including Category as one of the variables?
This lab will be divided into three sections: creating univariate plots, bivariate plots, and then exploring multivariate plots. At the end of the lab is a (mostly) comprehensive collection of examples. These examples include different arguments that we can add to a ggplot to change, for example, how the bar charts are organized or to modify the number of bins in a histogram.
Question 2: The class variable in the mpg dataset details the “type” of car for each observation (pickup, SUV). Create the appropriate graph to show the distribution of this variable and comment on which class appears to be the most frequent.
Question 3: The cty variable in the mpg dataset details the average fuel economy for the vehicle when driven in the city. Create the appropriate graph to represent this variable. Does this variable appear to be symmetric or skewed?
Question 4: Using the mpg dataset, we are interested in determining whether front-wheel drive, rear-wheel drive, or four-wheel drive vehicles have the best highway fuel economy. Create the appropriate plot to show the distribution of fuel economy for each of these types of vehicles. Which appears to have the best fuel economy?
Question 5: In the majors dataset, the variable Per_Masters describes the percentage of the workforce within a particular field whose highest degree is a master’s degree. Create a graph to assess whether or not there appears to be a relationship between the percentage of individuals who hold a master’s degree and the percentage that is unemployed. What do you find?
Question 6: Using the mpg dataset, create a plot to find which vehicle class contains the largest number of vehicles with 6-cylinder (cyl) engines. Which class has the largest proportion of 6-cylinder engines?
Question 7: First, using the college dataset, create a plot demonstrating the relationship between the region of the schools and their admission rates. Then, use either color/fill or faceting to add the Private variable to the plot. Now create the same plot again, this time adding color if you previously faceted or faceting if you previously added color. Between these two plots, which appears more useful in describing the relationship between these three variables?
Question 8: Below is code that reads in the tips data that are used in the Example section of this lab. First identify the four variables from the tips data that are used to make this plot and then write code to reproduce it. (This plot uses faceting, which has examples at the end of the lab.)
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")
Question 9 The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data for this question.
## Read in data
movies <- read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
## Simplify data
movies <- subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") &
                   Genre %in% c("Action", "Comedy", "Drama"),
                 select = c("Movie", "LeadStudio", "Story", "Genre", "Budget",
                            "TheatersOpenWeek", "Year", "OpeningWeekend"))
Your goal in this question is to create a graphic that effectively differentiates movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable OpeningWeekend records this revenue in millions of US dollars).
Your plot should include at least three variables from the dataset, either by including them in the axes, through color or fill, or through faceting. Finally, using the graph you have created, write 2-3 sentences explaining what trends you found and what attributes seem to predict a film having low or high opening weekend revenue.
These examples will be done with the tips dataset, loaded in below:
## Read in the "Tips" data
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")
We can change the number of bins with the bins argument. Feel free to play around with different numbers until you get what you want:
ggplot(tips, aes(x = Tip)) + geom_histogram(bins = 15)
I think the default colors here are incredibly ugly, so I usually add outlines and a slightly less offensive fill color to single-variable histograms:
ggplot(tips, aes(x = Tip)) + geom_histogram(bins = 15, color = 'black', fill = 'gray')
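An alternative to choosing the number of bins is to set the width of each bin with the `binwidth` argument of `geom_histogram()`. The sketch below uses the `hwy` variable from the `mpg` data that ships with ggplot2 so that it runs on its own; the width of 2 mpg is an arbitrary choice.

```r
library(ggplot2)

# Fix the number of bins
p_bins <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 15, color = "black", fill = "gray")

# Fix the width of each bin instead (here, 2 mpg per bin)
p_width <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "black", fill = "gray")

# layer_data() returns the values ggplot2 computed for each bar
bars <- layer_data(p_width)
sum(bars$count)  # every car falls in exactly one bin
```

Setting `binwidth` can be easier to explain in a write-up, since each bar then covers a range stated in the units of the variable.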
ggplot(tips, aes(x = Tip)) + geom_boxplot()
We can add groups by passing an extra categorical argument to aes():
ggplot(tips, aes(Tip, y = Day)) + geom_boxplot()
We can add more groups by including a color or fill aesthetic:
ggplot(tips, aes(Tip, y = Day, fill = Smoker)) + geom_boxplot()
ggplot(tips, aes(Tip, y = Day, color = Sex)) + geom_boxplot()
A basic scatter plot requires only two quantitative variables
ggplot(tips, aes(TotBill, Tip)) + geom_point()
We can add color to this plot to represent a third variable: note that categorical variables will create a collection of “discrete” colors, while adding a quantitative variable will give us color on a scale
# Categorical variable added
ggplot(tips, aes(TotBill, Tip, color = Time)) + geom_point()
# Quantitative variable added
ggplot(tips, aes(TotBill, Tip, color = Size)) + geom_point()
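Whether ggplot2 uses discrete colors or a color scale depends on how the variable is stored. As a sketch using a fresh copy of the `mpg` data from ggplot2 (where `cyl` is stored as a number), wrapping the variable in `factor()` switches the legend from a gradient to discrete colors. The object names here are made up for the example.

```r
library(ggplot2)

cars <- as.data.frame(ggplot2::mpg)  # fresh copy; cyl is numeric here

# Numeric variable: colors come from a continuous gradient
p_gradient <- ggplot(cars, aes(displ, hwy, color = cyl)) +
  geom_point()

# Same variable as a factor: one distinct color per level
p_discrete <- ggplot(cars, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

# Count the distinct colors actually used for the points
length(unique(layer_data(p_discrete)$colour))
```

This is presumably why the setup code at the top of the lab converts `cyl` and `year` to factors before plotting.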
The simplest bar chart shows the distribution of a single categorical variable:
ggplot(tips, aes(Time)) + geom_bar()
When adding additional categorical variables, the default behavior is to give a stacked bar chart
ggplot(tips, aes(Time, fill = Sex)) + geom_bar()
Note that for bar charts our second aesthetic will almost always be fill =. This tells ggplot to “fill in” with the color of the group. Try doing color = Sex to compare.
We can change to a dodged bar chart by passing an argument to geom_bar():
ggplot(tips, aes(Time, fill = Sex)) + geom_bar(position = "dodge")
If some combination of the two grouping variables has zero observations, the default behavior for dodge looks weird. Here’s an example from the mpg dataset:
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "dodge")
Compare the width of the bars for 4 cylinder, 5 cylinder, and 8 cylinder; the default behavior is to fill in the missing groups by making the remaining bars wider. It’s a bit more typing to avoid this behavior, but it can be done:
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = position_dodge(preserve = "single"))
Finally, we can get the filled/conditional charts by using position = "fill":
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "fill")
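The bar segments in a filled chart are conditional proportions, which can also be computed as numbers. The following is a sketch using `count()`, `group_by()`, and `mutate()` from dplyr on the `mpg` data from ggplot2; the column name `prop` and the object name `props` are made up for the example.

```r
library(ggplot2)
library(dplyr)

props <- ggplot2::mpg %>%
  count(cyl, drv) %>%              # number of cars for each cyl/drv pair
  group_by(cyl) %>%
  mutate(prop = n / sum(n)) %>%    # share of each drv within its cyl group
  ungroup()

props
```

Within each value of `cyl` the proportions sum to 1, which matches every bar in the filled chart reaching the same height.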
“Faceting” is a process whereby we split our plots up by a group or categorical variable, allowing us to see two (or more) side-by-side plots. This can be a handy way to view the relationship between two variables separately, conditioned on a third:
ggplot(tips, aes(TotBill, Tip)) +
geom_point() +
facet_wrap(~Sex)
ggplot(tips, aes(TotBill, Tip)) +
geom_point() +
facet_wrap(~Day)
The syntax is facet_wrap(~var_name), and it can be added with + at the end of the plot.
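`facet_wrap()` also accepts arguments that control the layout of the panels, such as `ncol`. The sketch below facets the `mpg` data from ggplot2 by `class`; the choice of three columns is arbitrary.

```r
library(ggplot2)

# One panel per vehicle class, arranged in three columns
p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 3)

# Each row of the layer data records which panel it is drawn in
length(unique(layer_data(p)$PANEL))
print(p)
```

With many groups, choosing the number of columns by hand can keep the individual panels from becoming too narrow to read.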
Here are some extra examples of things you may be interested in modifying in your plot.
We can add labels and a title with the labs() function (short for labels). The arguments to change labels are shown below:
ggplot(tips, aes(TotBill, Tip)) +
geom_point() +
labs(x = "Total Bill", y = "Tip Amount", title = "My neat plot")
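`labs()` can also rename a legend by using the name of the aesthetic that produced it, and it accepts a `subtitle` and a `caption`. A sketch with the `mpg` data from ggplot2, where all of the label text is made up for illustration:

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy, color = drv)) +
  geom_point() +
  labs(x = "City mileage (mpg)",
       y = "Highway mileage (mpg)",
       color = "Drive train",  # legend title for the color aesthetic
       title = "Highway vs. city mileage",
       subtitle = "Cars from the mpg dataset",
       caption = "Source: ggplot2::mpg")

p$labels$title
print(p)
```

For a bar chart that uses a fill aesthetic, the same idea applies with `fill = "..."` in place of `color = "..."`.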
This is (mostly) just for fun. If we use either a color or fill aesthetic in our plots, we can change the colors
# Original
ggplot(tips, aes(TotBill, Tip, color = Day)) +
geom_point()
# Using scale_color_brewer
ggplot(tips, aes(TotBill, Tip, color = Day)) +
geom_point() +
scale_color_brewer(palette = "Set2")
The syntax is nearly identical for fill using scale_fill_brewer():
ggplot(mpg, aes(cyl, fill = drv)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Accent")
A list of additional palettes can be found with ?scale_fill_brewer() or ?scale_color_brewer()
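The palette names can also be listed from within R. This sketch assumes the RColorBrewer package is installed (it is normally installed alongside ggplot2) and reads its `brewer.pal.info` table; the object name `info` is a placeholder.

```r
# Table of every ColorBrewer palette: its size, type, and colorblind safety
info <- RColorBrewer::brewer.pal.info
head(info)

# Names of the qualitative palettes, which suit categorical variables
rownames(info)[info$category == "qual"]
```

Any of these names can be passed as the `palette` argument of `scale_color_brewer()` or `scale_fill_brewer()`.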