ggplot2

The goals of this lab are three-fold: familiarize ourselves with
common visual summaries and the types of variables they represent, learn
the basics of data visualization in R, and create a ggplot2
reference for use later in the semester. We’ll begin by loading the
appropriate package, ggplot2, along with some of the data we
will be using. Create an R code chunk at the top of your lab and copy in
the code below:
## We load packages with the library() function
## This needs to be done every time you start a new R session
library(ggplot2)
# Make graphs less uggo for ggplot2
theme_set(theme_bw())
# Load majors data
majors <- read.csv("https://collinn.github.io/data/majors.csv")
# Get mpg dataset from ggplot2 package and modify variables
mpg <- as.data.frame(mpg)
mpg$cyl <- factor(mpg$cyl)
We will do most of our plotting in R with ggplot2, a package
that is intended to follow the “grammar
of graphics” (hence the “gg”). This grammar provides a coherent
philosophy of data visualization consisting of independent components
that can be added together to build a plot. Every ggplot is made up of
three components:
data, the dataset containing the variables we wish to plot
aesthetics, which map variables in the data to visual properties of the plot
layers, which render the data in the plot
The first dataset we will use today is the mpg dataset
included in ggplot2:
## head() is a function to show the first few rows of a data.frame
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
Type ?mpg in the console to see a complete description
of the data and the variables used.
Every plot we make with ggplot2 will begin with a call
to the function ggplot() and an argument specifying the
data we wish to use. Without any additional information provided, all R
does is reserve space and render a blank plot:
ggplot(data = mpg)
We begin to add specific attributes to a plot with the use of aesthetics. Specifically within ggplot, the aesthetics are used to establish a relationship between the variables in our data and the visual properties of the graph such as the scale of the X and Y axes or the colors used. Notably, this does not include the specific values of the variable. In other words, a mapping of a height variable for students in STA 209 to an X axis may indicate the range our X axis should take (say, 58 inches to 76 inches), but it will not include any information about individual heights we have observed.
To illustrate, we begin by establishing a single axis, the X axis,
mapped to the variable cty or city miles-per-gallon (noting
again that cty is a variable included in mpg).
Aesthetics in ggplot are added with the aes() function
ggplot(data = mpg, mapping = aes(x = cty))
Our plot still doesn’t have any data, but we can clearly see that one
attribute (and its range) has been mapped from the mpg
data.
The aesthetics we define will depend on a number of things including
the type of plot used and the number of variables included. For example,
a histogram only needs a single axis from the data (the Y axis being
determined by frequency), whereas a scatterplot needs two axes, one for
each of the variables. We can include additional aesthetics by placing
them in the aes() function. (Note, we do not need to
include data = and mapping = to call our
function as they are implied by the order).
ggplot(mpg, aes(x = cty, y = hwy))
Our ggplot now has two attributes, one each for cty and
hwy, yet still no observed data.
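To see that the named and positional forms describe the same plot, one can inspect the plot object itself. The following is a minimal sketch using the mpg data that ships with ggplot2. The names p_named and p_short are made up for illustration, and the mapping and layers fields are internal details of the plot object, so how they print may differ between ggplot2 versions.

```r
library(ggplot2)

# The same plot specified two ways: with and without the argument names
p_named <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy))
p_short <- ggplot(mpg, aes(x = cty, y = hwy))

# Both plots map the same two aesthetics
names(p_named$mapping)
names(p_short$mapping)

# Neither plot has any layers yet, which is why nothing is drawn
length(p_named$layers)
```

Both objects carry an x and a y mapping but no layers, which matches the empty plots shown so far.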
Once we have specified the data and aesthetics, we are ready to add
layers, which are responsible for rendering the data in a plot.
Layers are often referred to as “geoms” (for geometries) and can be
added to an existing ggplot with +. There are a number of
different geoms available, though for now we will add to our previous
plot geom_point(), giving us a scatterplot
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
This is the same plot as above, now with the individual observations included.
Most layers that we might add in ggplot begin with geom_
to help indicate that they can be thought of as the actual, geometric
elements (lines, points, etc.) that you will see on the plot. In
addition to geom_point(), there are a few others that are
worth being aware of:
geom_point - scatter plots
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoother for data
geom_boxplot - box plot
geom_histogram - histograms

None of these have to be memorized.
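As a rough illustration of how geoms can be swapped or stacked, the sketch below reuses the cty and hwy variables from the mpg data that ships with ggplot2. The object names p_jitter and p_smooth are made up for this example. Because both variables are whole numbers, many points sit exactly on top of one another, which is the situation geom_jitter() is meant to help with.

```r
library(ggplot2)

# Same aesthetics as the scatterplot, but with jittered points so that
# observations sharing the same (cty, hwy) values are no longer hidden
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter()

# Layers can be stacked: points first, then a smoother drawn on top
p_smooth <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth()

p_jitter
p_smooth
```

Each additional + adds one more layer to the same plot; the data and aesthetics are shared by every layer unless a layer overrides them.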
While the x and y axes are the most common aesthetics, we can take
advantage of the mapping between data and other visual properties. Here,
we consider the majors dataset as an example, plotting the
relationship between percentage male in a field against the median
income for a bachelor’s degree:
## Per_Male = Percent Male
ggplot(majors, aes(x = Per_Male, y = Bach_Med_Income)) +
geom_point()
By default, the first and second arguments to aes() are taken to be x
and y, so we do not need to specify them directly. In
addition to these, we can also include a color aesthetic,
which allows us to take another variable, in this case,
Category, and represent it in the plot as a distinct
color:
## Adding color = Category
ggplot(majors, aes(Per_Male, Bach_Med_Income, color = Category)) +
geom_point()
This is an example of a multivariate plot; it allows us to succinctly describe the relationship between 3 or more variables.
Question 1: Look at the two scatterplots above. From the first plot (without color), what can we see about the relationship between the variables? What information is added in the second plot, once we account for color?
The rest of this lab will be divided into three sections: creating univariate plots, creating bivariate plots, and then exploring multivariate plots. At the end of the lab is a (mostly) comprehensive collection of examples. These examples include different arguments that we can add to a ggplot to change, for example, how the bar charts are organized or to modify the number of bins in a histogram.
Univariate plots are those that provide a visual summary of a single variable. Variables are either quantitative, dealing with numerical quantities and typically taking any number within a range of values, or they are qualitative/categorical, describing the number of objects that have a particular property. As will typically be the case, the types of variables we are working with will dictate the appropriate types of plots to use.
For univariate quantitative variables, we will typically be
interested in histograms, which show the relative frequency of
numerical values over a particular domain. Consider the
majors dataset, which contains a variable indicating the
percentage within each college major that has obtained a PhD. As this is
only a single variable, we only need one aesthetic, and because we wish
to generate a histogram, we will use the layer/geom associated with
histograms:
## Per_PhD = percent with PhD
ggplot(majors, aes(x = Per_PhD)) +
geom_histogram()
Typically with layers, we can add additional arguments that change
their behavior. Here, I replicate the exact same histogram, this time
modifying the fill and color arguments inside of
geom_histogram() to make it easier to see:
ggplot(majors, aes(x = Per_PhD)) +
geom_histogram(color = 'black', fill = 'gray')
A histogram works through a process called binning: it
breaks a range of numbers (here, from 0 to about 25) into a number of
equally sized “bins”. Each observation that falls into a bin increases
the height of the associated bar by one. While there are few hard and fast
rules associated with binning, it is helpful to try modifying the
number of total bins to give us a better picture of the distribution of
the data. We do this with the bins argument in
geom_histogram():
ggplot(majors, aes(x = Per_PhD)) +
geom_histogram(bins = 10, color = 'black', fill = 'gray')
Question 2: Using the example above, play around with various numbers. Between 10, 15, and 20 bins, which of these do you think best illustrates the data we have? Why do you come to that conclusion?
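For those curious about what binning does under the hood, it can be imitated by hand in base R. This is only a sketch of the idea, using the hwy variable from the mpg data that ships with ggplot2; the names hwy, breaks, and bin_counts are made up here, and geom_histogram() chooses its bin edges a little differently, so its bars will not match these counts exactly.

```r
# Highway mileage from the mpg data that ships with ggplot2
hwy <- ggplot2::mpg$hwy

# Split the observed range into 10 equally sized bins (11 edges)
breaks <- seq(min(hwy), max(hwy), length.out = 11)

# Assign each observation to a bin, then count how many land in each
bin_counts <- table(cut(hwy, breaks = breaks, include.lowest = TRUE))
bin_counts

# Every observation falls into exactly one bin
sum(bin_counts) == length(hwy)
```

Changing length.out changes the number of bins, which is the hand-made counterpart of changing the bins argument.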
For categorical data, we are typically interested in either
counts or proportions. These are represented with
bar plots. Bar plots are similar to histograms in that they
have the appearance of vertical bars, but it is important to recognize
that they have different purposes and are associated with different
types of data. Similar to histograms, however, they are added with an
additional geom, geom_bar()
ggplot(majors, aes(x = Category)) +
geom_bar()
As you can see, the key difference between a bar plot and a histogram is the X axis; in the histogram, each bar or “bin” represents a range of numerical values while each bar in a bar plot is only ever associated with a single category.
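The heights of the bars in a bar plot are simply the counts of each category, which can also be computed directly. Below is a small sketch using the drv variable from the mpg data that ships with ggplot2; the names drv, drv_counts, and drv_props are made up for this example.

```r
# Drive type (f, r, or 4) from the mpg data that ships with ggplot2
drv <- ggplot2::mpg$drv

# Counts: one number per category, the heights geom_bar() would draw
drv_counts <- table(drv)
drv_counts

# Proportions: the same counts divided by the total number of observations
drv_props <- prop.table(drv_counts)
round(drv_props, 2)
```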
It is also possible to use the y aesthetic instead. How do you think the plot generated from the code below differs from the plot above?
ggplot(majors, aes(y = Category)) +
geom_bar()
Bivariate plots, as the name suggests, are plots presenting the relationship between two variables. As before, the nature of these variables will dictate the type of plot used. For two variables, we have the following possible arrangements:
two quantitative variables
one quantitative and one categorical variable
two categorical variables
The relationship between two quantitative variables is described with
a scatterplot. We have seen this already in this lab, and it is
added with the layer geom_point()
## Again, Per_ = percent
ggplot(majors, aes(Per_Masters, Per_PhD)) +
geom_point()
Changing the order of the X and Y axes will not change the relationship between the variables, though depending on our goals it often makes sense to choose one arrangement over the other. Typically, we will want our explanatory/independent variable on the X axis, with the response/dependent variable on the Y axis.
In the case with a single quantitative variable, we were primarily
interested in seeing the distribution of a single variable
within our dataset. This idea extends naturally with the addition of a
categorical variable, where now we are interested in seeing the
distribution of a quantitative variable within each category of
a categorical variable. This is done with a boxplot. A boxplot extends
the idea of a numerical, five-number summary into something visual,
giving indications of the median, quartiles, and range of a quantitative
variable. They typically are not used with a single quantitative variable
alone, but there is nothing stopping us from doing so. They are created
with the layer geom_boxplot().
## Only using a single quantitative variable
ggplot(majors, aes(x = Per_PhD)) +
geom_boxplot()
The large box indicates where the middle fifty percent of the data lies, with the line inside of the box representing the median. Additionally, the “whiskers” give us a sense of the range. The individual points represent outliers.
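The numbers behind a boxplot can be computed directly with base R. The sketch below uses the displ variable (engine displacement) from the mpg data that ships with ggplot2; the names displ, five, box_width, and outliers are made up for illustration. The hinges returned by fivenum() may differ slightly from the quartiles that geom_boxplot() draws, so treat this as an approximation of the plot rather than an exact reproduction.

```r
# Engine displacement from the mpg data that ships with ggplot2
displ <- ggplot2::mpg$displ

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(displ)
names(five) <- c("min", "lower", "median", "upper", "max")
five

# The box spans the lower to upper hinge (the middle half of the data)
box_width <- five[["upper"]] - five[["lower"]]

# Points more than 1.5 box widths beyond the box are flagged as outliers
outliers <- displ[displ < five[["lower"]] - 1.5 * box_width |
                  displ > five[["upper"]] + 1.5 * box_width]
outliers
```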
More useful would be to see this distribution across the categories of a second variable. As this involves mapping an additional variable to our plot, we indicate this by adding an additional aesthetic. It can be either X or Y, depending on how we would like our plot oriented:
## Quantitative and categorical variable
ggplot(majors, aes(x = Per_PhD, y = Category)) +
geom_boxplot()
Question 3: How does the plot above help illustrate
any association between the two variables, Per_PhD
and Category? Summarize your findings in 1-2
sentences.
To showcase situations with two categorical variables, we will use
the tips dataset below, containing data on bill information
for various meals at a restaurant. It may be worthwhile to take a minute
or two to investigate the variables included in this data.
tips <- read.csv("https://collinn.github.io/data/tips.csv")
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4
The case for presenting two categorical variables in a plot is slightly more involved than were either of the other two bivariate scenarios, largely on account of the different types of relationships we may be interested in displaying. We don’t need to worry about this in too much detail, as it will be covered in greater depth later in the course. For now, we will simply investigate the two primary ways we may be interested in viewing these relationships.
First, let’s look at a univariate bar chart showing the frequency of
meals that were served either for Lunch or Dinner with the variable
time
ggplot(tips, aes(x = time)) + geom_bar()
Simple enough. If we wish to further break this down by an additional
categorical variable, say, by smoking status, we can do so by adding an
additional aesthetic fill (as in, “fill this box with
color”) and assigning it the variable smoker:
ggplot(tips, aes(time, fill = smoker)) +
geom_bar()
This plot shows us two things: we see absolute counts indicating that more people had dinner than lunch, and we see a breakdown of smoking status within each time. However, because the counts for dinner and lunch are so different, it can be difficult for us to assess how the proportions differ within each group. We can change this by specifying that we want proportions instead of counts, changing our Y axis.
Similar to histograms, we can modify the layer itself by passing in
additional arguments. In this case, we can use the position
argument with the value “fill” to create a bar chart that shows us
proportions
## Note how the Y axis is now from 0 to 1
ggplot(tips, aes(time, fill = smoker)) +
geom_bar(position = "fill")
Here, we can now see that a slightly higher proportion of dinner diners were smokers, compared with those at lunch. As the number of unique values within a category increases, we will see that which variable is mapped to which aesthetic plays a large role in what relationships are being communicated.
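The proportions shown by position = "fill" are within-group proportions, and they can be checked by hand with a two-way table. Here is a minimal sketch using the cyl and drv variables from the mpg data that ships with ggplot2 (so that no download is needed); the names tab and props are made up for this example.

```r
# Two-way table of counts: rows are cylinders, columns are drive type
tab <- table(cyl = ggplot2::mpg$cyl, drv = ggplot2::mpg$drv)
tab

# Divide each row by its own total, so each row sums to 1.
# These are the segment heights a "fill" bar chart would show
# for aes(cyl, fill = drv).
props <- prop.table(tab, margin = 1)
round(props, 2)
```

Using margin = 2 instead would give proportions within each drive type, which corresponds to swapping which variable is on the axis and which is mapped to fill.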
The bar charts that we have seen until now are called “stacked” and
“proportional” bar charts, respectively. Often however, we would prefer
to see each of the partitioned categories on its own to compare, and we
can do this by modifying the position argument in
geom_bar():
ggplot(tips, aes(time, fill = sex)) +
geom_bar(position = "dodge")
There are a number of variations on each of these plots that can be employed, some of which are included in the “Examples” section below. We will look at these in more detail a little later in the course.
Finally we come to multivariate plots which visually represent three or more variables in a dataset. While there are a wide variety of ways to do this (including with the use of size or shape), we will primarily limit ourselves to the use of two: color and faceting.
For our multivariate plots, we will be using a subset of a dataset containing attributes and outcomes for all primarily undergrad institutions in the US with at least 400 full-time students for the year 2019. Our particular subset will consist of colleges in Iowa and a few neighboring states.
To load the data, simply copy and run the following:
college <- read.csv("https://collinn.github.io/data/college2019.csv")
midwest <- subset(college, State %in% c("IA", "MN", "IL", "MO"))
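The subset() call above keeps only the rows whose State appears in the given list. The same pattern is sketched below on the mpg data that ships with ggplot2, so that it runs without a download; the names cars, keep, and small are made up for illustration.

```r
# A plain data.frame copy of the mpg data that ships with ggplot2
cars <- as.data.frame(ggplot2::mpg)

# The values we want to keep
keep <- c("audi", "toyota")

# %in% is TRUE for each row whose manufacturer appears in `keep`;
# subset() retains only those rows
small <- subset(cars, manufacturer %in% keep)

# Only the requested manufacturers remain
table(small$manufacturer)
nrow(small) < nrow(cars)
```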
As we have seen previously in this lab, the color aesthetic can be used to include an additional variable in our plots. How the color aesthetic maps will be dependent on the type of variable included. For example, consider a scatterplot demonstrating the relationship between the cost of tuition and the average faculty salary when a categorical variable is included in the color aesthetic.
ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = Type)) +
geom_point()
We see that a legend is created to the right of the plot, with distinct colors provided for each of the values within the category. Contrast this with the case in which a quantitative variable is included in the color aesthetic:
ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = ACT_median)) +
geom_point()
How has this changed?
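Whether ggplot2 treats a variable as categorical or quantitative depends on how it is stored, not on what it means. As a sketch of this, the example below uses the cyl variable in the original mpg data from ggplot2, where it is stored as a number; the names cars, p_numeric, and p_factor are made up for illustration. Wrapping the variable in factor() is what switches the legend from a continuous scale to distinct colors.

```r
library(ggplot2)

# Fresh copy of the data, in which cyl is still stored as a number
cars <- as.data.frame(ggplot2::mpg)
is.numeric(cars$cyl)

# Numeric variable: color is drawn along a continuous gradient
p_numeric <- ggplot(cars, aes(displ, hwy, color = cyl)) +
  geom_point()

# Same variable converted to a factor: one distinct color per value
p_factor <- ggplot(cars, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

p_numeric
p_factor
```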
A quick note should be made with regards to adding color to a bar
chart. The color aesthetic in this case refers to the
outline of the bars, rather than the bars themselves. You’ll realize
pretty quickly if you have specified this when you didn’t intend to
## Eww
ggplot(midwest, aes(x = Region, color = Type)) +
geom_bar()
Instead, we want to use the closely related fill
aesthetic which “fills” the bars with color
ggplot(midwest, aes(x = Region, fill = Type)) +
geom_bar()
Note that this color/fill distinction applies when using boxplots as well.
There are many things we can do with the colors of our plots, but making adjustments ourselves will rarely be required in this course. For those interested, there are a few examples included in the Examples section at the bottom or a slightly more comprehensive set of examples in an old (read: unedited) lab from STA-230. Again, none of this is required.
While color allows us to add an additional variable within a single plot, faceting permits us to separate a plot into pieces based on the values of a category. This is helpful, for example, if we wish to see if various trends or relationships appear to be similar across groups.
Just as we used + to add a layer for our geoms, we use
+ again to add faceting. This syntax is a little bit
different than we have seen generally, but it is of the form
facet_wrap(~variable), where variable is the
name of the variable in the dataset we wish to facet. For example, to
facet the above scatter plot relating cost to faculty salary, we can
facet by State to view this for each of the different states
ggplot(midwest, aes(Cost, Avg_Fac_Salary)) +
geom_point() +
facet_wrap(~State)
Of course, there is nothing limiting our multivariate plots to three variables; we can do four quite easily by adding a color aesthetic to our faceted plot:
ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = Type)) +
geom_point() +
facet_wrap(~State)
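The same color-plus-faceting pattern can be tried on the mpg data that ships with ggplot2, which avoids needing the college data. This is only a sketch; the name p_facet and the particular choice of variables are made up for this example.

```r
library(ggplot2)

# Four variables in one graphic: two on the axes, one as color,
# and one used to split the plot into panels
p_facet <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  facet_wrap(~class)

p_facet
```

Each panel shares the same axes by default, which makes it easier to compare the trend from one group to the next.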
As we can see, plots allow us to very quickly and concisely summarize the relationships between all different types of variables in our datasets. Throughout the semester we will continue to focus on creating and presenting effective visual data summaries which will involve both identifying the correct types of plots to create, given the variables we wish to summarize, as well as utilizing faceting and the various aesthetics to best capture the associations and relationships we wish to present.
Question 4: The class variable in the
mpg dataset details the “type” of car for each observation
(pickup, SUV, etc.). Create the appropriate graph to demonstrate the
distribution of this variable and comment on which class appears to be
the most frequent.
Question 5: The cty variable in the
mpg dataset details the average fuel economy for the
vehicle when driven in the city. Create the appropriate graph to
represent this variable. Does this variable appear to be symmetric or
skewed?
Question 6: Using the mpg dataset, we
are interested in determining whether front wheel drive, rear wheel
drive, or 4 wheel drive vehicles have the best highway fuel economy. Create the
appropriate plot to show the distribution of fuel economy for each of
these types of vehicles. Which appears to have the best fuel
economy?
Question 7: In the majors dataset, the
variable Per_Masters describes the percentage of the
workforce within a particular field whose highest degree is a master’s
degree. Create a graph to assess whether or not there appears to be a
relationship between the percentage of individuals who hold a master’s
degree and the percent that is unemployed. What do you find?
Question 8: Using the mpg dataset,
create a plot to find which vehicle class produces the
largest number of vehicles with 6 cylinder (cyl) engines.
Which produces the largest proportion of 6 cylinder engines?
Question 9: Below is code that reads in
tips data that are used in the Example section of this lab.
First identify the four variables from the tips data that
are used to make this plot and then write code to reproduce it. (This
plot uses faceting which has examples at the end of the
lab)
tips <- read.csv("https://collinn.github.io/data/tips.csv")
Question 10: Using the full version of the
college dataset (provided again below), create two
different box plots with a fill aesthetic to
illustrate the relationship between region, type, and admission rate.
Which of your plots would be more useful in answering the question, “Do
private or public schools generally have a higher admission rate?”
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Question 11: The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data for this question.
## Read in data
movies <- read.csv("https://collinn.github.io/data/hollywood.csv")
## Simplify data
movies <- subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") &
Genre %in% c("Action", "Comedy", "Drama"),
select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
"TheatersOpenWeek","Year","OpeningWeekend"))
Your goal in this question is to create a graphic that effectively
differentiates movies with high revenue on opening weekend from those
with low revenue on opening weekend (the variable
OpeningWeekend records this revenue in millions of US
dollars). In other words, using the plotting methods included in this
lab, we want to create a visual summary of the data that answers the
question: which trends or attributes seem to predict a film having low
or high opening weekend revenue.
Your plot should include at least three variables from the dataset, either by including them in the axes, through color or fill, or through faceting. Finally, using the graph you have created, write 2-3 sentences detailing what you have found.
These examples will be done with the tips dataset,
loaded in below:
## Read in the "Tips" data
tips <- read.csv("https://collinn.github.io/data/tips.csv")
We can change the number of bins with the bins argument.
Feel free to play around with different numbers until you get what you
want
ggplot(tips, aes(x = tip)) + geom_histogram(bins = 15)
I think the default colors here are incredibly ugly, so I usually add lines and a slightly less offensive color to histograms in one variable
ggplot(tips, aes(x = tip)) + geom_histogram(bins = 15, color = 'black', fill = 'gray')
ggplot(tips, aes(x = tip)) + geom_boxplot()
We can add groups by passing an extra categorical argument to
aes()
ggplot(tips, aes(tip, y = day)) + geom_boxplot()
We can add more groups by including a color or fill aesthetic:
ggplot(tips, aes(tip, y = day, fill = smoker)) + geom_boxplot()
ggplot(tips, aes(tip, y = day, color = sex)) + geom_boxplot()
A basic scatter plot requires only two quantitative variables
ggplot(tips, aes(total_bill, tip)) + geom_point()
We can add color to this plot to represent a third variable: note that categorical variables will create a collection of “discrete” colors, while adding a quantitative variable will give us color on a scale
# Categorical variable added
ggplot(tips, aes(total_bill, tip, color = time)) + geom_point()
# Quantitative variable added
ggplot(tips, aes(total_bill, tip, color = size)) + geom_point()
The simplest bar plot returns the distribution of a single variable
ggplot(tips, aes(time)) + geom_bar()
When adding additional categorical variables, the default behavior is to give a stacked bar chart
ggplot(tips, aes(time, fill = sex)) + geom_bar()
Note that for bar charts our second aesthetic will almost always be
fill =. This tells ggplot to “fill in” with the color of
the group. Try doing color = sex to compare
We can change to a dodged bar chart by passing an argument to
geom_bar()
ggplot(tips, aes(time, fill = sex)) + geom_bar(position = "dodge")
If there is a case in which one group has zero observations from
another group, the default behavior for dodge looks weird. Here’s an
example from the mpg dataset:
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "dodge")
Compare the width of the bars for 4 cylinder, 5 cylinder, and 8 cylinder; the default behavior is to fill in the missing groups by making the bars wider. It’s a bit more typing to avoid this behavior, but it can be done
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = position_dodge(preserve = "single"))
Finally, we can get the filled/conditional charts by using
position = "fill"
ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "fill")
“Faceting” is a process whereby we split our plots up by a group or categorical variable, allowing us to see two (or more) side-by-side plots. This can be a handy way to view the relationship between two variables separately, conditioned on a third
ggplot(tips, aes(total_bill, tip)) +
geom_point() +
facet_wrap(~sex)
ggplot(tips, aes(total_bill, tip)) +
geom_point() +
facet_wrap(~day)
The syntax is facet_wrap(~var_name) and can be added
with + at the end of the plot
Here are some extra examples for things you may be interested in modifying in your plot
We can add labels and a title with the labs() function
(short for labels). The arguments to change labels are shown below
ggplot(tips, aes(total_bill, tip)) +
geom_point() +
labs(x = "Total Bill", y = "Tip Amount", title = "My neat plot")
This is (mostly) just for fun. If we use either a color or fill aesthetic in our plots, we can change the colors
# Original
ggplot(tips, aes(total_bill, tip, color = day)) +
geom_point()
# Using scale_color_brewer
ggplot(tips, aes(total_bill, tip, color = day)) +
geom_point() +
scale_color_brewer(palette = "Set2")
The syntax is nearly identical for fill using
scale_fill_brewer()
ggplot(mpg, aes(cyl, fill = drv)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Accent")
A list of additional palettes can be found with
?scale_fill_brewer or
?scale_color_brewer