Introduction

The goals of this lab are three-fold: to familiarize ourselves with common visual summaries and the types of variables they represent, to learn the basics of data visualization in R, and to create a ggplot2 reference for use later in the semester. We’ll begin by loading the appropriate package, ggplot2, along with some of the data we will be using. Create an R code chunk at the top of your lab and copy in the code below:

## We load packages with the library() function
## This needs to be done every time you start a new R session
library(ggplot2)

# Make graphs less uggo for ggplot2
theme_set(theme_bw())

# Load majors data
majors <- read.csv("https://collinn.github.io/data/majors.csv")

# Get mpg dataset from ggplot2 package and modify variables
mpg <- as.data.frame(mpg)
mpg$cyl <- factor(mpg$cyl)

Grammar of Graphics

We will do most of our plotting in R with ggplot2, a package that is intended to follow the “grammar of graphics” (hence the “gg”). This grammar provides a coherent philosophy of data visualization consisting of independent components that can be added together to build a plot. Every ggplot is made up of three components:

Data – the collected and observed variables we wish to visually represent
Aesthetics – a mapping between the data and visual properties, e.g., x and y axes, color, etc.
Layers – how the observations in our data are rendered, e.g., scatterplot, box plot, etc.

The first dataset we will use today is the mpg dataset included in ggplot2:

## head() is a function to show the first few rows of a data.frame
head(mpg)

##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

Type ?mpg in the console to see a complete description of the data and the variables used.

Aesthetics

Every plot we make with ggplot2 will begin with a call to the function ggplot() and an argument specifying the data we wish to use. Without any additional information provided, all R does is reserve space and render a blank plot:

ggplot(data = mpg)

We begin to add specific attributes to a plot with the use of aesthetics. Specifically within ggplot, the aesthetics are used to establish a relationship between the variables in our data and the visual properties of the graph such as the scale of the X and Y axes or the colors used. Notably, this does not include the specific values of the variable. In other words, a mapping of a height variable for students in STA 209 to an X axis may indicate the range our X axis should take (say, 58 inches to 76 inches), but it will not include any information about individual heights we have observed.

To illustrate, we begin by establishing a single axis, the X axis, mapped to the variable cty, or city miles-per-gallon (noting again that cty is a variable included in mpg). Aesthetics in ggplot are added with the aes() function:

ggplot(data = mpg, mapping = aes(x = cty))

Our plot still doesn’t have any data, but we can clearly see that one attribute (and its range) has been mapped from the mpg data.

The aesthetics we define will depend on a number of things, including the type of plot used and the number of variables included. For example, a histogram only needs a single axis from the data (the Y axis being determined by frequency), whereas a scatterplot needs two axes, one for each of the variables. We can include additional aesthetics by placing them in the aes() function. (Note: we do not need to include data = and mapping = to call our function, as they are implied by the order of the arguments.)

ggplot(mpg, aes(x = cty, y = hwy))

Our ggplot now has two attributes, one each for cty and hwy, yet still no observed data.
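One way to confirm that nothing is being drawn yet is to store the plot in a variable and inspect it. The sketch below assumes that a ggplot object exposes its aesthetic mapping and its list of layers through `$mapping` and `$layers`; the names `p` and `p2` are only for illustration.

```r
library(ggplot2)

# Store the plot instead of printing it (p is an illustrative name)
p <- ggplot(mpg, aes(x = cty, y = hwy))

# The aesthetics that have been mapped so far
names(p$mapping)

# The number of layers that have been added; there are no geoms yet
length(p$layers)

# Adding a geom adds one layer to the plot
p2 <- p + geom_point()
length(p2$layers)
```

Printing `p` gives the empty axes shown above, while printing `p2` would also draw the points.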

Layers

Once we have specified the data and aesthetics, we are ready to add layers, which are responsible for rendering the data in a plot. Layers are often referred to as “geoms” (for geometries) and can be added to an existing ggplot with +. There are a number of different geoms available, though for now we will add geom_point() to our previous plot, giving us a scatterplot:

ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point()

This is now the same plot as above, but with the individual observations now included.

Most layers that we might add in ggplot begin with geom_ to help indicate that they can be thought of as the actual, geometric elements (lines, points, etc.) that you will see on the plot. In addition to geom_point(), there are a few others that are worth being aware of:

geom_point - scatter plots
geom_jitter - scatter plot with “jittered” coordinates
geom_bar - bar graph for categorical data
geom_smooth - smoother for data
geom_boxplot - box plot
geom_histogram - histograms

None of these have to be memorized.
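Because layers are added with +, more than one geom can be placed on the same plot. As a small sketch (this particular combination is only an illustration and is not required anywhere in the lab), the code below jitters the points and then draws a smoother on top of them. The values given to width, height, and alpha are arbitrary choices, and `p_layers` is an illustrative name.

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5) +  # nudge overlapping points apart
  geom_smooth()                                          # add a smoothed trend line

p_layers
```

Layers are drawn in the order they are added, so the smoother here sits on top of the points.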

Additional Aesthetics

While the x and y axes are the most common aesthetics, we can take advantage of the mapping between data and other visual properties. Here, we consider the majors dataset as an example, plotting the relationship between percentage male in a field against the median income for a bachelor’s degree:

## Per_Male = Percent Male
ggplot(majors, aes(x = Per_Male, y = Bach_Med_Income)) + 
  geom_point()

By default, the first and second arguments to aes() are x and y, so we do not need to name them directly. In addition to these, we can also include a color aesthetic, which allows us to take another variable, in this case Category, and represent it in the plot with distinct colors:

## Adding color = Category
ggplot(majors, aes(Per_Male, Bach_Med_Income, color = Category)) + 
  geom_point()

This is an example of a multivariate plot; it allows us to succinctly describe the relationship between 3 or more variables.

Question 1: Look at the two scatterplots above. From the first plot (without color), what can we see about the relationship between the variables? What information is added in the second plot, once we account for color?

The rest of this lab will be divided into three sections: creating univariate plots, creating bivariate plots, and then exploring multivariate plots. At the end of the lab is a (mostly) comprehensive collection of examples. These examples include different arguments that we can add to a ggplot to change, for example, how the bar charts are organized or to modify the number of bins in a histogram.

Types of Plots

Univariate Plots

Univariate plots are those that provide a visual summary of a single variable. Variables are either quantitative, dealing with numerical quantities and typically taking any number within a range of values, or they are qualitative/categorical, describing which group or category an observation belongs to. As will typically be the case, the types of variables we are working with will dictate the appropriate types of plots to use.
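If you are not sure which type a variable is, it can help to check how R has stored it. The sketch below uses the mpg data from ggplot2; `mpg_df` is just an illustrative name for a copy of the data. Note that how a variable is stored is only a hint: cyl is stored as a number, but with so few distinct values we treat it as categorical, which is why the setup chunk converts it with factor().

```r
library(ggplot2)

# Work on a plain data.frame copy, as in the setup chunk
mpg_df <- as.data.frame(ggplot2::mpg)

# cty is stored as a number, so it is treated as quantitative
is.numeric(mpg_df$cty)

# class is stored as text, so it is treated as categorical
is.character(mpg_df$class)

# cyl is stored as a number, but only takes a handful of values
sort(unique(mpg_df$cyl))

# Converting to a factor tells R (and ggplot2) to treat it as categorical
mpg_df$cyl <- factor(mpg_df$cyl)
levels(mpg_df$cyl)

# str() gives a compact overview of every variable and its type
str(mpg_df)
```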

Quantitative

For univariate quantitative variables, we will typically be interested in histograms, which show the relative frequency of numerical values over a particular domain. Consider the majors dataset, which contains a variable indicating the percentage within each college major that has obtained a PhD. As this is only a single variable, we only need one aesthetic, and because we wish to generate a histogram, we will use the layer/geom associated with histograms:

## Per_PhD = percent with PhD
ggplot(majors, aes(x = Per_PhD)) + 
  geom_histogram()

Typically with layers, we can add additional arguments that change their behavior. Here, I replicate the exact same histogram, this time modifying the fill and color arguments inside of geom_histogram() to make it easier to see:

ggplot(majors, aes(x = Per_PhD)) + 
  geom_histogram(color = 'black', fill = 'gray')

Histograms work through a process called binning, which breaks a range of numbers (here, from 0 to about 25) into a number of equally sized “bins”. For each of our observations that falls into a bin, its associated bar increases by one. While there are few hard and fast rules associated with binning, it is helpful to try modifying the total number of bins to give us a better picture of the distribution of the data. We do this with the bins argument in geom_histogram():

ggplot(majors, aes(x = Per_PhD)) + 
  geom_histogram(bins = 10, color = 'black', fill = 'gray')
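To make the idea of binning concrete, here is a rough sketch that bins a variable by hand using cut() and table(). It uses cty from the mpg data so that it runs without downloading anything, and the names `x`, `n_bins`, `breaks`, `bin`, and `bin_counts` are illustrative. geom_histogram() places its bin edges somewhat differently, so the counts here will not necessarily match the bars in a ggplot histogram exactly.

```r
library(ggplot2)

x <- ggplot2::mpg$cty          # city miles-per-gallon, one value per car
n_bins <- 10                   # illustrative choice

# Cut the range of x into n_bins intervals of equal width
breaks <- seq(min(x), max(x), length.out = n_bins + 1)
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

# Count how many observations land in each interval
bin_counts <- table(bin)
bin_counts

# Every observation falls into exactly one bin
sum(bin_counts) == length(x)
```

Changing `n_bins` changes the width of each interval, which is exactly what the bins argument controls.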

Question 2: Using the example above, play around with various numbers. Between 10, 15, and 20 bins, which of these do you think best illustrates the data we have? Why do you come to that conclusion?

Categorical

For categorical data, we are typically interested in either counts or proportions. These are represented with bar plots. Bar plots are similar to histograms in that they have the appearance of vertical bars, but it is important to recognize that they have different purposes and are associated with different types of data. Similar to histograms, however, they are added with an additional geom, geom_bar():

ggplot(majors, aes(x = Category)) + 
  geom_bar()

As you can see, the key difference between a bar plot and a histogram is the X axis; in the histogram, each bar or “bin” represents a range of numerical values, while each bar in a bar plot is only ever associated with a single category.
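The height of each bar drawn by geom_bar() is simply the number of observations in that category, which we can check with table(). This sketch uses the drv variable from the mpg data rather than majors so that it runs without downloading anything; `drv_counts` is an illustrative name.

```r
library(ggplot2)

# Number of cars with each type of drive train; these are the bar heights
drv_counts <- table(ggplot2::mpg$drv)
drv_counts

# The same information as a bar plot
ggplot(mpg, aes(x = drv)) +
  geom_bar()
```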

It is also possible to use the y aesthetic instead. How do you think the plot generated from the code below differs from the plot above?

ggplot(majors, aes(y = Category)) + 
  geom_bar()

Bivariate Plots

Bivariate plots, as the name suggests, are plots presenting the relationship between two variables. As before, the nature of these variables will dictate the type of plot used. For two variables, we have the following possible arrangements:

Two quantitative
One quantitative, one categorical
Two categorical

Two Quantitative

The relationship between two quantitative variables is described with a scatterplot. We have seen this already in this lab, and it is added with the layer geom_point():

## Again, Per_ = percent
ggplot(majors, aes(Per_Masters, Per_PhD)) +
  geom_point()

Changing the order of the X and Y axes will not change the relationship between the variables, though depending on our goals it often makes sense to choose one arrangement over the other. Typically, we will want our explanatory/independent variable on the X axis, with the response/dependent variable on the Y axis.

One Quantitative, One Categorical

In the case with a single quantitative variable, we were primarily interested in seeing the distribution of a single variable within our dataset. This idea extends naturally with the addition of a categorical variable, where now we are interested in seeing the distribution of a quantitative variable within each category of a categorical variable. This is done with a boxplot. A boxplot extends the idea of a numerical five-number summary into something visual, giving indications of the median, quartiles, and range of a quantitative variable. Boxplots typically are not used with a single quantitative variable alone, but there is nothing stopping us from doing so. They are created with the layer geom_boxplot().

## Only using a single quantitative variable
ggplot(majors, aes(x = Per_PhD)) + 
  geom_boxplot()

The large box indicates where the middle fifty percent of the data lies, with the line inside of the box representing the median (here the line is vertical, because the variable is mapped to the X axis). Additionally, the “whiskers” give us a sense of the range. The individual points represent outliers.
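The pieces of a boxplot can also be computed directly, which may help connect the picture to the numbers. The sketch below uses hwy from the mpg data and assumes the usual convention that points more than 1.5 times the interquartile range beyond the box are drawn as outliers, which I believe is what geom_boxplot() does by default. The names `x`, `q`, `iqr`, `lower_fence`, `upper_fence`, and `outliers` are illustrative.

```r
library(ggplot2)

x <- ggplot2::mpg$hwy

# Minimum, first quartile, median, third quartile, maximum
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# The box runs from the first to the third quartile; its width is the IQR
iqr <- q[["75%"]] - q[["25%"]]

# Points more than 1.5 * IQR beyond the box are drawn individually
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```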

More useful would be to see this distribution across the categories of another variable. As this involves mapping an additional variable to our plot, we indicate this by adding an additional aesthetic. The categorical variable can go on either X or Y, depending on how we would like our plot oriented:

## Quantitative and categorical variable
ggplot(majors, aes(x = Per_PhD, y = Category)) + 
  geom_boxplot()

Question 3: How does the plot above help illustrate any association between the two variables, Per_PhD and Category? Summarize your findings in 1-2 sentences.

Two Categorical

To showcase situations with two categorical variables, we will use the tips dataset below, containing data on bill information for various meals at a restaurant. It may be worthwhile to take a minute or two to investigate the variables included in this data.

tips <- read.csv("https://collinn.github.io/data/tips.csv")
head(tips)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

The case for presenting two categorical variables in a plot is slightly more involved than were either of the other two bivariate scenarios, largely on account of the different types of relationships we may be interested in displaying. We don’t need to worry about this in too much detail, as it will be covered in greater depth later in the course. For now, we will simply investigate the two primary ways we may be interested in viewing these relationships.

First, let’s look at a univariate bar chart showing the frequency of meals that were served for either Lunch or Dinner, using the variable time:

ggplot(tips, aes(x = time)) + geom_bar()

Simple enough. If we wish to further break this down by an additional categorical variable, say, by smoking status, we can do so by adding an additional aesthetic fill (as in, “fill this box with color”) and assigning it the variable smoker:

ggplot(tips, aes(time, fill = smoker)) + 
  geom_bar()

This plot shows us two things: we see absolute counts indicating that more people had dinner than lunch, and we see a breakdown of smoking status within each time. However, because the counts for dinner and lunch are so different, it can be difficult for us to assess how the proportions differ within each group. We can change this by specifying that we want proportions instead of counts, changing our Y axis.

Similar to histograms, we can modify the layer itself by passing in additional arguments. In this case, we can use the position argument with the value "fill" to create a bar chart that shows us proportions:

## Note how the Y axis is now from 0 to 1
ggplot(tips, aes(time, fill = smoker)) + 
  geom_bar(position = "fill")

Here, we can now see that a slightly higher proportion of dinner diners were smokers, compared with those at lunch. As the number of unique values within a category increases, we will see that which variable is mapped to which aesthetic plays a large role in what relationships are being communicated.
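The segments in a position = "fill" bar chart are conditional proportions: within each bar, the counts are divided by that bar's total. As a sketch of the arithmetic, the code below does this by hand with table() and prop.table(). It uses cyl and drv from the mpg data rather than tips so that it runs without downloading anything, and `counts` and `props` are illustrative names.

```r
library(ggplot2)

# Counts of drive type (columns) within each number of cylinders (rows)
counts <- table(cyl = ggplot2::mpg$cyl, drv = ggplot2::mpg$drv)
counts

# Divide each row by its total so that every row sums to 1
props <- prop.table(counts, margin = 1)
round(props, 2)

# Each row now describes the make-up of one bar in the filled chart
rowSums(props)
```

Switching to margin = 2 would instead condition on drv, which corresponds to swapping which variable is on the X axis and which is used for fill.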

The bar charts that we have seen until now are called “stacked” and “proportional” bar charts, respectively. Often however, we would prefer to see each of the partitioned categories on its own to compare, and we can do this by modifying the position argument in geom_bar():

ggplot(tips, aes(time, fill = sex)) + 
  geom_bar(position = "dodge")

There are a number of variations on each of these plots that can be employed, some of which are included in the “Examples” section below. We will look at these in more detail a little later in the course.

Multivariate Plots

Finally we come to multivariate plots which visually represent three or more variables in a dataset. While there are a wide variety of ways to do this (including with the use of size or shape), we will primarily limit ourselves to the use of two: color and faceting.
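For completeness, here is a brief sketch of the size and shape aesthetics mentioned above, using the mpg data. We will not rely on these in this lab; the choice of variables and the alpha value are only for illustration, as is the name `p_shape`.

```r
library(ggplot2)

# shape takes a categorical variable; size takes a quantitative one
p_shape <- ggplot(mpg, aes(x = cty, y = hwy, shape = drv, size = displ)) +
  geom_point(alpha = 0.5)

p_shape
```

This plot shows four variables at once, though it is usually harder to read than one using color or faceting.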

For our multivariate plots, we will be using a subset of a dataset containing attributes and outcomes for all primarily undergraduate institutions in the US with at least 400 full-time students for the year 2019. Our particular subset will consist of colleges in Iowa and a few neighboring states.

To load the data, simply copy and run the following:

college <- read.csv("https://collinn.github.io/data/college2019.csv")
midwest <- subset(college, State %in% c("IA", "MN", "IL", "MO"))

Color

As we have seen previously in this lab, the color aesthetic can be used to include an additional variable in our plots. How the color aesthetic maps will be dependent on the type of variable included. For example, consider a scatterplot demonstrating the relationship between the cost of tuition and the average faculty salary when a categorical variable is included in the color aesthetic.

ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = Type)) + 
  geom_point()

We see that a legend is created to the right of the plot, with distinct colors provided for each of the values within the category. Contrast this with the case in which a quantitative variable is included in the color aesthetic:

ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = ACT_median)) + 
  geom_point()

How has this changed?

A quick note should be made with regard to adding color to a bar chart. The color aesthetic in this case refers to the outline of the bars, rather than the bars themselves. You’ll realize pretty quickly if you have specified this when you didn’t intend to:

## Eww
ggplot(midwest, aes(x = Region, color = Type)) + 
  geom_bar()

Instead, we want to use the closely related fill aesthetic, which “fills” the bars with color:

ggplot(midwest, aes(x = Region, fill = Type)) + 
  geom_bar()

Note that this color/fill distinction applies when using boxplots as well.

There are many things we can do with the colors of our plots, but making adjustments ourselves will rarely be required in this course. For those interested, there are a few examples included in the Examples section at the bottom or a slightly more comprehensive set of examples in an old (read: unedited) lab from STA-230. Again, none of this is required.

Faceting

While color allows us to add an additional variable within a single plot, faceting permits us to separate a plot into pieces based on the values of a category. This is helpful, for example, if we wish to see if various trends or relationships appear to be similar across groups.

Just as we used + to add a layer for our geoms, we use + again to add faceting. This syntax is a little bit different than what we have seen so far, but it is of the form facet_wrap(~variable), where variable is the name of the variable in the dataset we wish to facet by. For example, we can facet the above scatter plot relating cost to faculty salary by State to view this relationship separately for each of the different states:

ggplot(midwest, aes(Cost, Avg_Fac_Salary)) +
  geom_point() +
  facet_wrap(~State)

Of course, there is nothing limiting our multivariate plots to three variables; we can do four quite easily by adding a color aesthetic to our faceted plot:

ggplot(midwest, aes(Cost, Avg_Fac_Salary, color = Type)) +
  geom_point() +
  facet_wrap(~State)

As we can see, plots allow us to very quickly and concisely summarize the relationships between all different types of variables in our datasets. Throughout the semester we will continue to focus on creating and presenting effective visual data summaries which will involve both identifying the correct types of plots to create, given the variables we wish to summarize, as well as utilizing faceting and the various aesthetics to best capture the associations and relationships we wish to present.

Problem Sets

Part 1 - Univariate Plots

Question 4: The class variable in the mpg dataset details the “type” of car for each observation (pickup, SUV, etc.). Create the appropriate graph to demonstrate the distribution of this variable and comment on which class appears to be the most frequent.

Question 5: The cty variable in the mpg dataset details the average fuel economy for the vehicle when driven in the city. Create the appropriate graph to represent this variable. Does this variable appear to be symmetric or skewed?

Part 2 - Bivariate Plots

Question 6: Using the mpg dataset, we are interested in determining whether front wheel drive, rear wheel drive, or 4 wheel drive vehicles (captured by the variable drv) have better highway fuel economy (captured by the variable hwy). Create the appropriate plot to show the distribution of fuel economy for each of these types of vehicles. Which appears to have the best fuel economy?

Question 7: In the majors dataset, the variable Per_Masters describes the percentage of the workforce within a particular field whose highest degree is a master’s degree. Create a graph to assess whether or not there appears to be a relationship between the percentage of individuals who hold a master’s degree and the percent that is unemployed. What do you find?

Question 8: Using the mpg dataset, create a plot to find which vehicle class produces the largest number of vehicles with 6 cylinder (cyl) engines. Which produces the largest proportion of 6 cylinder engines? (Two notes: (1) cyl is a categorical variable in this example and (2) this question is asking for two plots)

Part 3 - Multivariate Plots

Question 9: Below is code that reads in tips data that are used in the Example section of this lab. First identify the four variables from the tips data that are used to make this plot and then write code to reproduce it. (This plot uses faceting which has examples at the end of the lab)

tips <- read.csv("https://collinn.github.io/data/tips.csv")

Question 10: Using the full version of the college dataset (provided again below), create two different box plots with a fill aesthetic to illustrate the relationship between region, type, and admission rate. Which of your plots would be more useful in answering the question, “Do private or public schools generally have a higher admission rate?”

college <- read.csv("https://collinn.github.io/data/college2019.csv")

Question 11: The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data for this question.

## Read in data
movies <- read.csv("https://collinn.github.io/data/hollywood.csv")

## Simplify data
movies <- subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") & 
                               Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
                                  "TheatersOpenWeek","Year","OpeningWeekend"))

Your goal in this question is to create a graphic that effectively differentiates movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable OpeningWeekend records this revenue in millions of US dollars). In other words, using the plotting methods included in this lab, we want to create a visual summary of the data that answers the question: which trends or attributes seem to predict a film having low or high opening weekend revenue.

Your plot should include at least three variables from the dataset, either by including them in the axes, through color or fill, or through faceting. Finally, using the graph you have created, write 2-3 sentences explaining what you have found.

Examples

These examples will be done with the tips dataset, loaded in below:

## Read in the "Tips" data
tips <- read.csv("https://collinn.github.io/data/tips.csv")

Histograms

We can change the number of bins with the bins argument. Feel free to play around with different numbers until you get what you want.

ggplot(tips, aes(x = tip)) + geom_histogram(bins = 15)

I think the default colors here are incredibly ugly, so I usually add lines and a slightly less offensive color to histograms in one variable

ggplot(tips, aes(x = tip)) + geom_histogram(bins = 15, color = 'black', fill = 'gray')

Boxplots

ggplot(tips, aes(x = tip)) + geom_boxplot()

We can add groups by passing an extra categorical argument to aes()

ggplot(tips, aes(tip, y = day)) + geom_boxplot()

We can add more groups by including a color or fill aesthetic:

ggplot(tips, aes(tip, y = day, fill = smoker)) + geom_boxplot()

ggplot(tips, aes(tip, y = day, color = sex)) + geom_boxplot()

Scatter plots

A basic scatter plot requires only two quantitative variables

ggplot(tips, aes(total_bill, tip)) + geom_point()

We can add color to this plot to represent a third variable: note that categorical variables will create a collection of “discrete” colors, while adding a quantitative variable will give us color on a scale

# Categorical variable added
ggplot(tips, aes(total_bill, tip, color = time)) + geom_point()

# Quantitative variable added
ggplot(tips, aes(total_bill, tip, color = size)) + geom_point()

Bar plots

The simplest bar plot returns the distribution of a single variable

ggplot(tips, aes(time)) + geom_bar()

Stacked

When adding additional categorical variables, the default behavior is to give a stacked bar chart

ggplot(tips, aes(time, fill = sex)) + geom_bar()

Note that for bar charts our second aesthetic will almost always be fill =. This tells ggplot to “fill in” with the color of the group. Try doing color = sex to compare.

Dodge

We can change to a dodged bar chart by passing an argument to geom_bar()

ggplot(tips, aes(time, fill = sex)) + geom_bar(position = "dodge")

If there is a case in which one group has zero observations from another group, the default behavior for dodge looks weird. Here’s an example from the mpg dataset:

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "dodge")

Compare the width of the bars for 4 cylinder, 5 cylinder, and 8 cylinder; the default behavior is to fill in the missing groups by making the bars wider. It’s a bit more typing to avoid this behavior, but it can be done

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = position_dodge(preserve = "single"))

Filled charts (conditional)

Finally, we can get the filled/conditional charts by using position = "fill"

ggplot(mpg, aes(cyl, fill = drv)) + geom_bar(position = "fill")

Faceting

“Faceting” is a process whereby we split our plots up by a group or categorical variable, allowing us to see two (or more) side-by-side plots. This can be a handy way to view the relationship between two variables separately, conditioned on a third.

ggplot(tips, aes(total_bill, tip)) + 
  geom_point() + 
  facet_wrap(~sex)

ggplot(tips, aes(total_bill, tip)) +
  geom_point() + 
  facet_wrap(~day)

The syntax is facet_wrap(~var_name) and can be added with + at the end of the plot

Extra Details

Here are some extra examples of things you may be interested in modifying in your plot.

Labels and title

We can add labels and a title with the labs() function (short for labels). The arguments to change labels are shown below

ggplot(tips, aes(total_bill, tip)) + 
  geom_point() +
  labs(x = "Total Bill", y = "Tip Amount", title = "My neat plot")

Colors

This is (mostly) just for fun. If we use either a color or fill aesthetic in our plots, we can change the colors

# Original
ggplot(tips, aes(total_bill, tip, color = day)) + 
  geom_point()

# Using scale_color_brewer
ggplot(tips, aes(total_bill, tip, color = day)) + 
  geom_point() + 
  scale_color_brewer(palette = "Set2")

The syntax is nearly identical for fill using scale_fill_brewer()

ggplot(mpg, aes(cyl, fill = drv)) + 
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Accent")

A list of additional palettes can be found with ?scale_fill_brewer or ?scale_color_brewer.
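The palettes can also be browsed from within R. The sketch below assumes the RColorBrewer package is available; it is the package behind the brewer scales and is normally installed alongside ggplot2, but that is an assumption about your setup. The name `set2` is illustrative.

```r
# RColorBrewer supplies the palettes used by the *_brewer() scales
library(RColorBrewer)

# Names of the available palettes, with their type and maximum size
head(brewer.pal.info)

# The hex codes for four colors from the "Set2" palette
set2 <- brewer.pal(n = 4, name = "Set2")
set2

# Draw every palette in a single reference figure
display.brewer.all()
```

Any palette name listed in `brewer.pal.info` can be passed to the palette argument of scale_color_brewer() or scale_fill_brewer().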
