## Copy this into your r setup chunk
knitr::opts_chunk$set(echo = TRUE, 
                      fig.align = 'center', 
                      fig.width = 4, 
                      fig.height = 4, 
                      message = FALSE, 
                      warning = FALSE)
library(ggplot2)
library(dplyr)

## Less uggo plots
theme_set(theme_bw())

Introduction

This lab serves primarily as a set of practice problems to hone our dplyr and ggplot skills.

dplyr Practice

Question 1

This question uses the penguins dataset.

penguins <- read.csv("https://collinn.github.io/data/penguins.csv")
  1. Using the mutate function, create a new variable in this dataset called size that takes the value "Small" if the penguin’s body mass is less than the median body mass and "Large" if it is greater than or equal to the median. Recall the function ifelse() which may be helpful here (?ifelse)

  2. Create a box plot illustrating the relationship between bill depth, size, and sex. Do small or large penguins tend to have deeper bills? Which group appears to have the greatest amount of variability?

  3. Often when plotting categorical variables, we wish to change the order in which they appear on a graph. By default, they appear in alphabetical order. To change this, we will use the function factor() along with the argument levels; factor() is a function that ensures our variable is categorical, while the levels argument allows us to specify the new order. For example, the code below re-levels the size variable so that "Small" comes before "Large":

penguins <- mutate(penguins, size2 = factor(size, levels = c("Small", "Large")))

Note: I have chosen to call this new variable size2 so that I can use this and the original size variable unchanged. In general in our work, however, we will typically choose to overwrite the original variable with the new ordering.

Recreate the box plot from (2.) with size2 and compare it to the original. Has it changed as you expected? (Note: this method of “re-leveling” will work for all categorical variables)
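
As a small illustration of the two tools used in this question, here is a sketch on a toy data frame rather than the penguins data. The data frame toy, its values, and the variable names weight_class and weight_class2 are made up for this example.

```r
library(dplyr)

# Toy data (hypothetical values, not the penguins dataset)
toy <- data.frame(mass = c(3000, 3500, 4000, 4500, 5000))

toy <- mutate(toy,
              # "Light" below the median, "Heavy" at or above it
              weight_class = ifelse(mass < median(mass), "Light", "Heavy"),
              # Same values, but stored as a factor with an explicit order
              weight_class2 = factor(weight_class, levels = c("Light", "Heavy")))

# Without levels, factor() orders the categories alphabetically
levels(factor(toy$weight_class))

# With levels, the categories follow the order we asked for
levels(toy$weight_class2)
```

ggplot uses the order of the factor levels when laying out categories on an axis or in a legend, which is why re-leveling changes the plot.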

Question 2

This question will use the titanic dataset.

## Copy this code to recreate the titanic dataset
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL
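
The third line of the code above is the least obvious one: it turns a table of counts into one row per individual. A minimal sketch of the same idea on a made-up frequency table (the data frame freq and its counts are hypothetical):

```r
# Toy frequency table (hypothetical counts)
freq <- data.frame(group = c("a", "b", "c"), Freq = c(2, 0, 3))

# rep() repeats row index i exactly Freq[i] times,
# so a row with Freq = 0 disappears entirely
expanded <- freq[rep(1:nrow(freq), times = freq$Freq), ]

# The count column is no longer needed once the rows are expanded
expanded$Freq <- NULL

# One row per individual: 2 + 0 + 3 = 5 rows
nrow(expanded)
table(expanded$group)
```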

Recreate the plot below by re-factoring the levels as we did in Question 1. Pay special note to the order of values presented for Sex, Class, and Survived.

You can modify the colors by adding scale_fill_brewer(palette = "Greens") to your ggplot. See ?scale_fill_brewer to check out other palettes, if you would like.

Question 3

This question uses the college dataset.

## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")

Here, we are interested in answering the question: “Which region has the largest outlier for median ACT relative to the region?”

  1. First, standardize the ACT_median variable by region and then create the appropriate box plot showing the relationship between region and standardized median ACT score (Hint: how do you find standardized values of a variable?). Which region appears to have the greatest outlier? Consider using mutate() and group_by() in this problem.

  2. Next, create a summary of the college dataset that shows the maximum and minimum median ACT values for each region. Which region has the largest median ACT? Which has the smallest? Consider using group_by() and summarize()

  3. Compare your answers from (2) with what you found in (1). Does having the largest/smallest ACT scores necessarily result in having outliers? Explain what is happening.
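
The key idea in this question is that mutate() after group_by() computes summaries such as the mean and standard deviation within each group. The sketch below shows this on the mpg dataset that ships with ggplot2 instead of the college data; the names mpg_z, cty_z, and check are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Standardize city mpg separately within each drive train:
# mean(cty) and sd(cty) are computed per group, not over the whole dataset
mpg_z <- mpg %>%
  group_by(drv) %>%
  mutate(cty_z = (cty - mean(cty)) / sd(cty)) %>%
  ungroup()

# Check: within each group the standardized values have mean 0 and sd 1
check <- mpg_z %>%
  group_by(drv) %>%
  summarize(mean_z = mean(cty_z), sd_z = sd(cty_z))
check
```

Because every group is put on the same scale, a large standardized value means "far from the rest of its own group", which is not the same as having the largest raw value.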

Question 4

This question uses the tips dataset. It will also require using dplyr functions, though this time, no additional hints as to which functions are provided.

tips <- read.csv("https://collinn.github.io/data/tips.csv")
  1. Begin by using the appropriate dplyr functions to filter the dataset to only include bills that occurred on either Saturday or Sunday and in which the size of the party was greater than 1.

  2. Create a new variable, tipPercent, that gives the tip as a percentage of the total bill.

  3. Construct a summary that includes the average tipping percentage of an individual by sex and smoker status. Along with this summary, also include the number of individuals making up each group (n())

  4. Create the appropriate plot to illustrate the relationship between total bill (x-axis) and tip percent (y-axis). What type of relationship do you see? Would Pearson’s correlation or Spearman’s correlation be more appropriate to describe this?
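
As a reminder of how the two correlations differ, here is a sketch on made-up data (x and y below are not from the tips dataset). Both are computed with cor(), using its method argument.

```r
# A relationship that always increases but is strongly curved
x <- 1:10
y <- x^3

# Pearson measures the strength of a *linear* relationship
pearson <- cor(x, y, method = "pearson")

# Spearman is computed on ranks, so it measures any monotone relationship
spearman <- cor(x, y, method = "spearman")

pearson   # less than 1, because the points do not fall on a line
spearman  # 1, because y always increases when x increases
```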

Extending ggplot

This section will cover some common extensions to our basic ggplots. Briefly, this will include:

  • Creating side-by-side plots
  • Modifying the plot without aesthetics, as well as introducing a handful of new modifications
  • Introducing geom_jitter()
  • Combining data summaries into additional layers.

Each section will serve to introduce a new idea or technique, each relatively simple in isolation. Our ultimate goal, however, will be the integration of data summaries into our plots.

Side-by-side plots

There are often times where we will want to create an arrangement of plots that can be presented adjacent to one another. A convenient method for doing so utilizes the gridExtra package, which can be installed by copying and pasting the code below into your console. Once the package is installed, you only ever need to load the package with the library() function, just as we do for ggplot2 and dplyr.

## Copy this and run in the console once to install package
# then delete -- DO NOT PUT IN YOUR RMARKDOWN FILE
if (!require(gridExtra)) {
  install.packages("gridExtra")
}

## Once installed, load package like you do with others
library(gridExtra)

Just like with dplyr and ggplot2, you will need to put library(gridExtra) at the top of your RMD document every time you want to use it.

Typically when we create a ggplot in an R code chunk it will immediately print out the associated graphic:

ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point()

However, just as with any other object in R (linear models, data frames, etc.), we can assign our ggplot to a variable name; doing so will stop the plot from printing out immediately:

## Note that nothing prints after this code chunk
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point()

If we do want to display the plot, it is as simple as writing out the variable name:

p1

In order to arrange our plots side-by-side, we need to start by saving each of them to a new variable. We can then use the grid.arrange() function included in the gridExtra package to specify how we would like them aligned. Here, we revisit a problem from a previous lab where we compare how the color aesthetic works when a quantitative variable is coded as a factor. Placing them side-by-side provides for easy comparison:

p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point() + 
  theme(legend.position = "bottom") # move legend to bottom

p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) + 
  geom_point() + 
  theme(legend.position = "bottom")

## Print them out in a single row
grid.arrange(p1, p2, nrow = 1)

You may find that your plots are a bit scrunched. You can remedy this by setting fig.width = 8 (or any other number) as a chunk option, which overrides the fig.width = 4 default from the setup chunk:

p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) + 
  geom_point() + 
  theme(legend.position = "bottom") # move legend to bottom

p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) + 
  geom_point() + 
  theme(legend.position = "bottom")

## This should look a bit cleaner
grid.arrange(p1, p2, nrow = 1)
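
grid.arrange() is not limited to a single row. The sketch below shows two other layouts, using the ncol and widths arguments, along with arrangeGrob(), which builds the same arrangement without drawing it. The plots themselves are just placeholders for this example.

```r
library(ggplot2)
library(gridExtra)

p1 <- ggplot(mpg, aes(displ, cty)) + geom_point()
p2 <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Stack the plots in a single column instead of a single row
grid.arrange(p1, p2, ncol = 1)

# One row, with the first plot twice as wide as the second
grid.arrange(p1, p2, nrow = 1, widths = c(2, 1))

# arrangeGrob() creates the arrangement as an object without printing it
stacked <- arrangeGrob(p1, p2, ncol = 1)
```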

Question 5

Using grid.arrange(), reproduce the two plots you created in Q1 with size and size2 to show side-by-side how using factor() and levels can allow you to rearrange your plots.

Modification without aesthetics

The logic of ggplot is such that we can map variables in our datasets to visual aspects of the plot with the aesthetics function aes(). However, it is also possible to modify plots directly by specifying values for an aesthetic outside of aes(). Consider the collection of options below:

ggplot(mpg, aes(displ, cty)) + 
  geom_point(color = 'blue', size = 6, shape = 17, alpha = 0.25)

We have

  • color, which we assign directly to be blue
  • size, which will change the size of the points represented
  • shape, specifying the shape of the points by number (17 is a filled triangle); the numeric codes are documented under pch in ?points
  • alpha, taking values between 0 and 1, which determines the amount of transparency in the points

There are two additional things that are important to note here:

  1. We have defined these attributes inside of the geom_point() function, meaning that they will only apply to the points being created in that layer. By contrast, aesthetic mappings placed inside aes() in ggplot() are inherited by every layer included in the plot (more on this soon)
  2. Because they are not inside of aes(), they are interpreted directly, i.e., ‘blue’ maps to the color blue

This last point is crucial. Anything placed inside of aes() is treated as data to be mapped, not as a literal value. If we were to write 'blue' inside of aes(), ggplot would treat it as a new variable whose only value is the label "blue": every point falls into that single category, receives the first default color, and a legend is added. For example, see what happens to color when we put it in aes():

## Adds a legend for "blue" and makes it reddish because there
## is no variable called "blue" in the dataset
ggplot(mpg, aes(displ, cty)) + 
  geom_point(aes(color = 'blue'))

Keep this in mind as we move into the next section.
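
One way to see the difference is to ask ggplot which colors it actually assigns to the points, which layer_data() reports. The sketch below compares a fixed color with a color mapped to the drv variable; the names p_fixed and p_mapped are made up for this example.

```r
library(ggplot2)

# Fixed value outside aes(): every point is blue and there is no legend
p_fixed <- ggplot(mpg, aes(displ, cty)) +
  geom_point(color = "blue")

# Variable inside aes(): color depends on drv and a legend is added
p_mapped <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(color = drv))

# layer_data() returns the data used to draw a layer,
# including the color assigned to each point
unique(layer_data(p_fixed)$colour)   # a single color
unique(layer_data(p_mapped)$colour)  # one color per value of drv
```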

Jittering and multiple layers

One of my favorite plots is created with the geom_jitter() function which, among its many uses, can be used as an alternative to the box plot for visualizing the relationship between a categorical and quantitative variable. By default, trying to create points aligning with a categorical variable will cause many of the points to overlap, as they are all aligned within their own category. The jitter function operates by applying a small amount of random variation to the x and y values, producing a nice effect:

## Using points
p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_point()

## Using jitter
p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter()

grid.arrange(p1, p2, nrow = 1)

We can modify the amount of variation with height and width arguments to the jittering function. There is no right or wrong answer, so feel free to play around with it:

p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_point()

p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter()

p3 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, color = 'dodgerblue')

grid.arrange(p1, p2, p3, nrow = 1)
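
Two details are worth knowing when jittering. First, jitter applied in the vertical direction moves points away from their true values of the quantitative variable, which can be avoided with height = 0. Second, the jitter is random, so the plot changes each time it is drawn unless a seed is set. The sketch below uses position_jitter() inside geom_point(), which is another way of requesting jitter; the seed value 1 is arbitrary.

```r
library(ggplot2)

# Jitter horizontally only (height = 0) and fix the random seed
# so that the plot looks the same every time it is drawn
p <- ggplot(mpg, aes(drv, cty)) +
  geom_point(position = position_jitter(width = 0.15, height = 0, seed = 1))
p

# With height = 0, the plotted y values are exactly the observed cty values
plotted_y <- layer_data(p)$y
```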

The plots created by geom_jitter() offer a nice alternative to those created by geom_boxplot() as, instead of summarizing the data into median, Q1, Q3, and outliers, they allow us to see directly the distribution and density of our observations. Of course, there is nothing to stop us from combining these ideas to create more interesting plots: because each geom represents a layer, we can readily add multiple layers to a plot. This allows us to nicely visualize how the geom_jitter() and geom_boxplot() functions compare. Note, however, that the layers are drawn in order, with each subsequent layer placed “on top of” the previous ones. In the first plot below, geom_boxplot() comes second, so the boxes are drawn on top of the jittered points; in the second plot, the order is reversed.

## outliers = FALSE (requires ggplot2 3.5.0 or later) hides the black
## outlier dots, which would otherwise duplicate the jittered points
p1 <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue") +
  geom_boxplot(fill = "pink", outliers = FALSE)

p2 <- ggplot(mpg, aes(drv, cty)) +
  geom_boxplot(fill = "pink", outliers = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue")


grid.arrange(p1, p2, nrow = 1)

Notice how in this case we specified color, fill, and alpha all outside of the aes() function. As such, ggplot was able to interpret these values directly.

Adding data summaries

We will conclude our discussion of ggplot in this lab with the introduction of data summaries in our plots. To do so, let’s first consider an aspect of ggplot that we have, until now, taken for granted. Within our ggplot function, we specify both a data argument and an aes() argument which creates a map from the variables in our data to the aesthetics in the plot. Each time we have added a layer (geom_jitter(), geom_boxplot()), we have inherited both the data and the values of the aesthetics. However, we could also specify data and aes() in each layer instead. The code below creates two of the exact same plots:

(NOTE: When doing this, we have to specify data = mpg in the layer or else we will get an error)

## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
  geom_boxplot()

## Values *not* inherited and are defined in layer
p2 <- ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty))

grid.arrange(p1, p2, nrow = 1)

This will only make a difference if we include an additional layer. As we did in the previous section, let’s try to add points associated with geom_jitter()

## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
  geom_boxplot() + 
  geom_jitter(width = 0.15, alpha = 0.5)

## Values *not* inherited and are defined in layer
p2 <- ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, alpha = 0.5)

grid.arrange(p1, p2, nrow = 1)

Now, our second plot has boxes but no points: because there is no dataset or aesthetics for geom_jitter() to inherit from ggplot(), that layer does not produce anything. We can rectify this by also specifying the data and aesthetics in that layer, giving us the desired plot:

## Add data and aes to geom_jitter
ggplot() +
  geom_boxplot(data = mpg, aes(drv, cty)) +
  geom_jitter(data = mpg, aes(drv, cty), width = 0.15, alpha = 0.5)

Why would we care about this when it seems like so much extra work? Because it allows us to create layers that contain completely different datasets.

For example, consider the boxplots above. We know from earlier in the semester that distributions that are skewed have mean values that differ from the median. We can use this technique to illustrate that visually.

If we wanted to plot the mean values for cty for each value of drv, we would start by creating a data summary with dplyr:

drv_summary <- mpg %>% group_by(drv) %>% 
  summarize(meanCty = mean(cty))

print(drv_summary)
## # A tibble: 3 × 2
##   drv   meanCty
##   <chr>   <dbl>
## 1 4        14.3
## 2 f        20.0
## 3 r        14.1

The new dataset drv_summary now has a variable drv, indicating drive train, and the variable meanCty, giving the average city miles per gallon. If we wanted to include just this in a plot, we could do so as follows:

## shape = 15 will give me filled squares instead of circles
ggplot(drv_summary, aes(meanCty, drv)) + 
  geom_point(color = 'red', size = 4, shape = 15)

Just as easily, we could include this with the box plots we created above. This time, we add an additional layer and specify data and aes() in the new layer:

ggplot() +
  geom_boxplot(data = mpg, aes(cty, drv)) +
  geom_point(data = drv_summary, aes(meanCty, drv), 
             color = 'red', size = 4, shape = 15)

Now, it becomes immediately clear that for 4wd and fwd, the distribution is skewed right (the mean is greater than the median), while for rwd, the distribution is skewed left (the mean is less than the median).
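
The same comparison can be made numerically by putting the mean and median side by side in one summary. This is a sketch; the names drv_compare, medianCty, and meanAboveMedian are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Mean and median city mpg for each drive train, plus a flag
# indicating whether the mean sits above the median
drv_compare <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty),
            medianCty = median(cty)) %>%
  mutate(meanAboveMedian = meanCty > medianCty)

drv_compare
```

A mean above the median points toward right skew and a mean below the median toward left skew, matching what the red squares show relative to the median lines in the box plots.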

The ability to add data summaries to our plots gives us a way to present standard visualizations to a reader that allow us to highlight particular attributes that are not typically represented by the disaggregated data alone.
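
Specifying everything in each layer is not the only option. A layer can also inherit from ggplot() and override only what differs. The sketch below is one possible variation, not the approach used above: the summary layer supplies its own data and replaces only the y aesthetic, while x = drv is inherited. It uses the vertical orientation from the earlier box plots.

```r
library(dplyr)
library(ggplot2)

drv_summary <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty))

# Box plots inherit data = mpg and aes(drv, cty) from ggplot().
# The point layer swaps in drv_summary and overrides y with meanCty;
# x = drv is inherited, which works because drv_summary also has a drv column
p <- ggplot(mpg, aes(drv, cty)) +
  geom_boxplot() +
  geom_point(data = drv_summary, aes(y = meanCty),
             color = "red", size = 4, shape = 15)
p
```

This only works when the summary dataset contains every inherited variable under the same name; otherwise the layer needs its own complete aes().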

Extra note on colored points

To create a colored point with a black outline, we need to use a shape that supports both a color (outline) and a fill (inside) aesthetic, such as shape = 21. Below is an example:

df <- data.frame(x = 1, y = 1)

ggplot(df, aes(x, y)) + 
  geom_point(shape = 21, color = 'black', fill = 'coral', size = 8)

This will be useful for the next questions.

Question 6

Recreate the following plot that shows the mean and median net tuition for each of the regions in the college dataset.

college <- read.csv("https://collinn.github.io/data/college2019.csv")

Note that:

  • The jitter function has height = 0.15 and alpha = 0.25
  • The colors are “aquamarine” for average and “magenta” for median
  • The size of the colored points is size = 3.5

Question 7

Recreate this plot from the ecological correlation lecture (we skipped this, but that doesn’t change our ability to produce this graph). The individual points are the schools, and the colored dots get their x and y values from the regions’ average admission rate and median debt, respectively.

Note that:

  • alpha = 0.25 for the black dots
  • size = 4 for the colored dots
  • You can add theme(legend.position = "bottom") to move the legend