Preamble

Package and datasets

This lab is oriented around the plotly package, which can be installed in the usual way:

# install.packages("plotly")
library(plotly)
library(ggplot2)
library(dplyr)
library(stringr)

In addition to some of the datasets from the ggplot2 package, we will also use the college scorecard data that we have worked with previously:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Plotly itself is a large, open-source graphing library for creating interactive graphics, available in a variety of programming languages. While this lab focuses on the plotly package in R, the gist will carry over to plotly in languages such as Python, MATLAB, and Julia.


Converting existing plots

As mentioned above, the plotly package specializes in creating interactive plots and graphics. While plots can be made directly in plotly (indeed, demonstrating this is the purpose of this lab), it is worth noting that we can do a quick and dirty conversion of an existing ggplot into a plotly object with ggplotly():

# Standard ggplot (which we assign to variable p)
p <- ggplot(colleges, aes(Cost, Salary10yr_median, color = Private)) +
  geom_point()

ggplotly(p)

Notice first that you can now hover your mouse over any of the points to reveal information (called a tooltip) related to that observation. You can also click and drag your mouse around to zoom in or out, with a limited help bar appearing in the top right corner of the plot.

It’s worth observing that the tooltip simply lists all of the aesthetics relevant to a particular point. Here, for example, we add a useless fill aesthetic (it doesn’t apply to points) and see an additional line in the tooltip giving us redundant information:

# Add fill aes
p <- ggplot(colleges, aes(Cost, Salary10yr_median, fill = Private, color = Private)) +
  geom_point()

# Now we have an additional line in the tooltip
ggplotly(p)

We can modify this with the tooltip argument to ggplotly():

# Just include x/y aesthetic information
ggplotly(p, tooltip = c("x", "y"))
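
A common extension of this idea is to map a column to a text aesthetic and then ask ggplotly() to show only that aesthetic. The sketch below uses the mpg data from ggplot2 rather than the colleges data so that it runs without a download; the object names p_text and w_text are just illustrative. Note that ggplot2 itself does not use text when drawing points; it is carried along only so that plotly can display it.

```r
library(ggplot2)
library(plotly)

# Map the car model to `text`; geom_point() does not draw it,
# but ggplotly() can display it in the tooltip
p_text <- ggplot(mpg, aes(displ, hwy, color = drv, text = model)) +
  geom_point()

# Show only the `text` aesthetic when hovering
w_text <- ggplotly(p_text, tooltip = "text")
w_text
```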


Creating plots with plot_ly

Creating a plotly plot is not all that different from creating a ggplot. We have a primary plotting function, plot_ly() (akin to ggplot()), along with add_trace() for adding layers (akin to a geom such as geom_point()). One large difference is that, with plotly, we can create an entire plot from the plot_ly() function alone, without having to add layers afterwards:

plot_ly(data = colleges, type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private)

Here, note that

  • type = "scatter" tells plotly that we want a scatter plot
  • mode = "markers" tells plotly that we want dots (rather than lines, text, or other symbols)

Also note that, unlike ggplot, here we must specify the variables from our data frame with ~. For example, x = ~Cost tells plotly to look for the “Cost” variable in colleges. Without the ~, it would instead look for a variable in our environment called “Cost” which is not what we want.
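
To make the distinction concrete, here is a small sketch using the mpg data from ggplot2 (so that it runs without a download). The names displ_values and hwy_values are made up for this illustration. With ~, the variables are looked up as columns of the data frame; without ~, plotly uses objects that already exist in our environment.

```r
library(ggplot2)  # for the mpg data
library(plotly)

# With ~, displ and hwy are looked up as columns of mpg
p_formula <- plot_ly(data = mpg, type = "scatter", mode = "markers",
                     x = ~displ, y = ~hwy)

# Without ~, plotly uses vectors that exist in our environment
displ_values <- mpg$displ
hwy_values <- mpg$hwy
p_vectors <- plot_ly(type = "scatter", mode = "markers",
                     x = displ_values, y = hwy_values)

# Both plots end up being built from the same numbers
built_formula <- plotly_build(p_formula)$x$data[[1]]
built_vectors <- plotly_build(p_vectors)$x$data[[1]]
all.equal(as.numeric(built_formula$x), as.numeric(built_vectors$x))
```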

To create a plotly plot using add_trace(), we have syntax similar to that of ggplot(), except here we chain the pieces together with the %>% operator instead of +:

plot_ly(colleges) %>% 
  add_trace(type = "scatter", mode = "markers", 
            x = ~Cost, y = ~Salary10yr_median, color = ~Private)


Comparison of plotly and ggplot

The decision to use plotly or ggplot depends upon the end goal of your visualization. Here are some factors to consider:

ggplot                                 | plotly
---------------------------------------|----------------------------
Easier to construct complex graphics   | Interactive
Easier customization (colors, etc.)    | Allows for 3-D graphics
More legible syntax and grammar        | Allows for animations
Annotations and exporting              | Can convert ggplot graphics

Because plotly graphics are interactive, they tend to work nicely with R Shiny, which we will learn more about later in the semester.


Lab

Throughout this lab (and in general), it will be helpful to reference this plotly reference page to either explore different options or attempt to find parallels between what you are accustomed to doing in ggplot and what you would like to do with plotly.

Layering

Similar to ggplot, we can add multiple traces in plot_ly. Here, we revisit the mpg dataset from ggplot2, first plotting displacement and highway fuel economy, then adding a second trace with add_text(), which superimposes an additional layer:

data(mpg)
# I know, this plot is ugly
plot_ly(mpg) %>% 
  add_trace(type = "scatter", mode = "markers", 
            x = ~displ, y = ~hwy, color=~factor(year)) %>% 
  add_text(x = ~displ, y = ~hwy, text = ~drv)
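
The add_text() call above is one of several add_*() wrappers in the R plotly package that fill in type and mode for us. As a sketch, the same plot can be written with add_markers() in place of the first add_trace(); the name p_short is just illustrative.

```r
library(ggplot2)  # for the mpg data
library(plotly)

# add_markers() is shorthand for a scatter trace with mode = "markers";
# add_text() is shorthand for a scatter trace with mode = "text"
p_short <- plot_ly(mpg) %>%
  add_markers(x = ~displ, y = ~hwy, color = ~factor(year)) %>%
  add_text(x = ~displ, y = ~hwy, text = ~drv)
p_short
```

Which form to use is mostly a matter of taste; add_trace() keeps type and mode explicit, which can be easier to read while learning.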

Also like ggplot, as we will see in the next lab, we can supply a different dataset to each trace, so the x and y variables for a trace can come from somewhere else entirely. This is particularly useful when we think about the relationship between data and aesthetics. For example, in this next plot we create box plots of college cost by state for four states. Then, using a separate dataset that contains summary information (the mean cost in each state), we add a trace of markers on top:

# First, create dataset with mean cost by state, SEPARATE from main dataset
college_mean <- colleges %>% group_by(State) %>% 
  summarize(meanCost = mean(Cost, na.rm = TRUE)) %>% 
  filter(State %in% c("IA", "MN", "IL", "WI"))

# Pipe this into plotly, then create additional trace with `data =` 
colleges %>% 
  filter(State %in% c("IA", "MN", "IL", "WI")) %>%
  plot_ly() %>%
  add_trace(type = "box", x = ~Cost, y = ~State) %>% 
  add_trace(data = college_mean, x = ~meanCost, y = ~State,
            type = "scatter", mode = "markers")

Question 1: Using add_trace, create a violin plot that separately displays the distributions of the variable “Enrollment” for private and public colleges in the colleges dataset.

Custom Labels

Perhaps the most appealing feature of plotly is the ability to see a label when you hover over a data point or area of interest.

Although plotly doesn’t have the same explicit specification of aesthetics as ggplot, there are a number of arguments that essentially function as such. Similar to what we showed in the Preamble with the ggplot aesthetics, we can add additional meta-information to be included in the tooltip with the text argument:

plot_ly(data = colleges) %>%
  add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)

We can specify which of these aesthetics we wish to retain in the tooltip using the hoverinfo argument:

plot_ly(data = colleges, hoverinfo = "text") %>%
  add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)

In plotly, label text uses hypertext markup language (HTML), so HTML commands can be used to organize and modify the appearance of labels. Here, the <br> command stands for “break”, creating a newline between the name, city, and state:

library(stringr)
plot_ly(data = colleges, hoverinfo = "text") %>% 
    add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private,
        text = ~str_c(Name, "<br>", City, ", ", State))

In this example, str_c() is used to combine fixed character strings with variable values, and the string “<br>” is the HTML command used to begin a new line. More generally, observe that functions of the columns (in this case, str_c()) can be used in the specification of aesthetics.

Some other useful HTML commands include:

  • <b> my text </b> - Bolds the text in between the tags
  • <i> my text </i> - Italicizes the text in between the tags
  • x<sub>i</sub> - Adds a subscript; in this case we get \(x_i\)
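
The tags listed above can be combined within a single label. The sketch below again uses the mpg data from ggplot2 so that it runs without a download; the name mpg_labels is made up for this example.

```r
library(ggplot2)  # for the mpg data
library(stringr)
library(plotly)

# Bold manufacturer, italic model, and a second line with the model year
mpg_labels <- str_c("<b>", mpg$manufacturer, "</b> <i>", mpg$model, "</i>",
                    "<br>Year: ", mpg$year)

# Inspect the raw label for the first car
mpg_labels[1]

# mpg_labels lives in our environment, so no ~ is needed
plot_ly(data = mpg, hoverinfo = "text") %>%
  add_trace(type = "scatter", mode = "markers",
            x = ~displ, y = ~hwy, text = mpg_labels)
```

Building the labels as a separate vector first makes it easy to print a few of them and check the HTML before plotting.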

Question 2: Using the colleges data, create a scatter plot of the variables “FourYearComp_Males” and “FourYearComp_Females” that includes a custom label which shows each college’s name in bold text, and also shows on a new line its value for “PercentFemale” after the character string “Percentage Female:”. Use round() to limit the number of decimal places that appear in the tooltip.


3D Graphics

Another large appeal of the plotly package is its ability to render high-quality 3D graphics:

plot_ly(data = colleges, type = "scatter3d", mode = "markers",
        x = ~Enrollment, y = ~Cost, z = ~ACT_median)

As plotly graphs can be rotated with the mouse, they tend to be more effective visualizations than some of the 3D scatter plots generated by other packages.

As a quick aside, the size of these markers can be adjusted by passing a list of options to the marker argument, similar in spirit to the element functions used in ggplot:

plot_ly(data = colleges, type = "scatter3d", mode = "markers",
        x = ~Enrollment, y = ~Cost, z = ~ACT_median, 
        marker = list(size = 2)) # here, marker takes a list of arguments

Another 3D graph that plotly can render is a surface, which is useful for displaying model results along with the observed data. For example, here we consider a standard linear regression model that predicts median 10 year salary of a college’s graduates based upon the college’s cost and admission rates:

## Fit the model
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)

We’ll learn more about modeling later in the course. For now, we simply recognize that lm is the R function used to create linear regression models, and the Salary10yr_median ~ Cost + Adm_Rate is a formula syntax of the form y ~ x where y is the outcome and x is the collection of covariates. Here, the outcome is salary and the covariates are cost and admission rates.
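
As a small illustration of the formula syntax, here is the same pattern applied to mtcars, a dataset that ships with R (it is not part of this lab; the names fit_cars and new_cars are made up for the example).

```r
# qsec (quarter-mile time) is the outcome;
# wt (weight) and hp (horsepower) are the covariates
fit_cars <- lm(qsec ~ wt + hp, data = mtcars)

# One intercept plus one coefficient per covariate
coef(fit_cars)

# predict() evaluates the fitted model at new covariate values
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 200))
predict(fit_cars, newdata = new_cars)
```

The column names in newdata must match the covariate names in the formula, which is why the prediction grid later in this section uses the names Cost and Adm_Rate.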

Creating a surface from a model involves specifying a grid of covariate points and then asking what values the model would predict at each point of that grid. Since we want our grid to consist of values for cost and admission rate, we begin by creating a sequence going from the minimum to the maximum value of each covariate (its range) and then using the expand.grid() function in R to create a large data frame with one row for each pair of points.

# Create two sequences across range of covariates
x1 <- seq(min(colleges$Cost, na.rm = TRUE), 
          max(colleges$Cost, na.rm = TRUE), 
          length.out = 100)
x2 <- seq(min(colleges$Adm_Rate, na.rm = TRUE), 
          max(colleges$Adm_Rate, na.rm = TRUE), 
          length.out = 100)

# Expand these to create a grid
grid <- expand.grid(Cost = x1, Adm_Rate = x2)

Then, we create predictions of the outcome values for each of those points using the predict() function along with our model results:

# Create prediction
z <- predict(model, newdata = grid)
# Reformat these predictions into matrix
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)

# Create our plot with additional add_surface
plot_ly() %>%
  add_trace(data = colleges, type = "scatter3d", mode = "markers",
            x = ~Cost, y = ~Adm_Rate, z = ~Salary10yr_median,
            color = I("black"), marker = list(size = 3)) %>% 
  # Don't use ~ here since we want these variables from our environment
  add_surface(x = x1, y = x2, z = m, colorscale = "Blues")

To summarize:

  1. First, we created a linear model called model, giving us estimates based on observed values
  2. Second, we created variables x1 and x2, each of which represented the range of the observed data for our covariates, cost and admission rate
  3. Third, we used expand.grid() and predict() to evaluate our model at every point on a grid of x1 and x2 values, then reshaped those predictions into a matrix
  4. Finally, we added the layer add_surface(), which did not use a tilde (~) because it took variables from our environment. The first two arguments, x and y, asked for the sequences of values that the surface spans. Then z asked for a matrix of values, with an entry for each combination of points in x and y. See more with ?add_surface
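
Because both sequences above have length 100, it is hard to see why byrow = TRUE is needed when reshaping the predictions. The sketch below repeats the reshaping step on a tiny grid with made-up values (all of the _toy names are hypothetical), using a simple sum in place of predict(). The result has one row per y value and one column per x value, which is the layout a plotly surface expects for z.

```r
# Toy sequences standing in for x1 (Cost) and x2 (Adm_Rate)
x1_toy <- c(1, 2, 3)
x2_toy <- c(10, 20)

# expand.grid() varies its first argument fastest
grid_toy <- expand.grid(Cost = x1_toy, Adm_Rate = x2_toy)

# A stand-in for predict(): any function of the two covariates will do
z_toy <- grid_toy$Cost + grid_toy$Adm_Rate

# One row per x2 value, one column per x1 value
m_toy <- matrix(z_toy, nrow = length(x2_toy), ncol = length(x1_toy),
                byrow = TRUE)
m_toy

# Row i, column j holds the value at (x1_toy[j], x2_toy[i])
m_toy[2, 3] == x1_toy[3] + x2_toy[2]
```

If the two sequences have different lengths, nrow should be the length of the y sequence and ncol the length of the x sequence, as in this toy example.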

While it seems that there is a lot going on in our code above (and admittedly, there is), the general structure can be easily adapted to other models. For example, here we fit an entirely new type of model called a generalized additive model (GAM) with the same outcome and predictors, and only have to recompute our predictions z and the matrix m:

library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)

# Recreate predictions
z <- predict(model, newdata = grid)
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)

# Replot
plot_ly() %>%
  add_trace(data = colleges, type = "scatter3d", mode = "markers",
            x = ~Cost, y = ~Adm_Rate, z = ~Salary10yr_median,
            color = I("black"), marker = list(size = 3)) %>% 
  add_surface(x = x1, y = x2, z = m, colorscale = "Reds")

Question 3: Using this section’s code as a template, display the linear regression surface for the model Debt_median ~ Net_Tuition + ACT_median on a 3-D scatter plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the y-variable. You should use the lm() function to fit this model prior to Step 2.


Customizing Appearance

We can modify thematic elements of a plotly plot with layout, not unlike using the theme() function with ggplot. We won’t go into more detail here, but the internet is just a few clicks away.

plot_ly(data = colleges, type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>% 
  layout(yaxis = list(title = "10 Year Median Salary"), 
         xaxis = list(title = "Total Cost", color = 'red'))

Animation

Most plotly graphics can be turned into animations with the frame argument, which indicates a series of data snapshots the animation will progress through. While a natural candidate for a variable to snapshot through is time, it doesn’t have to be. Here, for example, we consider again the relationship between engine displacement and highway mileage, colored by drivetrain and animated through the number of cylinders:

plot_ly(mpg, x = ~displ, y = ~hwy, type = "scatter", mode = "markers", 
        color = ~factor(drv), frame = ~factor(cyl))

We can control different aspects of our animation with animation_opts(). The most common option to change is the frame argument, which sets the amount of time between frames in milliseconds (default is 500). The redraw argument controls whether or not the plot is completely re-rendered between frames (which impacts performance). The easing argument is a lot of fun: it changes how the markers move during transitions. Here, I have changed the frame variable from cyl to cty to give us more frames to move through.

plot_ly(mpg, x = ~displ, y = ~hwy, type = "scatter", mode = "markers", 
        color = ~factor(drv), frame = ~cty) %>% 
  animation_opts(frame = 750, easing = "bounce", redraw = FALSE)
  # animation_opts(frame = 750, easing = "elastic", redraw = FALSE)
  # animation_opts(frame = 750, easing = "cubic-in", redraw = FALSE)
  # animation_opts(frame = 750, easing = "exp-in-out", redraw = FALSE)
  # animation_opts(frame = 750, easing = "circle-in-out", redraw = FALSE)

You can find more easing options here between lines 68 and 103.

Question 4: The code below reads in data collected by Mother Jones that aims to document mass shootings in the United States (this dataset contains entries until 2019). For this question, create an animated plot that displays the cumulative total number of fatalities and injuries that have occurred in these shootings over time. An example of what this animation should look like is given below; yours should be similar, but it doesn’t need to be identical.

shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')

Hint: Two hints. First, plan out exactly what your data needs to look like to create this animation. What should the columns be? Work backwards from there to determine how to manipulate your data. Second, see ?cumsum
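
As a quick reminder of what cumsum() does, here is a sketch on a tiny made-up data frame (the values and the name toy_counts are hypothetical and are not taken from the shootings data). The rows are put in time order first, since a running total only makes sense in order.

```r
# Hypothetical yearly counts, deliberately out of order
toy_counts <- data.frame(year = c(2003, 2001, 2002),
                         count = c(4, 1, 2))

# Put the rows in time order before accumulating
toy_counts <- toy_counts[order(toy_counts$year), ]

# Each entry is the sum of all counts up to and including that row
toy_counts$running_total <- cumsum(toy_counts$count)
toy_counts
```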