plotly
This lab is oriented around the use of the plotly package, which can be installed in the usual way:
# install.packages("plotly")
library(plotly)
library(ggplot2)
library(dplyr)
library(stringr)
In addition to some of the datasets from the ggplot2 package, we will also be using the college scorecard data with which we have worked previously:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Plotly itself is a large, open-source graphing library for creating interactive graphics, available in a variety of programming languages. While this lab focuses on the plotly package in R, the gist will carry over to the use of plotly in other languages such as Python, MATLAB, and Julia.
As mentioned above, the plotly package specializes in creating interactive plots and graphics. While plots can be made directly in plotly (indeed, demonstrating this is the purpose of this lab), it is worth noting that we can do a quick and dirty conversion of an existing ggplot into a plotly object with ggplotly():
# Standard ggplot (which we assign to variable p)
p <- ggplot(colleges, aes(Cost, Salary10yr_median, color = Private)) +
geom_point()
ggplotly(p)
Notice first that you can now hover your mouse over any of the points to reveal information (called a tooltip) related to that observation. You can also click and drag your mouse around to zoom in or out, with a limited help bar appearing in the top right corner of the plot.
It’s worth observing that the tooltip simply lists all of the aesthetics relevant to a particular point. Here, for example, we add a useless fill aesthetic (it doesn’t apply to points) and see that we have an additional line in the tooltip giving us redundant information:
# Add fill aes
p <- ggplot(colleges, aes(Cost, Salary10yr_median, fill = Private, color = Private)) +
geom_point()
# Now we have an additional line in the tooltip
ggplotly(p)
We can modify this with the tooltip argument to ggplotly():
# Just include x/y aesthetic information
ggplotly(p, tooltip = c("x", "y"))
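ggplotly() also picks up an unofficial text aesthetic, which gives fuller control over what the tooltip shows. The sketch below is only an illustration: it uses the mpg data from ggplot2 rather than the colleges data, and the label format and the object names p_text and fig_text are arbitrary choices.

```r
library(ggplot2)
library(plotly)

# `text` is not used by geom_point() itself; ggplotly() reads it
# when it builds the tooltip
p_text <- ggplot(mpg, aes(displ, hwy, color = drv,
                          text = paste("Model:", model))) +
  geom_point()

# Show only the custom text in the tooltip
fig_text <- ggplotly(p_text, tooltip = "text")
# Run `fig_text` at the console to display the plot
```

Passing tooltip = c("text", "x") would keep the x value alongside the custom label.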
plot_ly
Creating a plotly plot is not all that different from creating a ggplot. We have a primary plotting function, plot_ly() (akin to ggplot()), along with the ability to add layers with add_trace() (akin to, for example, geom_point()). One large difference is that with plotly, we can create an entire plot from the plot_ly() function without having to add layers afterwards:
plot_ly(data = colleges, type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private)
Here, note that:
- type = "scatter" tells plotly that we want a scatter plot
- mode = "markers" tells plotly that we want dots (rather than lines, text, or other symbols)

Also note that, unlike ggplot, here we must specify the variables from our data frame with ~. For example, x = ~Cost tells plotly to look for the “Cost” variable in colleges. Without the ~, it would instead look for a variable in our environment called “Cost”, which is not what we want.
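To make the role of ~ concrete, here is a small sketch contrasting the two ways of supplying data. It uses the mpg data from ggplot2 so that it runs without a download; displ_vec, hwy_vec, fig_formula and fig_env are names made up for this illustration.

```r
library(plotly)
data(mpg, package = "ggplot2")

# With ~, displ and hwy are looked up as columns of mpg
fig_formula <- plot_ly(data = mpg, type = "scatter", mode = "markers",
                       x = ~displ, y = ~hwy)

# Without ~, plotly uses objects that already exist in our environment
displ_vec <- mpg$displ
hwy_vec <- mpg$hwy
fig_env <- plot_ly(type = "scatter", mode = "markers",
                   x = displ_vec, y = hwy_vec)
```

Both objects should describe the same scatter plot; the formula version is usually preferable because the column names travel with the data frame.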
To create a plotly plot using add_trace(), we have syntax similar to that of ggplot(), except here we use the %>% operator instead of +:
plot_ly(colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private)
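The plotly package also provides shorthand wrappers around add_trace(), such as add_markers(), which fill in the type and mode for us. A minimal sketch, assuming the mpg data from ggplot2 in place of the colleges data (fig_markers is just an illustrative name):

```r
library(plotly)
data(mpg, package = "ggplot2")

# add_markers() is shorthand for add_trace(type = "scatter", mode = "markers")
fig_markers <- plot_ly(mpg, x = ~displ, y = ~hwy) %>%
  add_markers(color = ~drv)
```

Aesthetic-like arguments given in plot_ly() (here x and y) are inherited by the traces added after it, much like aesthetics set in ggplot().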
plotly and ggplot
The decision to use plotly or ggplot depends upon the end goal of your visualization. Here are some factors to consider:
ggplot | plotly |
---|---|
Easier to construct complex graphics | Interactive |
Easier customization (colors, etc.) | Allows for 3-D graphics |
More legible syntax and grammar | Allows for animations |
Annotations and exporting | Can convert ggplot graphics |
Because plotly graphics are interactive, they tend to work nicely with R Shiny, which we will learn more about later in the semester.
Throughout this lab (and in general), it will be helpful to reference this plotly reference page to either explore different options or attempt to find parallels between what you are accustomed to doing in ggplot and what you would like to do with plotly.
Similar to ggplot, we can add multiple traces in plot_ly. Here, we revisit the mpg dataset from ggplot2, first plotting displacement and highway fuel economy, then adding a second trace with add_text(), which superimposes an additional layer:
data(mpg)
# I know, this plot is ugly
plot_ly(mpg) %>%
add_trace(type = "scatter", mode = "markers",
x = ~displ, y = ~hwy, color=~factor(year)) %>%
add_text(x = ~displ, y = ~hwy, text = ~drv)
Also like ggplot, as we will see in the next lab, we can add different datasets to each trace, allowing the x and y variables to be specified somewhere else completely. This is particularly useful when we think about the relationship between data and aesthetics. For example, in this next plot we create box plots of college cost by state for four states. Then, using a separate dataset that contains summary information, we add a trace that marks the mean cost for each state.
# First, create dataset with mean cost by state, SEPARATE from main dataset
college_mean <- colleges %>% group_by(State) %>%
summarize(meanCost = mean(Cost, na.rm = TRUE)) %>%
filter(State %in% c("IA", "MN", "IL", "WI"))
# Pipe this into plotly, then create additional trace with `data =`
colleges %>%
filter(State %in% c("IA", "MN", "IL", "WI")) %>%
plot_ly() %>%
add_trace(type = "box", x = ~Cost, y = ~State) %>%
add_trace(data = college_mean, x = ~meanCost, y = ~State,
type = "scatter", mode = "markers")
Question 1: Using add_trace(), create a violin plot that separately displays the distributions of the variable “Enrollment” for private and public colleges in the colleges dataset.
Perhaps the most appealing feature of plotly is the ability to see a label when you hover over a data point or area of interest.
Although plotly doesn’t have the same explicit specification of aesthetics as ggplot, there are a number of arguments that essentially function as such. Similar to what we showed in the Preamble with the ggplot aesthetics, we can add additional meta-information to be included in the tooltip with the text argument:
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
We can specify which of these aesthetics we wish to retain in the tooltip using the hoverinfo argument:
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
In plotly, label text uses hypertext markup language (HTML), so HTML commands can be used to organize and modify the appearance of labels. Here, the <br> command stands for “break”, creating a new line between the name and the city and state:
library(stringr)
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private,
text = ~str_c(Name, "<br>", City, ", ", State))
In this example, str_c() is used to combine fixed character strings with variable values, and the string “<br>” is the HTML command used to begin a new line. More generally, observe that functions of the columns can be used in the specification of aesthetics (in this case, str_c()).
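Because a label is just a character vector, it can help to build and inspect it before handing it to plot_ly(). The sketch below is an illustration on the built-in mtcars data, not the colleges data; it uses base R's paste0() in place of str_c(), wraps the car name in the HTML <b> tag for bold text, and the names car_labels and fig_labels are made up.

```r
library(plotly)

# Build the labels first so they can be inspected before plotting
car_labels <- paste0("<b>", rownames(mtcars), "</b><br>",
                     "mpg: ", round(mtcars$mpg, 1))
car_labels[1]
# "<b>Mazda RX4</b><br>mpg: 21"

# No ~ is needed for text here, since car_labels lives in our environment
fig_labels <- plot_ly(data = mtcars, type = "scatter", mode = "markers",
                      x = ~wt, y = ~mpg,
                      text = car_labels, hoverinfo = "text")
```

Checking the first few labels this way catches formatting mistakes without having to hover over points one at a time.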
Some other useful HTML commands include <b>...</b> for bold text and <i>...</i> for italic text.
Question 2: Using the colleges data, create a scatter plot of the variables “FourYearComp_Males” and “FourYearComp_Females” that includes a custom label which shows each college’s name in bold text, and also shows on a new line its value for “PercentFemale” after the character string “Percentage Female:”. Use round() to limit the number of decimal places that appear in the tooltip.
Another large appeal of the plotly package is its ability to render high quality 3D graphics:
plot_ly(data = colleges, type = "scatter3d", mode = "markers",
x = ~Enrollment, y = ~Cost, z = ~ACT_median)
As plotly graphs can be rotated with the mouse, they tend to be more effective visualizations than some of the 3D scatter plots generated by other packages.
As a quick aside, the size of these markers can be adjusted with functions similar to the element functions used in ggplot:
plot_ly(data = colleges, type = "scatter3d", mode = "markers",
x = ~Enrollment, y = ~Cost, z = ~ACT_median,
marker = list(size = 2)) # here, marker takes a list of arguments
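The marker list can carry more than size. As a hedged sketch, the example below also sets opacity and symbol on a 3D scatter plot of the built-in mtcars data (not the colleges data); the specific values are arbitrary, and fig_marker is an illustrative name.

```r
library(plotly)

# Each entry of the marker list sets one property of the points
fig_marker <- plot_ly(data = mtcars, type = "scatter3d", mode = "markers",
                      x = ~wt, y = ~hp, z = ~mpg,
                      marker = list(size = 4, opacity = 0.6,
                                    symbol = "diamond"))
```

Lowering opacity is often useful in 3D, where points in front would otherwise hide those behind them.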
Another 3D graph that plotly can render is a surface, which is useful for displaying model results along with the observed data. For example, here we consider a standard linear regression model that predicts the median 10 year salary of a college’s graduates based upon the college’s cost and admission rate:
## Fit the model
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)
We’ll learn more about modeling later in the course. For now, we simply recognize that lm() is the R function used to create linear regression models, and that Salary10yr_median ~ Cost + Adm_Rate uses a formula syntax of the form y ~ x, where y is the outcome and x is the collection of covariates. Here, the outcome is salary and the covariates are cost and admission rate.
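For readers who have not used lm() before, here is a minimal sketch of the same formula pattern on the built-in mtcars data, so it runs without the colleges file. The choice of outcome and covariates, and the names fit and new_car, are purely illustrative.

```r
# mtcars ships with R; mpg is the outcome, wt and hp are the covariates
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One intercept plus one slope per covariate
coef(fit)

# Predicted outcome for a hypothetical car weighing 3000 lbs with 110 hp
new_car <- data.frame(wt = 3, hp = 110)
predict(fit, newdata = new_car)
```

The newdata argument must contain columns named exactly as the covariates in the formula, which is why the prediction grid used for a surface needs matching column names.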
Creating a surface from a model involves specifying a grid of covariate points and then asking what values the model would predict at each point of that grid. Since we want our grid to consist of values for cost and admission rate, we begin by creating a sequence running from the minimum to the maximum value of each covariate (its range) and then using the expand.grid() function in R to create a large data frame consisting of each pair of points.
# Create two sequences across range of covariates
x1 <- seq(min(colleges$Cost, na.rm = TRUE),
max(colleges$Cost, na.rm = TRUE),
length.out = 100)
x2 <- seq(min(colleges$Adm_Rate, na.rm = TRUE),
max(colleges$Adm_Rate, na.rm = TRUE),
length.out = 100)
# Expand these to create a grid
grid <- expand.grid(Cost = x1, Adm_Rate = x2)
Then, we create predictions of the outcome values for each of those points using the predict() function along with our model results:
# Create prediction
z <- predict(model, newdata = grid)
# Reformat these predictions into matrix
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)
# Create our plot with additional add_surface
plot_ly() %>%
add_trace(data = colleges, type = "scatter3d", mode = "markers",
x = ~Cost, y = ~Adm_Rate, z = ~Salary10yr_median, color = I("black"),
marker = list(size = 3)) %>%
# Don't use ~ here since we want these variables from our environment
add_surface(x = x1, y = x2, z = m, colorscale = "Blues")
To summarize:
1. We fit a linear regression model, model, giving us estimates based on observed values
2. We created two sequences, x1 and x2, each of which spanned the range of the observed data for our covariates, cost and admission rate
3. We used the predict() function, based on our model and evaluated at a grid of points for x1 and x2, to create predictions at each point of our grid
4. We added the predictions to the plot with add_surface(), which did not use tilde (~) as it took variables from our environment. The first two arguments, x and y, asked for the range of values for the surface. Then z asked for a matrix of values, with an entry for each combination of points x and y. See more with ?add_surface
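The reshaping of predictions into a matrix is the step most likely to go wrong, so it can be worth checking on a tiny grid. In the sketch below, a simple sum stands in for predict(), and all object names (x1_small, x2_small, grid_small, z_small, m_small) are made up for the illustration.

```r
# A tiny grid: three x values and two y values
x1_small <- c(1, 2, 3)
x2_small <- c(10, 20)
grid_small <- expand.grid(a = x1_small, b = x2_small)

# expand.grid() varies its first argument fastest:
# (1,10), (2,10), (3,10), (1,20), (2,20), (3,20)
z_small <- grid_small$a + grid_small$b

# One row per y value and one column per x value, filled row by row
m_small <- matrix(z_small, nrow = length(x2_small),
                  ncol = length(x1_small), byrow = TRUE)

# m_small[j, i] now holds the value for x1_small[i] and x2_small[j]
m_small[2, 3]
# 23
```

When both sequences have the same length, as in the lab's 100 by 100 grid, nrow and ncol are interchangeable; with unequal lengths, nrow should match the y sequence and ncol the x sequence.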
While it seems that there is a lot going on in our code above (and admittedly, there is), the general structure can be easily adapted to other models. For example, here we fit an entirely different type of model called a generalized additive model (GAM) with the same outcome and predictors, now only having to recompute our predictions z:
library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
# Recreate predictions
z <- predict(model, newdata = grid)
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)
# Replot
plot_ly() %>%
add_trace(data = colleges, type = "scatter3d", mode = "markers",
x = ~Cost, y = ~Adm_Rate, z = ~Salary10yr_median, color = I("black"),
marker = list(size = 3)) %>%
add_surface(x = x1, y = x2, z = m, colorscale = "Reds")
Question 3: Using this section’s code as a template, display the linear regression surface for the model Debt_median ~ Net_Tuition + ACT_median on a 3-D scatter plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the y-variable. You should use the lm() function to fit this model prior to Step 2.
We can modify thematic elements of a plotly plot with layout(), not unlike using the theme() function with ggplot. We won’t go into more detail here, but the internet is just a few clicks away.
plot_ly(data = colleges, type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
layout(yaxis = list(title = "10 Year Median Salary"),
xaxis = list(title = "Total Cost", color = 'red'))
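As a further hedged sketch of layout(), the example below adds a plot title and moves the legend to a horizontal orientation, in addition to setting axis titles. It uses the built-in mtcars data, and the title strings and the name fig_layout are arbitrary choices for illustration.

```r
library(plotly)

fig_layout <- plot_ly(data = mtcars, type = "scatter", mode = "markers",
                      x = ~wt, y = ~mpg, color = ~factor(cyl)) %>%
  layout(title = "Fuel economy by weight",
         xaxis = list(title = "Weight (1000 lbs)"),
         yaxis = list(title = "Miles per gallon"),
         # "h" lays the legend entries out horizontally
         legend = list(orientation = "h"))
```

Each component of the plot (title, axes, legend) takes either a single value or a list of settings, mirroring how marker takes a list in the trace functions.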
Most plotly graphics can be turned into animations with the frame argument, which indicates a series of data snapshots the animation will progress through. While a natural candidate for a variable to snapshot through is time, it doesn’t have to be. Here, for example, we consider again the relationship between engine displacement and highway mileage, colored by drivetrain and animated through the number of cylinders:
plot_ly(mpg, x = ~displ, y = ~hwy, type = "scatter", mode = "markers",
color = ~factor(drv), frame = ~factor(cyl))
We can control different aspects of our animation with animation_opts(). The most common option to change is the frame argument, indicating the amount of time between frames in milliseconds (default is 500). The redraw argument controls whether or not the plot is completely re-rendered between frames (impacting performance). The easing argument is a lot of fun: it changes how the markers move between transitions. Here, I have adjusted the frame to be cty instead of cyl to give us more frames to move through.
plot_ly(mpg, x = ~displ, y = ~hwy, type = "scatter", mode = "markers",
color = ~factor(drv), frame = ~cty) %>%
animation_opts(frame = 750, easing = "bounce", redraw = FALSE)
# animation_opts(frame = 750, easing = "elastic", redraw = FALSE)
# animation_opts(frame = 750, easing = "cubic-in", redraw = FALSE)
# animation_opts(frame = 750, easing = "exp-in-out", redraw = FALSE)
# animation_opts(frame = 750, easing = "circle-in-out", redraw = FALSE)
You can find more easing options here between lines 68 and 103.
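Beyond animation_opts(), the plotly package also provides animation_button() and animation_slider() for adjusting the play button and the frame slider. The sketch below is a minimal example under the assumption that the mpg data from ggplot2 is used; the button position, the slider prefix, and the name fig_anim are illustrative choices.

```r
library(plotly)
data(mpg, package = "ggplot2")

fig_anim <- plot_ly(mpg, x = ~displ, y = ~hwy,
                    type = "scatter", mode = "markers",
                    color = ~factor(drv), frame = ~cyl) %>%
  animation_opts(frame = 750, easing = "linear", redraw = FALSE) %>%
  # Move the play button to the bottom-right corner
  animation_button(x = 1, xanchor = "right", y = 0, yanchor = "bottom") %>%
  # Prefix the value shown above the slider
  animation_slider(currentvalue = list(prefix = "Cylinders: "))
```

Labeling the slider this way makes it clear which variable the frames step through, which matters when that variable is not time.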
Question 4: The code below reads in data collected by Mother Jones that aims to document mass shootings in the United States (this dataset contains entries until 2019). For this question, create an animated plot that displays the cumulative total number of fatalities and injuries that have occurred in these shootings over time. An example of what this animation should look like is given below; yours should be similar, but it doesn’t need to be identical.
shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')
Hint: Two hints. First, plan out exactly what your data needs to look like to create this animation. What should the columns be? Work backwards from there to determine how to manipulate your data. Second, see ?cumsum
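As a small illustration of what cumsum() does, here is a sketch on made-up yearly counts. The toy data frame and its column names are invented for this example and are not taken from the shootings data.

```r
# Made-up yearly counts, not taken from the shootings data
toy <- data.frame(year = c(2001, 2002, 2003, 2004),
                  count = c(2, 5, 1, 4))

# cumsum() returns the running total of a numeric vector
toy$running_total <- cumsum(toy$count)
toy$running_total
# 2 7 8 12
```

Note that the running total depends on row order, so the data should be sorted by time before cumsum() is applied.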