dplyr and ggplot2
## Copy this into your r setup chunk
knitr::opts_chunk$set(echo = TRUE,
fig.align = 'center',
fig.width = 4,
fig.height = 4,
message = FALSE,
warning = FALSE)
library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
This lab serves primarily as a set of practice problems to hone our dplyr and ggplot2 skills.
This question uses the penguins dataset.
penguins <- read.csv("https://collinn.github.io/data/penguins.csv")
Using the mutate() function, create a new variable in this dataset called size that takes the value "Small" if the penguin’s body mass is less than the median body mass and "Large" if it is greater than or equal to the median. Recall the function ifelse(), which may be helpful here (?ifelse).
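As a rough sketch of the mutate() plus ifelse() pattern, the block below splits a variable at its median. It uses the built-in mtcars data as a stand-in rather than the penguins data, and the names cars and efficiency are made up for illustration.

```r
library(dplyr)

# mtcars ships with R; it stands in for the lab data here.
# ifelse(test, yes, no) returns "Low" where the test is TRUE and "High" otherwise.
cars <- mutate(mtcars,
               efficiency = ifelse(mpg < median(mpg), "Low", "High"))

# Count how many cars fall into each group
table(cars$efficiency)
```

Values exactly equal to the median land in the "High" group because the test uses a strict less-than.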
Create a box plot illustrating the relationship between bill depth, size, and sex. Do small or large penguins tend to have deeper bills? Which group appears to have the greatest amount of variability?
Often when plotting categorical variables, we wish to change the order in which they appear on a graph. By default, they appear in alphabetical order. To change this, we will use the function factor() along with the argument levels: factor() is a function that ensures our variable is categorical, while the levels argument allows us to specify the new order. For example, the code below re-levels the size variable so that "Small" comes before "Large":
penguins <- mutate(penguins, size2 = factor(size, levels = c("Small", "Large")))
Note: I have chosen to call this new variable size2 so that I can use both it and the original size variable unchanged. In general, however, we will typically choose to overwrite the original variable with the new ordering.
Recreate the box plot from (2.) with size2 and compare it to the original. Has it changed as you expected? (Note: this method of “re-leveling” will work for all categorical variables.)
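To see what the levels argument changes, it can help to inspect the level order directly with levels(). The sketch below uses a small made-up vector (the names group and group2 are hypothetical) rather than the lab data.

```r
# Toy data; the values and variable names are made up for illustration
group <- c("Low", "High", "Low", "High")

# Default: levels are sorted alphabetically
levels(factor(group))
# "High" "Low"

# Explicit order via the levels argument
group2 <- factor(group, levels = c("Low", "High"))
levels(group2)
# "Low" "High"
```

ggplot2 uses this level order when it lays out categories on an axis or in a legend.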
This question will use the titanic dataset.
## Copy this code to recreate the titanic dataset
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL
Recreate the plot below by re-factoring the levels as we did in Question 1. Pay special note to the order of values presented for Sex, Class, and Survived.
You can modify the colors by adding scale_fill_brewer(palette = "Greens") to your ggplot. See ?scale_fill_brewer to check out other palettes, if you would like.
This question uses the college dataset.
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Here, we are interested in answering the question: “Which region has the largest outlier for median ACT relative to the region?”
First, standardize the ACT_median variable by region and then create the appropriate box plot showing the relationship between region and standardized median ACT score (Hint: how do you find standardized values of a variable?). Which region appears to have the greatest outlier? Consider using mutate() and group_by() in this problem.
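One way to read "standardize by group" is to compute a z-score using each group's own mean and standard deviation. The sketch below shows that pattern on the built-in iris data as a stand-in; the names iris_z, z_length, and check are made up for illustration.

```r
library(dplyr)

# iris ships with R; it stands in for the lab data here.
# Because of group_by(), mean() and sd() are computed within each species.
iris_z <- iris %>%
  group_by(Species) %>%
  mutate(z_length = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length)) %>%
  ungroup()

# Sanity check: within each species the z-scores should have
# mean 0 and standard deviation 1 (up to rounding error)
check <- iris_z %>%
  group_by(Species) %>%
  summarize(mean_z = mean(z_length), sd_z = sd(z_length))
check
```

If the data contain missing values, mean() and sd() would need na.rm = TRUE; that is not an issue for iris.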
Next, create a summary of the college dataset that shows the maximum and minimum median ACT values for each region. Which region has the largest median ACT? Which has the smallest? Consider using group_by() and summarize().
Compare your answers from (2) with what you found in (1). Does having the largest/smallest ACT scores necessarily result in having outliers? Explain what is happening.
This question uses the tips dataset. It will also require using dplyr functions, though this time no hints are provided as to which functions to use.
tips <- read.csv("https://collinn.github.io/data/tips.csv")
Begin by using the appropriate dplyr functions to filter the dataset to only include bills that occurred on either Saturday or Sunday and in which the size of the party was greater than 1.
Create a new variable, tipPercent, that finds what percentage of the total bill was offered as a tip.
Construct a summary that includes the average tipping percentage of an individual by sex and smoker status. Along with this summary, also include the number of individuals making up each group (n()).
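As a sketch of how these verbs chain together, the block below filters rows on two conditions, derives a new variable, and then summarizes by two grouping variables with a count from n(). It uses the built-in mtcars data as a stand-in; the names cyl_summary, hp_per_wt, and avg_hp_per_wt are made up for illustration.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  # Keep rows that satisfy both conditions
  filter(gear %in% c(3, 4), hp > 60) %>%
  # Derive a new variable from existing columns
  mutate(hp_per_wt = hp / wt) %>%
  # One summary row per combination of cyl and am
  group_by(cyl, am) %>%
  summarize(avg_hp_per_wt = mean(hp_per_wt),
            n = n(),
            .groups = "drop")
cyl_summary
```

The n column should add up to the number of rows that survive the filter() step, which is a quick way to confirm that no group was dropped.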
Create the appropriate plot to illustrate the relationship between total bill (x-axis) and tip percent (y-axis). What type of relationship do you see? Would Pearson’s correlation or Spearman’s correlation be more appropriate to describe this?
This section will cover some common extensions to our basic ggplots. Briefly, this will include:
geom_jitter()
Each section will serve to introduce a new idea or technique, each relatively simple in isolation. Our ultimate goal, however, will be the integration of data summaries into our plots.
There are often times when we will want to create an arrangement of plots that can be presented adjacent to one another. A convenient method for doing so uses the gridExtra package, which can be installed by copying and pasting the code below into your console. Once the package is installed, you only ever need to load it with the library() function, just as we do for ggplot2 and dplyr.
## Copy this and run in the console once to install package
# then delete -- DO NOT PUT IN YOUR RMARKDOWN FILE
if (!require(gridExtra)) {
install.packages("gridExtra")
}
## Once installed, load package like you do with others
library(gridExtra)
Just like with dplyr and ggplot2, you will need to put this at the top of your RMD document every time you want to use it.
Typically when we create a ggplot in an R code chunk it will immediately print out the associated graphic:
ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point()
However, just as with any other object in R (linear models, data.frames, etc.), we can assign our ggplot to a variable name; doing so will stop the plot from printing out immediately:
## Note that nothing prints after this code chunk
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point()
If we do want to display the plot, it is as simple as writing out the variable name:
p1
In order to arrange our plots side-by-side, we need to start by saving each of them to a new variable. We can then use the grid.arrange() function included in the gridExtra package to specify how we would like them aligned. Here, we revisit a problem from a previous lab where we compare how the color aesthetic works when a quantitative variable is coded as a factor. Placing them side-by-side provides for easy comparison:
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point() +
theme(legend.position = "bottom") # move legend to bottom
p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) +
geom_point() +
theme(legend.position = "bottom")
## Print them out in a single row
grid.arrange(p1, p2, nrow = 1)
You may find that your plots are a bit scrunched. You can remedy this by using fig.width = 8 (or any other number) as a chunk option:
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point() +
theme(legend.position = "bottom") # move legend to bottom
p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) +
geom_point() +
theme(legend.position = "bottom")
## This should look a bit cleaner
grid.arrange(p1, p2, nrow = 1)
Using grid.arrange(), reproduce the two plots you created in Q1 with size and size2 to show side-by-side how using factor() and levels can allow you to rearrange your plots.
The logic of ggplot is such that we can map variables in our datasets to visual aspects of the plots with the use of the aesthetics function aes(). However, it is also possible to modify plots directly by specifying values for an aesthetic outside of aes(). Consider the collection of options below:
ggplot(mpg, aes(displ, cty)) +
geom_point(color = 'blue', size = 6, shape = 17, alpha = 0.25)
We have
color, which we assign directly to be blue
size, which will change the size of the points represented
shape, specifying shape, a collection of which can be found here
alpha, taking values between 0 and 1, which determines the amount of transparency in the points
There are two additional things that are important to note here:
These values were specified in the geom_point() function, meaning that they will only apply to those points being created in that layer. If we were to add them to ggplot instead, they would apply to every single layer included in the plot (more on this soon)
Because these values were specified outside of aes(), they are interpreted directly, i.e., ‘blue’ maps to the color blue
This last point is crucial. If we were to add these modifications inside of aes(), ggplot would look (unsuccessfully) for these variables in our data frame. When they are not found, it will try to create them as variables rather than as values. For example, see what happens to color when we put it in aes():
## Adds a legend for "blue" and makes it reddish because there
## is no variable called "blue" in the dataset
ggplot(mpg, aes(displ, cty)) +
geom_point(aes(color = 'blue'))
Keep this in mind as we move into the next section.
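One way to see the difference without looking at the rendered plot is to inspect how each layer stores the color. The sketch below peeks at the layers element of the plot object; these fields are internal details of ggplot2 rather than a documented interface, so treat this as an assumption that may not hold across versions. The names p_fixed and p_mapped are made up for illustration.

```r
library(ggplot2)

# Constant set outside aes(): stored as a fixed parameter of the layer
p_fixed <- ggplot(mpg, aes(displ, cty)) +
  geom_point(color = "blue")

# Same string inside aes(): stored as a mapping, so it gets a scale and a legend
p_mapped <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(color = "blue"))

# ggplot2 standardizes the spelling to "colour" internally
p_fixed$layers[[1]]$aes_params        # fixed value for this layer
names(p_mapped$layers[[1]]$mapping)   # aesthetics mapped for this layer
```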
One of my favorite plots is created with the geom_jitter() function which, among its many uses, can be used as an alternative to the box plot for visualizing the relationship between a categorical and quantitative variable. By default, trying to create points aligned with a categorical variable will cause many of the points to overlap, as they are all aligned within their own category. The jitter function operates by applying a small amount of random variation to the x and y values, producing a nice effect:
## Using points
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_point()
## Using jitter
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter()
grid.arrange(p1, p2, nrow = 1)
We can modify the amount of variation with the height and width arguments to the jittering function. There is no right or wrong answer, so feel free to play around with it:
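One setting worth trying, offered here as a suggestion rather than a rule, is height = 0. Jittering sideways within a category only spreads the points out, while jittering vertically shifts the plotted values of the quantitative variable slightly. The name p_jit below is made up for illustration.

```r
library(ggplot2)

# width spreads points sideways within each drv category;
# height = 0 leaves the cty values exactly as recorded
p_jit <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, height = 0)
p_jit
```

Because the jitter is random, the horizontal positions will differ each time the chunk is run.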
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_point()
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter()
p3 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, color = 'dodgerblue')
grid.arrange(p1, p2, p3, nrow = 1)
The plots created by geom_jitter() offer a nice alternative to those created by geom_boxplot() as, instead of summarizing the data into median, Q1, Q3, and outliers, they allow us to see directly the distribution and density of our observations. Of course, there is nothing to stop us from combining these ideas to create more interesting plots: because each geom represents a layer, we can readily add multiple layers to a plot. This allows us to nicely visualize how the geom_jitter() and geom_boxplot() functions compare. Note, however, that the layers are placed in order, with subsequent ones being drawn “on top of” the former. The first plot below adds geom_boxplot() second, so the boxes are placed on top of the jitter points; the second plot reverses the order, so the points sit on top of the boxes.
## Remove outliers to make black dots go away
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue") +
geom_boxplot(fill = "pink", outliers = FALSE)
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_boxplot(fill = "pink", outliers = FALSE) +
geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue")
grid.arrange(p1, p2, nrow = 1)
Notice how in this case we specified color, fill, and alpha all outside of the aes() function. As such, ggplot was able to interpret these values directly.
We will conclude our discussion of ggplot in this lab with the introduction of data summaries in our plots. To do so, let’s first consider an aspect of ggplot that we have, until now, taken for granted. Within our ggplot() function, we specify both a data argument and an aes() argument, which creates a map from the variables in our data to the aesthetics in the plot. Each time we have added a layer (geom_jitter(), geom_boxplot()), it has inherited both the data and the values of the aesthetics. However, we could also specify data and aes() in each layer instead. The code below creates two identical plots:
(NOTE: When doing this, we have to specify data = mpg in the layer or else we will get an error.)
## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
geom_boxplot()
## Values *not* inherited and are defined in layer
p2 <- ggplot() +
geom_boxplot(data = mpg, aes(drv, cty))
grid.arrange(p1, p2, nrow = 1)
This will only make a difference if we include an additional layer. As we did in the previous section, let’s try to add points associated with geom_jitter():
## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
geom_boxplot() +
geom_jitter(width = 0.15, alpha = 0.5)
## Values *not* inherited and are defined in layer
p2 <- ggplot() +
geom_boxplot(data = mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, alpha = 0.5)
grid.arrange(p1, p2, nrow = 1)
Now, our second plot has no points: because there is no dataset or aesthetics to inherit from ggplot(), the geom_jitter() layer does not produce anything. We can rectify this by also specifying the data and aesthetics in that layer, giving us the desired plot:
## Add data and aes to geom_jitter
ggplot() +
geom_boxplot(data = mpg, aes(drv, cty)) +
geom_jitter(data = mpg, aes(drv, cty), width = 0.15, alpha = 0.5)
Why would we care about this when it seems like so much extra work? Because it allows us to create layers that contain completely different datasets.
For example, consider the boxplots above. We know from earlier in the semester that distributions that are skewed have mean values that differ from the median. We can use this technique to illustrate that visually.
If we wanted to plot the mean values for cty for each value of drv, we would start by creating a data summary with dplyr:
drv_summary <- mpg %>% group_by(drv) %>%
summarize(meanCty = mean(cty))
print(drv_summary)
## # A tibble: 3 × 2
## drv meanCty
## <chr> <dbl>
## 1 4 14.3
## 2 f 20.0
## 3 r 14.1
The new dataset drv_summary now has a variable drv, indicating drive train, and the variable meanCty, giving the average city miles per gallon. If we wanted to include just this in a plot, we could do so as follows:
## shape = 15 will give me filled squares instead of circles
ggplot(drv_summary, aes(meanCty, drv)) +
geom_point(color = 'red', size = 4, shape = 15)
Just as easily, we could include this with the box plots we created above. This time, we add an additional layer and specify data and aes() in the new layer:
ggplot() +
geom_boxplot(data = mpg, aes(cty, drv)) +
geom_point(data = drv_summary, aes(meanCty, drv),
color = 'red', size = 4, shape = 15)
Now, it becomes immediately clear that for 4wd and fwd, the distribution is skewed right (the mean is greater than the median), while for rwd, the distribution is skewed left (the mean is less than the median).
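The visual comparison can also be checked numerically by putting the mean and median side by side for each drive train. This is a small sketch using the same mpg data; the names skew_check, medianCty, and meanAboveMedian are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

skew_check <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty),
            medianCty = median(cty)) %>%
  # TRUE suggests right skew, FALSE suggests left skew (or symmetry)
  mutate(meanAboveMedian = meanCty > medianCty)
skew_check
```

Comparing the mean to the median is only a rough indicator of skew, so it is best read alongside the box plots rather than in place of them.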
The ability to add data summaries to our plots gives us a way to present standard visualizations to a reader that allow us to highlight particular attributes that are not typically represented by the disaggregated data alone.
To create a colored point with a black outline, we need to modify the shape so that we can use both a color (outline) and fill (inside) aesthetic. Below is an example:
df <- data.frame(x = 1, y = 1)
ggplot(df, aes(x, y)) +
geom_point(shape = 21, color = 'black', fill = 'coral', size = 8)
This will be useful for the next questions.
Recreate the following plot that shows the mean and median net tuition for each of the regions in the college dataset.
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Note that:
height = 0.15 and alpha = 0.25
size = 3.5
Recreate this plot from the ecological correlation lecture (we skipped this, but that doesn’t change our ability to produce this graph), where the individual points are the schools and the colored dots get their x and y values from the regions’ average admission rate and median debt, respectively.
Note that:
alpha = 0.25 for the black dots
size = 4 for the colored dots
theme(legend.position = "bottom") to move the legend