dplyr and ggplot2
library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
This lab serves primarily as a set of practice problems to hone our dplyr and ggplot skills.
This question uses the penguins dataset.
penguins <- read.csv("https://collinn.github.io/data/penguins.csv")
Using the appropriate dplyr functions, first filter the penguin dataset to only include observations in which the variable sex takes the values “male” and “female” (a handful of observations are simply missing entries). Then, create a new variable in this dataset called size that takes the value “Small” if the penguin’s body mass is less than the median and the value “Big” if the body mass is greater than or equal to the median.
Create a boxplot illustrating the relationship between bill depth, size, and sex. Do small or large penguins tend to have deeper bills? Which group appears to have the greatest amount of variability?
The code below will take the variable size and re-code it as a factor variable. The utility here comes in the levels argument: doing this before creating a plot allows you to modify the order in which the values appear in a plot:
penguins <- mutate(penguins, size = factor(size, levels = c("Small", "Big")))
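As a minimal sketch of why the levels argument matters, the base R example below uses a small made-up vector (not the penguins data; the names size_toy, default_order, and custom_order are only for illustration). By default factor() sorts the levels alphabetically, while levels fixes the order explicitly, and ggplot2 draws categories in level order.

```r
## Hypothetical toy vector standing in for the size variable
size_toy <- c("Small", "Big", "Big", "Small")

## Default: levels are sorted alphabetically, so "Big" comes first
default_order <- factor(size_toy)
levels(default_order)

## Explicit: levels follow the order we supply, so "Small" comes first
custom_order <- factor(size_toy, levels = c("Small", "Big"))
levels(custom_order)
```

The data values themselves are unchanged; only the ordering used for axes and legends differs.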
Recreate the plot and compare it to what you saw in (2). What has changed? Note that this will work for all future categorical variables when creating ggplots.
This question will use the titanic dataset.
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL
Recreate the plot below by refactoring the levels as we did in Question 1. Pay special note to the order of values presented for Sex, Class, and Survived.
You can modify the colors by adding scale_fill_brewer(palette = "Greens") to your ggplot. See ?scale_fill_brewer() to check out other palettes, if you would like.
This question uses the college dataset.
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Here, we are interested in answering the question: “Which region has the largest outlier, relative to the region, for median ACT score?”
First, standardize the ACT_median variable by region and then create the appropriate boxplot showing the relationship between region and standardized median ACT score (Hint: how do you find standardized values of a variable?). Which region appears to have the greatest outlier?
Next, create a summary of the college dataset that shows the maximum and minimum median ACT values for each region. Which region has the largest median ACT? Which has the smallest?
Compare your answers from (2) with what you found in (1). Does having the largest/smallest ACT values necessarily result in having outliers? Explain what is happening.
This question uses the tips dataset.
tips <- read.csv("https://collinn.github.io/data/tips.csv")
Begin by using the appropriate dplyr functions to filter the dataset to only include bills that occurred on either Saturday or Sunday and in which the size of the party was greater than 1.
Create a new variable, tipPercent, that finds what percentage of the total bill was offered as a tip.
Construct a summary that includes the average tipping percentage of an individual by sex and smoker status. Along with this summary, also include the number of individuals making up each group (n()).
Create the appropriate plot to illustrate the relationship between total bill (x-axis) and tip percent (y-axis). What type of relationship do you see? Would Pearson’s correlation or Spearman’s correlation be more appropriate to describe this?
This section will cover some common extensions to our basic ggplots. Briefly, this will include:
geom_jitter()
Each section will serve to introduce a new idea or technique, each relatively simple in isolation. Our ultimate goal, however, will be the integration of data summaries into our plots.
There are often times where we will want to create an arrangement of plots that can be presented adjacent to one another. A convenient method for doing so utilizes the gridExtra package, which can be installed by copying and pasting the code below into your console. Once the package is installed, you only ever need to load the package with the library() function, just as we do for ggplot2 and dplyr.
## Copy this and run in the console once to install package
# then delete -- DO NOT PUT IN YOUR RMARKDOWN FILE
if (!require(gridExtra)) {
install.packages("gridExtra")
}
## Once installed, load package like you do with others
library(gridExtra)
Typically when we create a ggplot in an R code chunk it will immediately print out the associated graphic:
ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point()
However, just as with any other object in R (linear models, data.frames, etc.), we can assign our ggplot to a variable name; doing so will stop the plot from printing out immediately:
## Note that nothing prints after this code chunk
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point()
If we do want to display the plot, it is as simple as writing out the variable name:
p1
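Typing the variable name works because R automatically prints a value entered at the top level of a chunk. As a small sketch (the list name plots is only for illustration), the example below shows that a saved plot is an ordinary R object, and that inside a for loop, where automatic printing does not happen, we need to call print() ourselves.

```r
library(ggplot2)

p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
  geom_point()

## The saved plot is a regular R object that we can inspect or modify
inherits(p1, "ggplot")

## Hypothetical list of plots; the second moves the legend to the bottom
plots <- list(p1, p1 + theme(legend.position = "bottom"))

## Inside a loop nothing is displayed unless we call print() explicitly
for (p in plots) {
  print(p)
}
```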
In order to arrange our plots side-by-side, we need to start by saving each of them to a new variable. We can then use the grid.arrange() function included in the gridExtra package to specify how we would like them aligned. Here, we revisit a problem from a previous lab where we compare how the color aesthetic works when a quantitative variable is coded as a factor. Placing them side-by-side provides for easy comparison:
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point() +
theme(legend.position = "bottom") # move legend to bottom
p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) +
geom_point() +
theme(legend.position = "bottom")
## Print them out in a single row
grid.arrange(p1, p2, nrow = 1)
You may find that your plots are a bit scrunched. You can remedy this by using fig.width = 8 (or any other number) as a chunk option, for example {r, fig.width = 8} in the chunk header:
p1 <- ggplot(mpg, aes(displ, cty, color = cyl)) +
geom_point() +
theme(legend.position = "bottom") # move legend to bottom
p2 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) +
geom_point() +
theme(legend.position = "bottom")
## This should look a bit cleaner
grid.arrange(p1, p2, nrow = 1)
Using grid.arrange(), reproduce the two plots you created in Q1 to show side-by-side how using factor() and levels can allow you to rearrange your plots.
The logic of ggplot is such that we can map variables in our datasets to visual aspects of the plots with the use of the aesthetics function aes(). However, it is also possible to modify plots directly by specifying values for an aesthetic outside of aes(). Consider the collection of options below:
ggplot(mpg, aes(displ, cty)) +
geom_point(color = 'blue', size = 6, shape = 17, alpha = 0.25)
We have
color, which we assign directly to be blue
size, which will change the size of the points represented
shape, specifying shape, a collection of which can be found here
alpha, taking values between 0 and 1, which determines the amount of transparency

There are two additional things that are important to note here:
These values were specified inside the geom_point() function, meaning that they will only apply to those points being created in that layer. If we were to add them to ggplot instead, they would apply to every single layer included in the plot (more on this soon)
Because they were specified outside of aes(), they are interpreted directly, i.e., ‘blue’ maps to the color blue

This last point is crucial. If we were to add these modifications inside of aes(), ggplot would look (unsuccessfully) for these variables in our data frame. When they are not found, it treats them as new variables rather than as literal values. For example, see what happens to color when we put it in aes():
## Adds a legend for "blue" and makes it reddish
ggplot(mpg, aes(displ, cty)) +
geom_point(aes(color = 'blue'))
Keep this in mind as we move into the next section.
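To see the difference more concretely, here is a minimal sketch that uses layer_data() from ggplot2 to look at the colors each layer actually draws. The object names (p_constant, p_mapped, and so on) are only for illustration.

```r
library(ggplot2)

## Value set outside aes(): every point gets the literal color "blue"
p_constant <- ggplot(mpg, aes(displ, cty)) +
  geom_point(color = "blue")

## Value placed inside aes(): "blue" is treated as a one-level variable
p_mapped <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(color = "blue"))

## layer_data() returns the data used to draw a given layer
constant_colors <- unique(layer_data(p_constant, 1)$colour)
mapped_colors <- unique(layer_data(p_mapped, 1)$colour)

constant_colors  # the literal value we supplied
mapped_colors    # a single color chosen by the default discrete color scale
```

In the second plot the color comes from the scale rather than from the word we typed, which is also why a legend appears.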
One of my favorite plots is created with the geom_jitter() function which, among its many uses, can be used as an alternative to the boxplot for visualizing the relationship between a categorical and quantitative variable. By default, trying to create points aligning with a categorical variable will cause many of the points to overlap, as they are all aligned within their own category. The jitter function operates by applying a small amount of random variation to the x and y values, producing a nice effect:
## Using points
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_point()
## Using jitter
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter()
grid.arrange(p1, p2, nrow = 1)
We can modify the amount of variation with the height and width arguments to the jittering function. There is no right or wrong answer, so feel free to play around with it:
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_point()
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter()
p3 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, color = 'dodgerblue')
grid.arrange(p1, p2, p3, nrow = 1)
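One thing to keep in mind is that, when height is not supplied, geom_jitter() also moves points vertically by a small default amount, so the plotted values of the quantitative variable are no longer exact. As a sketch of one possible choice (not a rule), setting height = 0 jitters only along the categorical axis; the names p_jitter and plotted_y below are only for illustration.

```r
library(ggplot2)

## width spreads points within each drv category;
## height = 0 leaves the cty values untouched
p_jitter <- ggplot(mpg, aes(drv, cty)) +
  geom_jitter(width = 0.15, height = 0)

## Compare the plotted y values with the original data
plotted_y <- layer_data(p_jitter, 1)$y
all(sort(plotted_y) == sort(mpg$cty))

p_jitter
```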
The plots created by geom_jitter() offer a nice alternative to those created by geom_boxplot() as, instead of summarizing the data into median, Q1, Q3, and outliers, they allow us to see directly the distribution and density of our observations. Of course, there is nothing to stop us from combining these ideas to create more interesting plots: because each geom represents a layer, we can readily add multiple layers to a plot. This allows us to nicely visualize how the geom_jitter() and geom_boxplot() functions compare. Note, however, that the layers are placed in order, with subsequent ones being created “on top of” the former:
## Remove outliers to make black dots go away
## (outliers = FALSE needs ggplot2 3.5.0 or newer; on older versions use outlier.shape = NA)
p1 <- ggplot(mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue") +
geom_boxplot(fill = "pink", outliers = FALSE)
p2 <- ggplot(mpg, aes(drv, cty)) +
geom_boxplot(fill = "pink", outliers = FALSE) +
geom_jitter(width = 0.15, alpha = 0.5, color = "dodgerblue")
grid.arrange(p1, p2, nrow = 1)
Notice how in this case we specified color, fill, and alpha all outside of the aes() function. As such, ggplot was able to interpret these values directly.
We will conclude our discussion of ggplot in this lab with the introduction of data summaries in our plots. To do so, let’s first consider an aspect of ggplot that we have, until now, taken for granted. Within our ggplot function, we specify both a data argument and an aes() argument which creates a map from the variables in our data to the aesthetics in the plot. Each time we have added a layer (geom_jitter(), geom_boxplot()), we have inherited both the data and the values of the aesthetics. However, we could also specify data and aes() in each layer instead. The code below creates two of the exact same plots:
(NOTE: When doing this, we have to specify data = mpg in the layer or else we will get an error)
## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
geom_boxplot()
## Values *not* inherited and are defined in layer
p2 <- ggplot() +
geom_boxplot(data = mpg, aes(drv, cty))
grid.arrange(p1, p2, nrow = 1)
This will only make a difference if we include an additional layer. As we did in the previous section, let’s try to add points associated with geom_jitter():
## Values are inherited from ggplot
p1 <- ggplot(data = mpg, aes(drv, cty)) +
geom_boxplot() +
geom_jitter(width = 0.15, alpha = 0.5)
## Values *not* inherited and are defined in layer
p2 <- ggplot() +
geom_boxplot(data = mpg, aes(drv, cty)) +
geom_jitter(width = 0.15, alpha = 0.5)
grid.arrange(p1, p2, nrow = 1)
Now, our second plot shows the boxplots but no points: because there is no dataset or aesthetics to inherit from ggplot(), geom_jitter() does not produce any points. We can rectify this by also specifying the data and aesthetics in that layer, giving us the desired plot:
## Add data and aes to geom_jitter
ggplot() +
geom_boxplot(data = mpg, aes(drv, cty)) +
geom_jitter(data = mpg, aes(drv, cty), width = 0.15, alpha = 0.5)
Why would we care about this when it seems like so much extra work? Because it allows us to create layers that contain completely different datasets.
For example, consider the boxplots above. We know from earlier in the semester that distributions that are skewed have mean values that differ from the median. We can use this technique to illustrate that visually.
If we wanted to plot the mean values for cty for each value of drv, we would start by creating a data summary with dplyr:
drv_summary <- mpg %>% group_by(drv) %>%
summarize(meanCty = mean(cty))
print(drv_summary)
## # A tibble: 3 × 2
##   drv   meanCty
##   <chr>   <dbl>
## 1 4        14.3
## 2 f        20.0
## 3 r        14.1
The new dataset drv_summary now has a variable drv, indicating drive train, and the variable meanCty, giving the average city miles per gallon. If we wanted to include just this in a plot, we could do so as follows:
## shape = 7 will give me squares instead of circles
ggplot(drv_summary, aes(drv, meanCty)) +
geom_point(color = 'red', size = 4, shape = 7)
Just as easily, we could include this with the box plots we created above. This time, we add an additional layer and specify data and aes() in the new layer:
ggplot() +
geom_boxplot(data = mpg, aes(drv, cty)) +
geom_point(data = drv_summary, aes(drv, meanCty),
color = 'red', size = 4, shape = 7)
Now, it becomes immediately clear that for 4wd and fwd, the distribution is skewed right (the mean is greater than the median), while for rwd, the distribution is skewed left (the mean is less than the median).
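The same comparison can be checked numerically. The sketch below extends the earlier summary with the median and the group size (the name drv_check and the extra columns are only for illustration), so the mean and median for each drive train can be read side by side.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

## Mean, median, and number of cars for each drive train
drv_check <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty),
            medianCty = median(cty),
            n = n())

print(drv_check)
```

A mean above the median points toward right skew, and a mean below the median points toward left skew.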
The ability to add data summaries to our plots gives us a way to present standard visualizations to a reader while highlighting particular attributes that are not typically represented.
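As a hedged alternative to specifying everything in each layer, a layer can also keep what it inherits from ggplot() and override only the pieces that differ. In the sketch below (p_means is an illustrative name), the point layer swaps in the summary data and changes only the y aesthetic, while x = drv is inherited because both datasets contain a drv variable.

```r
library(dplyr)
library(ggplot2)

drv_summary <- mpg %>%
  group_by(drv) %>%
  summarize(meanCty = mean(cty))

## data and aes() are set once in ggplot(); the point layer replaces
## the data and overrides only y, inheriting x = drv
p_means <- ggplot(mpg, aes(drv, cty)) +
  geom_boxplot() +
  geom_point(data = drv_summary, aes(y = meanCty),
             color = "red", size = 4, shape = 7)

p_means
```

This only works when the layer's dataset contains every inherited variable it still needs; otherwise, specifying data and aes() fully in each layer is the safer choice.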
To create a colored point with a black outline, we need to modify the shape so that we can use both a color (outline) and fill (inside) aesthetic. Below is an example:
df <- data.frame(x = 1, y = 1)
ggplot(df, aes(x, y)) +
geom_point(shape = 21, color = 'black', fill = 'coral', size = 8)
This will be useful for the next questions.
Recreate the following plot that shows the mean and median net tuition for each of the regions in the college dataset:
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Note that:
height = 0.15 and alpha = 0.25
size = 3.5
Recreate this plot from the ecological correlation lecture, where the individual points are the schools and the colored dots get their x and y values from the regions’ average admission rate and median debt, respectively.
Note that:
alpha = 0.25 for the black dots
size = 4 for the colored dots
theme(legend.position = "bottom") to move the legend