ggplot2
ggplot
ggplot2
# load the following packages
# install.packages("maps")
# install.packages("maptools")
library(ggplot2)
library(dplyr)
library(maps)
library(maptools)
# Also going to set ggplot theme for entire document
theme_set(theme_bw())
This lab is going to be focused on creating maps with the
ggplot2
package, or more generally, plotting spatial data.
In doing so, we will install two additional packages: first, the
maps
package which contains a number of ancillary functions
and datasets to pair with the ggplot2
package for creating
maps. The second is the maptools
package which can be used
to read in and manipulate particular types of map files known as shapefiles.
Unfortunately, the maptools
package is in the process of
being deprecated, with no immediately clear successor offering the same
utility and convenience. For those interested, there is a blog post
detailing what this process will look like in the coming months. The
primary issue is that while external libraries and data structures
continue to evolve, there will be no maintainer to ensure that the
maptools
package continues to work correctly with these
changes. Fortunately, however, none of this will stop us from working
with maptools
now and into the immediate future.
ggplot2
Here, we briefly introduce two new aspects of the
ggplot2
package, one of which we alluded to in the previous
lab.
The first of these is the group
aesthetic. You may have
noticed with previous aesthetics that they appear to create an
underlying grouping structure so that all observations within a
particular group share the relevant aesthetic. For example, here is the
Orange
dataset (built into R) showing the circumference of
orange five different orange trees as they age in days. By setting
color = Tree
, not only does each tree get its own color,
but the ggplot understands which observations are supposed to be linked
together in geom_line()
. We can establish this same
grouping structure without adding any additional aesthetics by
using the aesthetic `group
data(Orange)
## Using color
ggplot(Orange, aes(age, circumference, color = Tree)) + geom_line()
## Using group
ggplot(Orange, aes(age, circumference, group = Tree)) + geom_line()
The second part is the local specification of data within
layers of a ggplot. Just as we are able to locally specify aesthetics
for each layer, we can also specify a new data frame from which to draw
the aesthetics. This is particularly useful in situations in which all
of the components of a graphic don’t necessarily make sense being in the
same data frame. Here, for example, we create the data frame
omean
, which computes the average circumference of each of
the orange trees at each time. As each observation (row) here is
fundamentally different than the observations found in
Orange
, it makes sense that it should exist it its own
object.
## Compute mean across trees
omean <- Orange %>% group_by(age) %>%
summarize(meancirc = mean(circumference))
## Here, we limit the group aesthetic to just the first layer
ggplot(Orange, aes(age, circumference)) +
geom_line(aes(group = Tree)) +
geom_line(data = omean, mapping = aes(x = age, y = meancirc),
color = 'red', linewidth = 2)
## Here, we overwrite the inheritance by setting group = NULL
ggplot(Orange, aes(age, circumference, group = Tree)) +
geom_line() +
geom_line(data = omean, aes(x = age, y = meancirc, group = NULL),
color = 'red', linewidth = 2)
Note in the code above that we demonstrated two ways to compute the
same graphic. Just as before, each layer of a ggplot will inherit
aesthetics from the main ggplot
function. In this example,
there is no variable Tree
with which we can group
omean
. Consequently, we wither have to specify the group
aesthetic in the first geom_line()
using the data from
Orange
, or we have to manualy specify
group = NULL
in the second geom_line()
,
indicating that we do not wish to inherit from above.
These in hand, we are ready to introduce spatial data with
ggplot
.
ggplot
There are three main geometric elements that are commonly used to represent spatial data:
Aesthetics like color, size, and brightness can be used to encode additional information using these geometric elements.
A choropleth map uses colored polygons to
represent values corresponding variables measured across geographic
units. We begin by furnishing a dataset that will be used to construct
our maps. This comes from the map_data
function in
ggplot
, along with "state"
map data provided
by the maps
package. Note that the color aesthetic relates
to the color of the lines of the polygon, while fill relates to
the filled in color
## Get US state polygons from the "maps" package
states <- map_data("state")
## Graph these polygons
ggplot(data = states, aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "lightblue")
# This looks so weird
# ggplot(data = states, aes(x = -long, y = lat, group = group)) +
# geom_polygon(color = "black", fill = "lightblue")
Looking at the state_polys
dataset gives us some
insight:
head(states)
## long lat group order region subregion
## 1 -87.462 30.390 1 1 alabama <NA>
## 2 -87.485 30.372 1 2 alabama <NA>
## 3 -87.525 30.372 1 3 alabama <NA>
## 4 -87.531 30.332 1 4 alabama <NA>
## 5 -87.571 30.327 1 5 alabama <NA>
## 6 -87.588 30.327 1 6 alabama <NA>
We see that each region is defined by a series of ordered vertices, each with its own x and y (long and lat) coordinates. The group and order of these vertices determines how the edges of that region’s polygons are drawn; if these variables are altered during a data processing step like merging and joining, the polygons will not be drawn correctly.
While maps of a single color are neat, they are vastly more
interesting when the color is mapped to an additional variable. As our
states
data only includes spatial information, any
additional variables we have will need to be collected via joins.
## USArrests data from the "datasets" package
data("USArrests")
## Create a data frame to join (tolower so that values match between sets)
stateVal <- data.frame(State = tolower(rownames(USArrests)),
MurderRate = USArrests$Murder)
## Add MurderRate to the state polygons
full_data <- inner_join(x = states, y = stateVal, by = c("region" = "State"))
## Color the polygons using a fill variable
ggplot(data = full_data, aes(x = long, y = lat, group = group, fill = MurderRate)) +
geom_polygon(color = "black") +
scale_fill_continuous(type = "gradient", high = "purple", low = "green")
As we demonstrated at the beginning of the lab, we can use additional
datasets to augment an existing plot. Here, we use city data from the
maps
package, excluding Hawaii and Alaska as they don’t
appear in our states plot
data("us.cities")
cities <- us.cities %>% filter(!(country.etc %in% c("AK", "HI")))
ggplot(data = full_data, aes(x = long, y = lat)) +
geom_polygon(color = "black", aes(group = group, fill = MurderRate)) +
scale_fill_continuous(type = "gradient", high = "purple", low = "green") +
geom_point(data = cities, aes(x = long, y = lat), size = 0.2, color = 'red')
For the following question you should use the “state” and “county”
polygons found in the maps
package, as well as the
us.cities
data frame:
states <- map_data("state")
counties <- map_data("county")
data("us.cities")
Question 1: Filter the us.cities
data
to include only state capitals (capital == 2
) in the
contiguous United States (exclude cities in AK and HI). Then create a
map displaying state polygons (defined by black borders with
fill = NA
, linewidth = 0.3
and
alpha = 0.3
) and counties (defined by blue borders with
fill = "white"
, linewidth = 0.1
and
alpha = 0.5
) that also includes state capitals as points
(with color = "brown1"
). Keep in mind how layers operate in
ggplot when organizing your code.
Your end result should be similar to the map provided below:
Question 2 Included in the ggplot2
package is the dataset midwest
which includes county level
data for five states: Illinois, Indiana, Michigan, Ohio, and Wisconsin.
Using both the county and state level data from the previous questions,
make the necessary subsets and joins to create a choropleth illustrating
4 year college attainment at the county level. The final product should
look a bit like the plot below.
Note: The “state” variable in midwest
is stored as an
abbreviation rather than the full state name. Additionally, the counties
are all uppercase. Use the dataset provided below as an intermediary
needed to make the necessary joins.
state_abr <- read.csv("https://collinn.github.io/data/state_abrv.csv")
Additional info:
linewidth = 0.2
,
alpha = 0.75
linewidth = 0.5
scale_fill_continuous(type = "viridis", option = "H")
(Turbo)Though, in general, feel free to play around with different options to get your choropleth looking how you want.
While the polygon boundaries give by map_data()
are
simple enough for us to understand, they are not representative of the
bulk of spatial data that are used in real-world applications. In many
instances, geographic information is stored in a shapefile, a
complex data structure that consists of, at minimum, several separate
files:
For this portion of the lab, we are going to download and extract data from this folder which was originally obtained from Natural Earth Data. Notice that the extracted folder contains .shp, .shx, and .dbf files, along with a few others.
To begin, we’ll read this in with the readShapeSpatial()
function from maptools
package. A few things:
world <- readShapeSpatial("world_shape/ne_50m_admin_0_countries")
You’ll likely get several warning messages indicating that other packages may be preferred. We can ignore these for now.
Nearly every other object we have worked with in R (including ggplots
themselves) have an underlying data structure that is essentially a
list, meaning that all of the subset and indexing rules with which we
are familiar apply. The world
object, by contrast, is
something entirely
different. This is something we only concern ourselves with to the
degree that we use a different syntax to manipulate it.
For example, using the str()
function to print the
structure of world
, we see that it contains 5 “slots” which
are accessed with @
rather than $
str(world, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 242 obs. of 168 variables:
## .. ..- attr(*, "data_types")= chr [1:168] "C" "N" "N" "C" ...
## ..@ polygons :List of 242
## ..@ plotOrder : int [1:242] 240 76 203 17 196 211 226 182 134 232 ...
## ..@ bbox : num [1:2, 1:2] -180 -90 180 83.6
## .. ..- attr(*, "dimnames")=List of 2
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## ..$ comment: chr "FALSE"
To convert this shape file to something that ggplot
can
work with, we use fortify()
which converts this into a data
frame of polygon coordinates. From there, we can manipulate it as we did
previously:
worldpoly <- fortify(world) # It might complain about fortify too. Ignore it
ggplot(data = worldpoly, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black")
Note that the fortify()
function only creates a
data frame with polygon data. To use any of the additional data from our
shape file, we have to extract it with @
.
world@data
is missing an id column that would allow it to
be joined to worldpoly
, so we add this and then join the
two together
worlddat <- world@data
# This has 168 columns so we won't peek
#head(worlddat)
# id variable in world poly goes from 0 to 241 and it's a character vector. idk
worlddat$id <- as.character((1:nrow(worlddat)-1))
# Join by id
worldfull <- left_join(worldpoly, worlddat, by = "id")
## BRICS countries
BRICS <- c("Brazil", "Russia", "India", "China", "South Africa")
# Now we can use world data in our aesthetics
ggplot(data = worldfull, aes(long, lat, group = group)) +
geom_polygon(aes(fill = ifelse(NAME_EN %in% BRICS, "BRICS", "No BRICS")),
color = "black") + guides(fill = "none")
Question 3 The variable “GDP_MD` (already present in the merged data frame created above) records the gross domestic product (GDP) of each country measured in millions of US dollars for the year 2019. We can attempt to make a choropleth using this data, but given the extreme disparities in GDP between countries, the scales make this almost useless:
Instead, it is sometimes preferential to create discrete categories
out of continuous variables. For this question, use the
cut_number()
function in ggplot2
to create
groups for each of the four quartiles of the “GDP_MD” variable. See
?cut_number
for creating cuts and labels. Finally, using
the appropriate “brewer” function, set the palette to be
"YlGnBu"
(or any of the other sequential palettes). Your
resulting graph should look something like this.
This portion of the lab is centered around the leaflet
package which, like Plotly, is an external collection of open-source
software libraries that have been integrated into the R ecosystem.
Rather than copy-and-paste material from other sources, I will
instead here direct you to two places that you can get more information
on using the leaflet
package. The first is the course page
from Professor Miller, linked
here, which offers a quick crash-course on using Leaflet in R. The
second is a more comprehensive tutorial put out by the Leaflet team
itself, linked here.
This is all completely optional, and no explicit credit will be given
for completing any exercises. However, you may find either of the linked
pages helpful for your own edification, and you may find something of
interest for inclusion in your own R Shiny project after the break.