Loading [MathJax]/jax/output/HTML-CSS/jax.js
  • Preamble
    • Packages and datasets
    • Some extra tidbits for ggplot2
    • Spatial Data in ggplot
      • Choropleth Maps
      • Adding points or lines
    • Leaflet (Bonus)
  • Lab
    • Polygons, lines, and points
    • Shapefiles

Preamble

Packages and datasets

# load the following packages
# install.packages("maps")
# install.packages("maptools")
library(ggplot2)
library(dplyr)
library(maps)
library(maptools)

# Also going to set ggplot theme for entire document
theme_set(theme_bw())

This lab is going to be focused on creating maps with the ggplot2 package, or more generally, plotting spatial data. In doing so, we will install two additional packages: first, the maps package which contains a number of ancillary functions and datasets to pair with the ggplot2 package for creating maps. The second is the maptools package which can be used to read in and manipulate particular types of map files known as shapefiles.

Unfortunately, the maptools package is in the process of being deprecated, with no immediately clear successor offering the same utility and convenience. For those interested, there is a blog post detailing what this process will look like in the coming months. The primary issue is that while external libraries and data structures continue to evolve, there will be no maintainer to ensure that the maptools package continues to work correctly with these changes. Fortunately, however, none of this will stop us from working with maptools now and into the immediate future.

Some extra tidbits for ggplot2

Here, we briefly introduce two new aspects of the ggplot2 package, one of which we alluded to in the previous lab.

The first of these is the group aesthetic. You may have noticed with previous aesthetics that they appear to create an underlying grouping structure so that all observations within a particular group share the relevant aesthetic. For example, here is the Orange dataset (built into R) showing the circumference of orange five different orange trees as they age in days. By setting color = Tree, not only does each tree get its own color, but the ggplot understands which observations are supposed to be linked together in geom_line(). We can establish this same grouping structure without adding any additional aesthetics by using the aesthetic `group

data(Orange)

## Using color
ggplot(Orange, aes(age, circumference, color = Tree)) + geom_line() 

## Using group
ggplot(Orange, aes(age, circumference, group = Tree)) + geom_line()

The second part is the local specification of data within layers of a ggplot. Just as we are able to locally specify aesthetics for each layer, we can also specify a new data frame from which to draw the aesthetics. This is particularly useful in situations in which all of the components of a graphic don’t necessarily make sense being in the same data frame. Here, for example, we create the data frame omean, which computes the average circumference of each of the orange trees at each time. As each observation (row) here is fundamentally different than the observations found in Orange, it makes sense that it should exist it its own object.

## Compute mean across trees
omean <- Orange %>% group_by(age) %>% 
  summarize(meancirc = mean(circumference))

## Here, we  limit the group aesthetic to just the first layer
ggplot(Orange, aes(age, circumference)) + 
  geom_line(aes(group = Tree)) +
  geom_line(data = omean, mapping = aes(x = age, y = meancirc), 
            color = 'red', linewidth = 2)

## Here, we overwrite the inheritance by setting group = NULL
ggplot(Orange, aes(age, circumference, group = Tree)) + 
  geom_line() +
  geom_line(data = omean, aes(x = age, y = meancirc, group = NULL), 
            color = 'red', linewidth = 2)

Note in the code above that we demonstrated two ways to compute the same graphic. Just as before, each layer of a ggplot will inherit aesthetics from the main ggplot function. In this example, there is no variable Tree with which we can group omean. Consequently, we wither have to specify the group aesthetic in the first geom_line() using the data from Orange, or we have to manualy specify group = NULL in the second geom_line(), indicating that we do not wish to inherit from above.

These in hand, we are ready to introduce spatial data with ggplot.

Spatial Data in ggplot

There are three main geometric elements that are commonly used to represent spatial data:

  1. Points - simple coordinates showing the position of on an object (ie: cities, sites, etc.)
  2. Lines - ordered sets of coordinates that can be connected (ie: roadways, power lines, etc.)
  3. Polygons - ordered sets of coordinates that form an enclosed shape when connected (ie: states, districts, etc.)

Aesthetics like color, size, and brightness can be used to encode additional information using these geometric elements.

 

Choropleth Maps

A choropleth map uses colored polygons to represent values corresponding variables measured across geographic units. We begin by furnishing a dataset that will be used to construct our maps. This comes from the map_data function in ggplot, along with "state" map data provided by the maps package. Note that the color aesthetic relates to the color of the lines of the polygon, while fill relates to the filled in color

## Get US state polygons from the "maps" package
states <- map_data("state")  

## Graph these polygons
ggplot(data = states, aes(x = long, y = lat, group = group)) + 
  geom_polygon(color = "black", fill = "lightblue")

# This looks so weird
# ggplot(data = states, aes(x = -long, y = lat, group = group)) + 
#   geom_polygon(color = "black", fill = "lightblue")

Looking at the state_polys dataset gives us some insight:

head(states)
##      long    lat group order  region subregion
## 1 -87.462 30.390     1     1 alabama      <NA>
## 2 -87.485 30.372     1     2 alabama      <NA>
## 3 -87.525 30.372     1     3 alabama      <NA>
## 4 -87.531 30.332     1     4 alabama      <NA>
## 5 -87.571 30.327     1     5 alabama      <NA>
## 6 -87.588 30.327     1     6 alabama      <NA>

We see that each region is defined by a series of ordered vertices, each with its own x and y (long and lat) coordinates. The group and order of these vertices determines how the edges of that region’s polygons are drawn; if these variables are altered during a data processing step like merging and joining, the polygons will not be drawn correctly.

While maps of a single color are neat, they are vastly more interesting when the color is mapped to an additional variable. As our states data only includes spatial information, any additional variables we have will need to be collected via joins.

## USArrests data from the "datasets" package
data("USArrests")

## Create a data frame to join (tolower so that values match between sets)
stateVal <- data.frame(State = tolower(rownames(USArrests)), 
                           MurderRate = USArrests$Murder)

## Add MurderRate to the state polygons
full_data <- inner_join(x = states, y = stateVal, by = c("region" = "State"))

## Color the polygons using a fill variable
ggplot(data = full_data, aes(x = long, y = lat, group = group, fill = MurderRate)) +
                        geom_polygon(color = "black") + 
  scale_fill_continuous(type = "gradient", high = "purple", low = "green")

Adding points or lines

As we demonstrated at the beginning of the lab, we can use additional datasets to augment an existing plot. Here, we use city data from the maps package, excluding Hawaii and Alaska as they don’t appear in our states plot

data("us.cities")
cities <- us.cities %>% filter(!(country.etc %in% c("AK", "HI")))

ggplot(data = full_data, aes(x = long, y = lat)) +
                        geom_polygon(color = "black", aes(group = group, fill = MurderRate)) + 
  scale_fill_continuous(type = "gradient", high = "purple", low = "green") +
  geom_point(data = cities, aes(x = long, y = lat), size = 0.2, color = 'red')

Lab

Polygons, lines, and points

For the following question you should use the “state” and “county” polygons found in the maps package, as well as the us.cities data frame:

states <- map_data("state")
counties <- map_data("county")
data("us.cities")

Question 1: Filter the us.cities data to include only state capitals (capital == 2) in the contiguous United States (exclude cities in AK and HI). Then create a map displaying state polygons (defined by black borders with fill = NA, linewidth = 0.3 and alpha = 0.3) and counties (defined by blue borders with fill = "white", linewidth = 0.1 and alpha = 0.5) that also includes state capitals as points (with color = "brown1"). Keep in mind how layers operate in ggplot when organizing your code.

Your end result should be similar to the map provided below:

Question 2 Included in the ggplot2 package is the dataset midwest which includes county level data for five states: Illinois, Indiana, Michigan, Ohio, and Wisconsin. Using both the county and state level data from the previous questions, make the necessary subsets and joins to create a choropleth illustrating 4 year college attainment at the county level. The final product should look a bit like the plot below.

Note: The “state” variable in midwest is stored as an abbreviation rather than the full state name. Additionally, the counties are all uppercase. Use the dataset provided below as an intermediary needed to make the necessary joins.

state_abr <- read.csv("https://collinn.github.io/data/state_abrv.csv")

Additional info:

  1. County data, linewidth = 0.2, alpha = 0.75
  2. State data: linewidth = 0.5
  3. scale_fill_continuous(type = "viridis", option = "H") (Turbo)

Though, in general, feel free to play around with different options to get your choropleth looking how you want.

Shapefiles

While the polygon boundaries give by map_data() are simple enough for us to understand, they are not representative of the bulk of spatial data that are used in real-world applications. In many instances, geographic information is stored in a shapefile, a complex data structure that consists of, at minimum, several separate files:

  • .shp - file containing geometry of all features
  • .shx - file indexing the geometry
  • .dbf - file storing certain feature attributes in a tabular format (things like population, area, etc.)

For this portion of the lab, we are going to download and extract data from this folder which was originally obtained from Natural Earth Data. Notice that the extracted folder contains .shp, .shx, and .dbf files, along with a few others.

To begin, we’ll read this in with the readShapeSpatial() function from maptools package. A few things:

  1. You don’t need to change your working directory (in fact, you should not change it – we learned our lesson lab 9), but you do need to specify the path on your computer that the files are stored
  2. Note that you are not reading in the .shp, .shx, or .dbf file directly; rather, you are specifying the root names of the .shp, .shx, and .dbf files, all of which are the same
world <- readShapeSpatial("world_shape/ne_50m_admin_0_countries")

You’ll likely get several warning messages indicating that other packages may be preferred. We can ignore these for now.

Nearly every other object we have worked with in R (including ggplots themselves) have an underlying data structure that is essentially a list, meaning that all of the subset and indexing rules with which we are familiar apply. The world object, by contrast, is something entirely different. This is something we only concern ourselves with to the degree that we use a different syntax to manipulate it.

For example, using the str() function to print the structure of world, we see that it contains 5 “slots” which are accessed with @ rather than $

str(world, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  242 obs. of  168 variables:
##   .. ..- attr(*, "data_types")= chr [1:168] "C" "N" "N" "C" ...
##   ..@ polygons   :List of 242
##   ..@ plotOrder  : int [1:242] 240 76 203 17 196 211 226 182 134 232 ...
##   ..@ bbox       : num [1:2, 1:2] -180 -90 180 83.6
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   ..$ comment: chr "FALSE"

To convert this shape file to something that ggplot can work with, we use fortify() which converts this into a data frame of polygon coordinates. From there, we can manipulate it as we did previously:

worldpoly <- fortify(world) # It might complain about fortify too. Ignore it

ggplot(data = worldpoly, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

Note that the fortify() function only creates a data frame with polygon data. To use any of the additional data from our shape file, we have to extract it with @. world@data is missing an id column that would allow it to be joined to worldpoly, so we add this and then join the two together

worlddat <- world@data

# This has 168 columns so we won't peek
#head(worlddat)

# id variable in world poly goes from 0 to 241 and it's a character vector. idk
worlddat$id <-  as.character((1:nrow(worlddat)-1))

# Join by id
worldfull <- left_join(worldpoly, worlddat, by = "id")

## BRICS countries
BRICS <- c("Brazil", "Russia", "India", "China", "South Africa")

# Now we can use world data in our aesthetics
ggplot(data = worldfull, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = ifelse(NAME_EN %in% BRICS, "BRICS", "No BRICS")), 
               color = "black") + guides(fill = "none")

Question 3 The variable “GDP_MD` (already present in the merged data frame created above) records the gross domestic product (GDP) of each country measured in millions of US dollars for the year 2019. We can attempt to make a choropleth using this data, but given the extreme disparities in GDP between countries, the scales make this almost useless:

Instead, it is sometimes preferential to create discrete categories out of continuous variables. For this question, use the cut_number() function in ggplot2 to create groups for each of the four quartiles of the “GDP_MD” variable. See ?cut_number for creating cuts and labels. Finally, using the appropriate “brewer” function, set the palette to be "YlGnBu" (or any of the other sequential palettes). Your resulting graph should look something like this.

Leaflet (Bonus)

This portion of the lab is centered around the leaflet package which, like Plotly, is an external collection of open-source software libraries that have been integrated into the R ecosystem.

Rather than copy-and-paste material from other sources, I will instead here direct you to two places that you can get more information on using the leaflet package. The first is the course page from Professor Miller, linked here, which offers a quick crash-course on using Leaflet in R. The second is a more comprehensive tutorial put out by the Leaflet team itself, linked here. This is all completely optional, and no explicit credit will be given for completing any exercises. However, you may find either of the linked pages helpful for your own edification, and you may find something of interest for inclusion in your own R Shiny project after the break.