For this lab we will continue to use the dplyr and ggplot2 packages, as well as another data set package, nycflights13:
# install.packages("nycflights13")
library(ggplot2)
library(dplyr)
library(nycflights13)
The nycflights13 package contains a number of datasets which, together, make up a database. This data originates from the US Bureau of Transportation Statistics and pertains to all flights that departed from New York City in 2013. As before, we can load them explicitly with the data() function:
data("airlines")
data("flights")
data("planes")
data("weather")
data("airports")
The present lab is oriented around the topic of joins and merges which, simply enough, involve combining several related datasets into a single dataset. Any structured collection of related datasets we might encounter could reasonably be called a database.
Databases represent a very natural and often intuitive way of organizing large collections of data. For example, we might consider all of the data that Grinnell College has on you – social security numbers, addresses, grades and enrollment status. While it is certainly possible to keep all of this information in one large data frame, it often makes more sense to keep related data together with a way to join separate data frames if needed. This encapsulates pretty well the idea of databases: a collection of these independent but related data frames that can be linked together on demand.
To understand how datasets within a database can be joined to one another, we need to have an understanding of the concept of keys. There are two distinct types of keys that we consider:

- A primary key is a variable (or combination of variables) that uniquely identifies each observation in its own table.
- A foreign key is a variable in one table that corresponds to the primary key of another table.

Part of our task will involve finding the primary key in the dataset we are beginning with and then identifying its corresponding foreign key in the tables we wish to join.
We can investigate the concept of primary and foreign keys by considering the collection of data frames included in our airlines database. Let's look first at our planes dataset, containing metadata for all of the planes in the FAA aircraft registry:
head(planes)
## # A tibble: 6 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
As we can see, each row constitutes an individual plane, each of which can be uniquely identified by its tail number, making tailnum the primary key in this dataset. Next, we consider the data frame of airlines:
head(airlines)
## # A tibble: 6 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
In this case, the data frame contains both the full name of the airline as well as a two-letter abbreviation. While both of these columns do uniquely identify each row, we will find that other datasets in our database are more likely to contain the abbreviation, making carrier the more obvious candidate for a primary key here.
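For instance, a quick peek ahead at the flights table (loaded above) shows that it stores the two-letter code rather than the full airline name:

# flights records each airline by its two-letter carrier code
head(flights$carrier)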
There are sometimes situations in which there is no single column that uniquely identifies observations in a dataset. Consider, for example, the weather data, which includes hourly snapshots of the weather at each of the three NYC airports:
head(weather)
## # A tibble: 6 × 7
## origin year month precip pressure visib time_hour
## <chr> <int> <int> <dbl> <dbl> <dbl> <dttm>
## 1 EWR 2013 1 0 1012 10 2013-01-01 01:00:00
## 2 EWR 2013 1 0 1012. 10 2013-01-01 02:00:00
## 3 EWR 2013 1 0 1012. 10 2013-01-01 03:00:00
## 4 EWR 2013 1 0 1012. 10 2013-01-01 04:00:00
## 5 EWR 2013 1 0 1012. 10 2013-01-01 05:00:00
## 6 EWR 2013 1 0 1012. 10 2013-01-01 06:00:00
Here, each origin contains almost 9,000 observations, while each time_hour has an entry for each of the three airports. When combined, however, the two columns uniquely identify each row: the weather at a particular airport at a specific time. Together, these variables make up what is called a compound primary key.
Given a primary key in any particular table, we are then looking for a corresponding foreign key in a table that we are hoping to relate it to. With each of the primary keys we have identified so far in mind, consider the flights dataset, which includes detailed information on each of the individual flights that departed from an airport in NYC:
## Some columns hidden for space
head(flights)
## # A tibble: 6 × 7
## year carrier flight tailnum origin dest time_hour
## <int> <chr> <int> <chr> <chr> <chr> <dttm>
## 1 2013 UA 1545 N14228 EWR IAH 2013-01-01 05:00:00
## 2 2013 UA 1714 N24211 LGA IAH 2013-01-01 05:00:00
## 3 2013 AA 1141 N619AA JFK MIA 2013-01-01 05:00:00
## 4 2013 B6 725 N804JB JFK BQN 2013-01-01 05:00:00
## 5 2013 DL 461 N668DN LGA ATL 2013-01-01 06:00:00
## 6 2013 UA 1696 N39463 EWR ORD 2013-01-01 05:00:00
Here, we see a number of columns that directly relate to the other datasets we have considered so far. For example, planes$tailnum corresponds to flights$tailnum, allowing us to relate information about individual planes (model type, number of seats, manufacturer, etc.) to each of the individual flights. As each plane is associated with more than one flight, keeping all of this information in a single table would result in a tremendous amount of repeated data.
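Although the join functions themselves are not introduced until later in this lab, a brief preview sketch shows how this shared key lets us attach plane metadata to each flight. Note that both tables contain a year column with different meanings, so we supply suffixes (our own choice of names) to tell the two copies apart:

# Sketch: attach plane metadata to each flight via the shared tailnum key.
# Both tables have a "year" column (departure year vs. manufacture year),
# so the suffix argument renames the two copies
left_join(flights, planes, by = "tailnum", suffix = c("_flight", "_plane")) %>%
  select(tailnum, year_flight, year_plane, model, seats) %>%
  head()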
The web of related variables between datasets within a database is known as a relational structure, a visual representation of which is given below:

[Figure: diagram of the relational structure connecting the nycflights13 tables]
We generally operate under the expectation that a primary key is unique (though this is not always true), and we can check this with a quick pair of operations from the dplyr package. We will first count() how many instances of each unique value in a column appear and then filter() the resulting data to see if any of them have a count greater than 1. If this final table is empty, we know our key is unique:
# Unique primary key in planes
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
# time_hour is not unique for weather data
weather %>%
count(time_hour) %>%
filter(n > 1)
## # A tibble: 8,706 × 2
## time_hour n
## <dttm> <int>
## 1 2013-01-01 01:00:00 3
## 2 2013-01-01 02:00:00 3
## 3 2013-01-01 03:00:00 3
## 4 2013-01-01 04:00:00 3
## 5 2013-01-01 05:00:00 3
## 6 2013-01-01 06:00:00 3
## 7 2013-01-01 07:00:00 3
## 8 2013-01-01 08:00:00 3
## 9 2013-01-01 09:00:00 3
## 10 2013-01-01 10:00:00 3
## # ℹ 8,696 more rows
# But it is unique once we combine it with origin
weather %>%
count(time_hour, origin) %>%
filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>
It is the combination of a primary and a foreign key that forms a relation. Relations might be one-to-one, where each instance of the primary key appears at most once among the foreign keys, or one-to-many, where each primary key matches one or more foreign keys.
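For example, the relation between planes and flights is one-to-many, which we can confirm by counting how many rows of flights each tail number appears on:

# Each tail number (the primary key of planes) can appear on many flights,
# so the planes-to-flights relation is one-to-many
flights %>%
  count(tailnum, sort = TRUE) %>%
  head(3)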
This lab will cover the topics of mutating joins and filtering joins, which will be described in more detail in their respective sections. For most of these exercises we will be using a contrived database of customer and order data that is small enough for us to visually inspect the results:
orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv")
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv")
The goal of mutating joins is to combine variables from two different data frames X and Y.
There are three primary types of mutating joins:

- A left join keeps all of the records in X, filling in NA for records in X that do not have a match in Y.
- An inner join keeps only those records in X that do have a match in Y.
- A full join keeps all of the records in both tables, filling in NA for records in X without a match in Y and for records in Y without a match in X.

There is also a right join, which is identical to a left join but with the arguments reversed. This is sometimes useful in conjunction with the pipe operator %>%, but it is not very common; a short sketch appears after the left-join examples below.
A visual illustration of these joins is given below:

[Figure: illustration of left, inner, and full joins]
Consider our customer data frames orders and customers:
print(orders)
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
## 4 4 50 Apr-17
print(customers)
## id name
## 1 4 Tukey
## 2 8 Wickham
## 3 15 Mason
## 4 16 Jordan
## 5 23 Patil
## 6 42 Cox
We see quickly that each of these data frames shares an identifier column id, making it a natural candidate for a primary key in both data frames. We will also note that not all of the ids in the orders dataset have a corresponding entry in the customer list and, similarly, not every customer has an outstanding order.
The left_join() function is used to perform a left join, in this case adding customer information to the data frame of orders:
# the first argument is always "x", the second is "y", and "by" is third
left_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
## 4 4 50 Apr-17 <NA>
Because this is a left join, we know that we will be keeping all of the observations in x, regardless of whether or not they have a match in y. In this case, we see that there is an id value of 50 in our orders data frame that is not present in customers, so an NA value is filled in for that row of the resulting dataset.
If we flip the arrangement of the arguments so that x = customers and y = orders, we will find the situation is changed. In other words, left joins are not symmetric; swapping the x and y values will often produce different results:
left_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 15 Mason NA <NA>
## 4 16 Jordan NA <NA>
## 5 23 Patil NA <NA>
## 6 42 Cox 3 Apr-15
Here, the situation is a bit different: we see that customers with id values of 15, 16, and 23 do not have any orders listed, so these values have been filled in with NA. Further, as we saw in the last example, no customer has an id value of 50, so that order has been dropped entirely and does not appear after the merge.
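As mentioned earlier, a right join is simply a left join with the roles of the arguments reversed. As a quick sketch with the same data, the following call keeps every customer and contains the same information as the result directly above, though the row and column order may differ:

# A right join keeps all rows of y (here, customers), mirroring
# left_join(x = customers, y = orders, by = "id")
right_join(x = orders, y = customers, by = "id")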
Question 1 In this question we are going to consider the output of an example left join between two data sets, tab1 and tab2:
tab1 <- data.frame(key = c("a", "a", "b", "b", "b", "c"), A = 1:6)
tab1
## key A
## 1 a 1
## 2 a 2
## 3 b 3
## 4 b 4
## 5 b 5
## 6 c 6
tab2 <- data.frame(key = c("a", "b", "c"), B = c("X", "Y", "C"))
tab2
## key B
## 1 a X
## 2 b Y
## 3 c C
Given our initial description of the left join, which retains all of the elements in the x argument, one might expect that using x = tab2 would return a data frame that contains three rows, one for each observation. Instead, we see that it produces a data frame with six rows. Why do you think this happens?
left_join(tab2, tab1, by = "key")
## key B A
## 1 a X 1
## 2 a X 2
## 3 b Y 3
## 4 b Y 4
## 5 b Y 5
## 6 c C 6
A full join, also known as an outer join, performs a join such that all observations from both the x and y arguments are included:
full_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
## 4 4 50 Apr-17 <NA>
## 5 NA 15 <NA> Mason
## 6 NA 16 <NA> Jordan
## 7 NA 23 <NA> Patil
Note how there is now a row for each unique id value from both the customers and orders datasets. When a full join is used, none of the observations are dropped, and any missing values are filled in with NA. Because of this, outer joins are symmetric, meaning that changing the order of the arguments will only impact the ordering of the columns in the resulting data frame.
# Now it goes id, name, order, date
full_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 15 Mason NA <NA>
## 4 16 Jordan NA <NA>
## 5 23 Patil NA <NA>
## 6 42 Cox 3 Apr-15
## 7 50 <NA> 4 Apr-17
The inner_join() function is used to perform an inner join on two data frames, retaining only those observations that are present in both data sets:
inner_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
Because of this, like the outer joins, inner joins are also symmetric, meaning again that changing the x and y arguments only results in a change in the order of the columns:
inner_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 42 Cox 3 Apr-15
Question 2 Consider two data frames named A and B, each containing the primary key variable id. Each data frame has four rows, with the id values in A being \(\{1, 2, 3, 4\}\) and the id values in B being \(\{3, 4, 5, 6\}\). Given this information, state how many rows will be present in the results produced by the following joins:
left_join(x = A, y = B, by = "id")
left_join(x = B, y = A, by = "id")
full_join(x = A, y = B, by = "id")
inner_join(x = A, y = B, by = "id")
The by argument

In the ideal case, two data frames are joined by a single variable sharing the same name in both x and y. However, there are several other ways to use the by argument when this is not the case:

- by = NULL will automatically join on all variables that are shared between the datasets. This can be useful when several variables are needed to uniquely identify an observation, but it is absolutely critical that you be sure that each of the variables refers to the same thing. For example, in the nycflights13 data, planes$year refers to the year the plane was manufactured while flights$year refers to the year of departure.
- by = c("a" = "b") is used when you want to use the primary key a in the x dataset and the foreign key b in the y dataset.

Here are a few examples illustrating each:
## Since they both have "id" we can leave the `by` argument empty
inner_join(x = customers, y = orders)
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 42 Cox 3 Apr-15
## We can specify that we want to match customers$id to orders$order
# Since orders also has an "id" column that we are not matching on, that
# column is carried over and renamed "id.y" to avoid a name clash
inner_join(x = customers, y = orders, by = c("id" = "order"))
## id name id.y date
## 1 4 Tukey 50 Apr-17
Question 3 In the nycflights13 database, the location of each airport (lat and lon, for latitude and longitude) is stored in the airports data file. Use the appropriate join to add the lat and lon of the destination airport to each flight recorded in flights. Do not add any other variables! This might mean that you first have to process airports with dplyr functions.
Filtering joins will match observations in the same way as mutating joins, but they will not add any variables from y to the x data frame.

- semi_join() will keep the observations in x that have a match in y.
- anti_join() will drop the observations in x that have a match in y.
An important thing to keep in mind with filtering joins: no new columns are added. In other words, none of the columns from y will be merged into the columns of x. Here are a few examples:
semi_join(x = orders, y = customers, by = "id")
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
Here, semi_join() filters out order 4 because there are no customers with an id of 50 in the customers data frame.
anti_join(x = orders, y = customers, by = "id")
## order id date
## 1 4 50 Apr-17
In the above example, anti_join() filters out everything except for order 4, because it was the only observation in the x data file without a match in y.
Filtering joins are most useful in finding implicit missing values. Whereas NA values indicate the "presence of an absence," implicit missing values are an "absence of a presence." For example, we can find out if any entries are missing in the airports dataset by looking for flights in flights whose destination doesn't have a match:
anti_join(x = flights, y = airports, by = c("dest" = "faa")) %>%
distinct(dest) # return only unique values of `dest`
## # A tibble: 4 × 1
## dest
## <chr>
## 1 BQN
## 2 SJU
## 3 STT
## 4 PSE
We see that we have four implicitly missing values in our airports data frame.
Question #4: The code below creates a data frame, top_dest, containing the ten most common flight destinations and how many flights arrived at them.
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
slice_head(n = 10)
- Use a filtering join to filter the flights data file to include only flights whose destination is one of the top ten airports. Then, use dim() to print the dimensions of the resulting data frame.
- … planes data frame
frame\(~\)
Question 5:

- Part A: Determine whether visib (visibility rating) in the weather data frame is associated with a flight being delayed more than 10 hours (a dep_delay over 600). To achieve this, you should calculate the average and standard deviation of the visib variable for each grouping of delay (more than/less than 10 hours) (using mean and sd). You may use the argument na.rm = TRUE to remove missing values when performing these calculations.
- Part B: Using count() and is.na(), find the number of flights that are missing data on the variable visib for each grouping of delay. You may also report the number of flights that were missing the variable dep_delay, which was also involved in Part A.

Note: At some point you should use the ifelse() function to create a binary variable representing whether a flight's delay was larger than 10 hours. An example of this function is shown below:
## ifelse() example
values <- c(1,1,3,4,5,10)
binary_values <- ifelse(values < 4, "less than 4", "greater than or equal to 4")
binary_values
## [1] "less than 4" "less than 4"
## [3] "less than 4" "greater than or equal to 4"
## [5] "greater than or equal to 4" "greater than or equal to 4"
As you can see, ifelse() takes a logical condition as its first argument, returning the second argument where that condition is TRUE and the third argument where that condition is FALSE.
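In practice, you will likely pair ifelse() with mutate() to attach such a binary variable to a data frame. Below is a minimal sketch; the column name long_delay and the two labels are our own choices, not part of a required solution:

# Sketch: create a binary delay grouping with mutate() and ifelse();
# flights with a missing dep_delay will produce NA here
flights %>%
  mutate(long_delay = ifelse(dep_delay > 600, "more than 10 hours", "10 hours or less")) %>%
  count(long_delay)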