lubridate
This lab will cover select functions in the lubridate
package that are useful in working with dates and times:
# install.packages("lubridate")
library(lubridate)
Suppose we wanted to calculate the number of days that have elapsed
between Dec 12, 2019 and today. We could start by retrieving today’s
date using the function Sys.Date():
# Please don't get me started on the inconsistent capitalization
today <- Sys.Date()
today
## [1] "2023-10-06"
At first glance, this appears to be a character string. But if we try to subtract another character string from it, we run into an error:
today - "2019-12-12"
## Error in unclass(as.Date(e1)) - e2: non-numeric argument to binary operator
As it turns out, dates, and by dates we mean specifically the
combination of month, day, and year, have a special Date class in R.
Although a Date has the appearance of a string (and can be manipulated
with stringr functions), its underlying representation is numeric; more
precisely, it is the number of days since January 1, 1970. If we want to
be able to subtract dates, we must first convert the character string
with the as.Date() constructor, which takes a character vector as its
argument.
# It is a Date class
class(today)
## [1] "Date"
# But the underlying type is numeric (double)
typeof(today)
## [1] "double"
# We can explicitly cast it to numeric, revealing its underlying form
as.numeric(today)
## [1] 19636
# Subtract dates
today - as.Date("2019-12-12")
## Time difference of 1394 days
# Matches the underlying numeric
as.numeric(today) - as.numeric(as.Date("2019-12-12"))
## [1] 1394
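Subtracting two dates returns a difftime object rather than a bare number. The sketch below shows one way to pull out the number and change its units. It assumes a fixed end date (2023-10-06, the date printed in the output above) in place of Sys.Date() so that the result is reproducible; the variable names are illustrative only.

```r
start <- as.Date("2019-12-12")
end   <- as.Date("2023-10-06")  # fixed stand-in for Sys.Date()

# Subtracting two Date objects gives a difftime
elapsed <- end - start
class(elapsed)                  # "difftime"

# Extract the plain number of days
elapsed_days <- as.numeric(elapsed, units = "days")
elapsed_days                    # 1394

# The same gap expressed in weeks
elapsed_weeks <- as.numeric(difftime(end, start, units = "weeks"))
elapsed_weeks
```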
Distinct from Date is another representation of time, one that also
includes hours, minutes, and seconds, known as POSIX (Portable Operating
System Interface). Because hours are specific to a particular timezone,
the timezone is typically included as well:
# Current time and date
now <- Sys.time()
# Includes date, time, timezone
now
## [1] "2023-10-06 12:52:35 CDT"
# Class POSIXct
class(now)
## [1] "POSIXct" "POSIXt"
# Also a numeric, gives the number of seconds since Jan 1, 1970
as.numeric(now)
## [1] 1696614755
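To see that this number really is a count of seconds from the start of 1970, we can go in the other direction. This is a small sketch using the value printed above; the variable names are made up for illustration, and the Chicago rendering assumes your system has the usual timezone database installed.

```r
secs <- 1696614755

# Rebuild the date-time from the raw number of seconds, read in UTC
moment_utc <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
utc_text <- format(moment_utc, "%Y-%m-%d %H:%M:%S", usetz = TRUE)
utc_text      # "2023-10-06 17:52:35 UTC"

# The same instant displayed on a Chicago clock
chicago_text <- format(moment_utc, "%Y-%m-%d %H:%M:%S",
                       tz = "America/Chicago", usetz = TRUE)
chicago_text

# The origin itself is stored as zero
epoch <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
as.numeric(epoch)   # 0
```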
Inconveniently, times and dates do not play nicely with one another:
# Why give a warning when this should clearly be an error?
today - now
## Warning: Incompatible methods ("-.Date", "-.POSIXt") for "-"
## [1] "-4643150-05-02"
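One way around this in base R is to convert both values to the same class before subtracting. The sketch below uses fixed values in place of Sys.time() and Sys.Date() so the results do not change from day to day; which option is appropriate depends on whether the time of day matters for your question.

```r
moment <- as.POSIXct("2023-10-06 12:52:35", tz = "America/Chicago")
start  <- as.Date("2019-12-12")

# Option 1: drop the time of day and compare two Dates
days_apart <- as.Date(moment, tz = "America/Chicago") - start
days_apart      # Time difference of 1394 days

# Option 2: promote the Date to POSIXct (taken as midnight UTC)
# and keep the time of day in the comparison
gap <- difftime(moment, as.POSIXct(start), units = "days")
gap             # a fractional number of days, between 1394 and 1395
```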
Even worse, there is no nice way to create a POSIX object. Because there are so many different ways to represent time, R demands that you tell it explicitly how your character string is formatted if it’s not already exactly what it expects.
# This is apparently fine
as.POSIXct("2017-05-24 08:45")
## [1] "2017-05-24 08:45:00 CDT"
# This is apparently not fine
as.POSIXct("05/24/2017 08:45")
## Error in as.POSIXlt.character(x, tz, ...): character string is not in a standard unambiguous format
# How cumbersome is this?
as.POSIXct("05/24/2017 08:45", format = "%m/%d/%Y %H:%M", tz = "America/Chicago")
## [1] "2017-05-24 08:45:00 CDT"
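The format string is built from conversion codes, each standing for one piece of the date-time. As a rough illustration (the second input string is made up for this example), the same codes can be rearranged to read a different layout or to print a date-time back out:

```r
# %m = month, %d = day, %Y = four-digit year, %H = hour (00-23), %M = minute
stamp <- as.POSIXct("05/24/2017 08:45",
                    format = "%m/%d/%Y %H:%M",
                    tz = "America/Chicago")

# A day-first layout with dots needs a different format string
day_first <- as.POSIXct("24.05.2017 08:45",
                        format = "%d.%m.%Y %H:%M",
                        tz = "America/Chicago")

# Both strings describe the same instant
same_instant <- stamp == day_first
same_instant                       # TRUE

# The codes also work in reverse, to control how a date-time is printed
date_only <- format(stamp, "%Y-%m-%d")
date_only                          # "2017-05-24"
```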
lubridate
There are two things in particular we hope to have demonstrated in this preamble that will be relevant for the lab that follows:
- The Date class in R includes month, day, and year. It looks like a character vector, but under the hood it is a number indicating how many days have passed since Jan 1, 1970.
- The POSIXct class in R includes everything from Date, as well as hours, minutes, and seconds. It also requires a timezone. Similar to Date, it is stored as a numeric, but now as the number of seconds since Jan 1, 1970.

This lab will focus on using the lubridate package, which significantly reduces the burden associated with handling dates in R.
As you might expect, lubridate comes with a suite of
functions intended to extract the constituent parts of a date.
Component | Function |
---|---|
Year | year() |
Month | month() |
Day | day() |
Hour | hour() |
Minute | minute() |
Second | second() |
Here, we consider how these functions operate on a POSIXct variable.
They will work on Date variables as well, though because Date does not
include hour, minute, or second, these will be returned as zero:
# This is POSIXct
(now <- Sys.time())
## [1] "2023-10-06 12:52:35 CDT"
year(now)
## [1] 2023
day(now)
## [1] 6
month(now)
## [1] 10
hour(now)
## [1] 12
# Returned as zero because a Date has no time-of-day component
minute(today)
## [1] 0
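The same accessor functions can also be used to modify a component in place. The sketch below uses a fixed timestamp (the one printed above) rather than Sys.time() so that the results are reproducible.

```r
library(lubridate)

x <- ymd_hms("2023-10-06 12:52:35", tz = "America/Chicago")

# Extract components
hour(x)      # 12
second(x)    # 35

# month() can return a labelled factor instead of a number
# (the label text depends on your locale)
month(x, label = TRUE)

# Accessors also work on the left-hand side of an assignment
year(x) <- 2024
year(x)      # 2024
```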
As we mentioned previously, one of the biggest challenges in working with dates or times is the multiplicity of formats that are frequently used. For example, the date “September 1, 1939” may be recorded in any number of ways, such as “September 1st, 1939”, “9/1/39”, or “9/1/1939”.
lubridate provides a collection of functions that help standardize
these formats into a common representation. The functions themselves
are mdy, ymd, and dmy. In other words, you indicate the order in which
the month, day, and year are stored, and the functions handle the rest.
## Stored in mdy format
mdy("September 1st, 1939")
mdy("9/1/39") # <- We do have to be careful sometimes
mdy("9/1/1939")
## [1] "1939-09-01"
## [1] "2039-09-01"
## [1] "1939-09-01"
# Stored as year month day
ymd("1962 February 7")
ymd("1962/2/7")
## [1] "1962-02-07"
## [1] "1962-02-07"
# You can sometimes get a little crazy
dmy("30th of May 2019")
## [1] "2019-05-30"
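The two-digit year in "9/1/39" was read as 2039 rather than 1939. One possible repair, sketched below, is to shift any date that falls after a known upper bound back by a century. The cutoff date here is purely an assumption for illustration; choose one that makes sense for your own data.

```r
library(lubridate)

raw    <- c("9/1/39", "3/15/21")
parsed <- mdy(raw)
parsed   # "2039-09-01" "2021-03-15"

# Assumption: no record in this data set can be later than the end of 2023
cutoff   <- as.Date("2023-12-31")
too_late <- parsed > cutoff

# Move the impossible dates back 100 years
parsed[too_late] <- parsed[too_late] - years(100)
parsed   # "1939-09-01" "2021-03-15"
```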
These functions also extend to include hours, minutes, and seconds:
mdy_hm("May 12, 2017 4:45pm", tz = "America/Chicago")
## [1] "2017-05-12 16:45:00 CDT"
mdy_hms("05-12-2017 16:45:00", tz = "America/Chicago")
## [1] "2017-05-12 16:45:00 CDT"
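Once a date-time carries a timezone, lubridate offers two different ways to change it. As a brief sketch: with_tz() keeps the same instant and shows it on another clock, while force_tz() keeps the clock reading and changes the instant it refers to.

```r
library(lubridate)

x <- mdy_hm("May 12, 2017 4:45pm", tz = "America/Chicago")

# Same moment in time, displayed in UTC (Chicago is 5 hours behind UTC in May)
in_utc <- with_tz(x, tzone = "UTC")
in_utc        # "2017-05-12 21:45:00 UTC"

# Same clock reading, relabelled as UTC: this is a different moment in time
relabelled <- force_tz(x, tzone = "UTC")
relabelled    # "2017-05-12 16:45:00 UTC"

in_utc == x       # TRUE
relabelled == x   # FALSE
```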
An important thing to keep in mind, however, is that lubridate is not
perfect. Often, without rhyme or reason, some things will work while
others do not:
# "Sept 11th" parses (but note the date it returns), "Oct 1st" works, "Sept 1st" fails
mdy("Sept 11th, 2001")
mdy("Oct 1st, 2001")
mdy("Sept 1st, 2001")
## Warning: All formats failed to parse. No formats found.
## [1] "2001-11-20"
## [1] "2001-10-01"
## [1] NA
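Because failures show up as NA, it is worth checking the result of any bulk conversion before moving on. The following is a minimal sketch with made-up input; note that a check like this only catches strings that fail outright, not strings that parse to the wrong date (such as the "2001-11-20" result above), so a quick look at the range of the parsed dates is also a good idea.

```r
library(lubridate)

raw <- c("9/1/1939", "not a date", "12/25/2020")

# quiet = TRUE suppresses the parse warning; failures still come back as NA
parsed <- mdy(raw, quiet = TRUE)

# Which inputs could not be parsed?
failed <- raw[is.na(parsed)]
failed      # "not a date"

# A plausibility check on what did parse
parsed_range <- range(parsed, na.rm = TRUE)
parsed_range
```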
Question 1: On January 27th, 1967 at 6:31 PM, the Apollo 1 spacecraft, planned to be the first manned mission of the Apollo space program, experienced a cabin fire on the launch pad at Cape Kennedy Air Force Station, Florida during a launch simulation, killing all three crew members on board. Just over 19 years later, on January 28, 1986 at 11:39 AM, the Space Shuttle Challenger exploded just off the coast of Cape Canaveral, Florida. Rounding each date to the nearest day, determine how many days passed between these two events.
apollo <- "1967 Jan 27th at 6:31:19 PM UTC"
challenger <- "28 January 1986, 1139am"
The lubridate
package also contains a handful of
functions to help perform common calculations:
Function | Output |
---|---|
yday() | day of the year (number from 1-366) |
wday() | day of the week (number from 1-7, or a factor label when label = TRUE is used) |
floor_date() | rounds the date downward |
ceiling_date() | rounds the date upward |
round_date() | rounds the date upward or downward (whichever is closer) |
A few examples demonstrating these functions are given below:
today <- Sys.Date()
## Day of year
yday(today)
## [1] 279
## Day of week
wday(today)
## [1] 6
# label = TRUE creates an ordered factor
wday(today, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
## Rounding
floor_date(today, unit = "month") # down to nearest month
## [1] "2023-10-01"
ceiling_date(today, unit = "month") # up to nearest month
## [1] "2023-11-01"
round_date(today, unit = "month") # to whichever is closest
## [1] "2023-10-01"
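A few further variations may be useful. This sketch uses the fixed date 2023-10-06 (the date shown in the output above) in place of Sys.Date(), and it assumes lubridate's default setting in which weeks start on Sunday.

```r
library(lubridate)

d <- ymd("2023-10-06")    # a Friday

# By default Sunday is day 1; week_start = 1 makes Monday day 1 instead
wday(d)                   # 6
wday(d, week_start = 1)   # 5

# Rounding down to the start of the week
week_sun <- floor_date(d, unit = "week")                  # "2023-10-01"
week_mon <- floor_date(d, unit = "week", week_start = 1)  # "2023-10-02"

# In a leap year, the last day of the year is day 366
yday(ymd("2020-12-31"))   # 366

# Rounding also works on date-times, with finer units
stamp <- ymd_hms("2023-10-06 12:52:35", tz = "America/Chicago")
nearest_hour <- round_date(stamp, unit = "hour")
nearest_hour              # "2023-10-06 13:00:00 CDT"
```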
Question 2: Create a date/time object for 9:15pm in Los Angeles on February 14, 2020. Then round this date to the nearest day and determine which day of the week that day was.
Sometimes you’ll encounter data consisting of times without an attached date. These might be times within a day, such as “01:30:00” or 1:30 AM, or durations of time, such as 1 hour, 30 minutes, and 0 seconds.
The lubridate
package provides a simple storage class
for times without dates that can be applied using the hms()
function.
## Example
(time <- hms("01:10:00"))
## [1] "1H 10M 0S"
# Convert to total minutes
60*hour(time) + minute(time)
## [1] 70
Because these objects represent an amount of time measured from
00:00:00, we can perform arithmetic on them directly:
hms("01:10:00") - hms("01:05:00")
## [1] "5M 0S"
We can also exploit this fact to easily convert results to seconds using pipelines:
(hms("01:10:00") - hms("01:05:00")) %>% seconds()
## [1] "300S"
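If a plain number is needed rather than an object that prints as "300S", one option is lubridate's period_to_seconds() function, sketched here on the same example:

```r
library(lubridate)

gap <- hms("01:10:00") - hms("01:05:00")

# Plain numeric number of seconds, ready for further arithmetic
gap_seconds <- period_to_seconds(gap)
gap_seconds              # 300

# For example, a time of 1 hour 10 minutes expressed in minutes
total_minutes <- period_to_seconds(hms("01:10:00")) / 60
total_minutes            # 70
```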
Question 3: The 2015 Boston Marathon took place on April 20th, 2015. It was the 119th running of one of the world’s most well-known races. The data below contain information, results, and splits for each finisher of the marathon:
marathon <- read.csv("https://remiller1450.github.io/data/BostonMarathon2015.csv")
Part A: A marathon is approximately 26.2 miles, making the first half 13.1 miles. Calculate the per mile pace (in seconds) for each participant in the first half of the race. Be sure to store your results.
Part B: Now calculate the pace per mile in the second half of the race. Be sure to store your results.
Part C: Now create a scatterplot displaying the relationship between
pace per mile in the first half of the race and pace per mile in the
second half of the race, by age and sex. To do this, you should assemble
your results from Parts A and B into a data frame, and you should also
include the “Age” and “M.F” columns from the original data when you
create this data frame. A target graphic is given below. Note:
hms::scale_x_time() and hms::scale_y_time() can be used to display your
first half and second half paces on a time scale. The graph shown below
uses the argument alpha = 0.2 to reduce the impact of over-plotting, and
a 45-degree line is added using geom_abline().
Note: If you get an error that hms is not available, you need to install
it with install.packages("hms"). However, do not load the entire package
with library(hms), as this will cause problems with the lubridate
package. Instead, to use a function from a package without loading the
entire package, we use the double colon :: in the form
packagename::function. In this case, we want to use the function
scale_x_time() from the package hms, hence hms::scale_x_time().
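The double colon works the same way for any installed package. As a small sketch, here it is applied to lubridate itself, using the marathon date from Question 3; no library() call is required for these lines to run, as long as lubridate is installed.

```r
# Call single functions through the package namespace
race_day <- lubridate::mdy("April 20th, 2015")

# Day of the week as a number (with lubridate's default, Sunday = 1)
race_wday <- lubridate::wday(race_day, week_start = 7)
race_wday    # 2, i.e. a Monday

# The prefix also makes explicit which package's hms() is meant
finish <- lubridate::hms("02:30:00")
lubridate::hour(finish)   # 2
```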