The focus of this lab is strings and string manipulation. We will facilitate this with the stringr package:
# install.packages("stringr")
library(stringr)
library(dplyr)
What is a string? Most simply, a string is a sequence of characters in R enclosed between single or double quotes. It’s analogous to a number (or numeric) in that a string, however long, is a single element in R. Also like numbers, multiple strings can be collected together to make a vector:
# This is a string
a <- "abc"
# This is also a string
b <- "I think statistics is really super duper neat!"
# Strings can be combined into a character vector with `c`
vec <- c(a, b)
vec
## [1] "abc"
## [2] "I think statistics is really super duper neat!"
Note: As a general comment on naming variables and best practices, we should avoid things like c <- "blah blah" because c is a function commonly used in R. This also applies to naming things with other common function names such as mean, sd, and summary.
Strings have a number of unique properties that will be helpful to keep in mind; one of these is the concept of length. When we refer to the length of a string, we are referring to the number of characters that the string contains:
# This is both a string and a vector of length 1
y <- "four"
# The length of the string is four
str_length(y)
## [1] 4
# The length of the vector is still one
length(y)
## [1] 1
When we have a vector of strings, str_length will give us length information for each string. This will be a common phenomenon in this lab: functions that apply to strings will also apply to each element of a character vector.
y <- c("apple", "banana", "pears")
str_length(y)
## [1] 5 6 5
Because strings have length, their characters also have positions (indices), so we can ask for, say, the second letter of a string:
y <- "rstudio"
str_sub(y, start = 2, end = 2)
## [1] "s"
str_sub stands for “substring” rather than “subset”; subsetting is handled by the more useful function str_subset (observe that nearly all stringr functions begin with str_). While str_sub is one of the less useful functions, it is helpful in illustrating a few concepts. Using str_sub, we can also specify that we want all of the characters up to a certain point, all of the characters past a certain point, or all of the characters between two points:
y <- "firefighter"
# Give me everything starting at position 5
str_sub(y, start = 5)
## [1] "fighter"
# Give me everything up to position 4
str_sub(y, end = 4)
## [1] "fire"
# Give me everything between 3 and 9
str_sub(y, start = 3, end = 9)
## [1] "refight"
str_sub also (conveniently) works with negative numbers. In this case, positions are counted backwards from the end of the string:
str_sub(y, start = -7, end = -3)
## [1] "fight"
We discussed coercion briefly in the second lab, but it is worth reviewing here as it is a very common problem in data analysis.
Vectors in R are all expected to be of the same type: logicals, numerics, characters, etc. In the case of a mixed type, R coerces all of the elements to be the same. There is a hierarchy in coercion in that all logicals can be represented by numerics, and all numerics can be represented as characters.
x <- c(1, 2, 3)
x
## [1] 1 2 3
class(x)
## [1] "numeric"
x <- c(1, 2, 3, "A")
x
## [1] "1" "2" "3" "A"
class(x)
## [1] "character"
We can force coercion in the other direction, but mismatches will result in missing (NA) values, along with a warning that NAs were introduced by coercion.
as.numeric(x)
## [1] 1 2 3 NA
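The coercion hierarchy can be checked directly with a few small vectors. This is a minimal sketch in base R; the vector contents are made up for illustration.

```r
# Logical and numeric: the logical is promoted to numeric (TRUE becomes 1)
c(TRUE, 2)
## [1] 1 2

# Numeric and character: everything becomes character
c(1.5, "a")
## [1] "1.5" "a"

# All three types together: character wins
class(c(TRUE, 2, "a"))
## [1] "character"
```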
Often, it’s the presence of a single character that can cause issues, contaminating a dataset. Consider this data.frame, for example, in which both of our intended numeric columns are corrupted:
dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 Missing 60
## 7 7 115 40
## 8 8 <NA> 35
## 9 9 105 40
## 10 10 100 50
str(dat)
## 'data.frame': 10 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10
## $ messy_x: chr "100" "90" "85" "90" ...
## $ messy_y: chr "50" "40" "55" "45" ...
In the first case, messy_x, we see that the cause of the issue is that missing data was entered as "Missing" rather than NA. Issues like this can generally be resolved quickly by coercing the column back to the correct type; since "Missing" cannot be read as a number, it becomes NA.
dat$messy_x <- as.numeric(dat$messy_x)
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 NA 60
## 7 7 115 40
## 8 8 NA 35
## 9 9 105 40
## 10 10 100 50
In the second case, we see that one of the entries is annotated as 55*. As we want to remove the annotation without losing the entry, we will need some more tools. Problems such as this are the primary motivation for this lab.
This lab will cover some of the most commonly used elements of the stringr package, as well as introduce regular expressions.
A cheat sheet for both of these together is provided here: CHEATSHEET
The stringr package, as the name implies, was created to assist with the manipulation of strings in R. The motivation for string manipulation arises from a number of avenues. Often, there are issues with data entry, making it difficult to use and manipulate data as intended; this is what we saw in the messy_x column above. More commonly, however, string manipulation can be used to assist us directly in data processing tasks.
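The most common tasks we will have include modifying strings, joining and splitting them, subsetting and extracting from them, and detecting patterns in them.
To that end, here are some of the more common functions we will be using, though this list is by no means comprehensive: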
Function | Description |
---|---|
str_sub() | Extract substring from a given start to end position |
str_replace() | Replace the first instance of a substring with another |
str_replace_all() | Replace all instances of a substring with another |
str_c() | Concatenate or combine strings |
str_subset() | Subset character vector with strings matching a pattern |
str_detect() | Similar to str_subset, returning a logical vector |
str_extract() | Extract matching pattern from a string |
str_count() | Count instances of pattern in a string |
Most of the functions in the stringr package take at least two arguments: string and pattern. The string indicates the character string that we are manipulating, while pattern indicates the sequence of characters that the function should look for. This will make more sense once we consider some examples.
The first set of functions we will consider will be those that modify strings directly. We already saw above that str_sub can extract a subset of a string; it can also be used to modify specific portions of a string:
# position 1 to 3 is "sta"
x <- "sta230"
str_sub(x, 1, 3)
## [1] "sta"
# replace "sta" with "statistics"
str_sub(x, 1, 3) <- "statistics"
x
## [1] "statistics230"
Using str_sub is a bit awkward, however, as it requires positional indices. More commonly, we use the str_replace function, which takes the two arguments we mentioned above, string and pattern. In this case, string is the string we are manipulating and pattern indicates what we wish to replace. Finally, there is a replacement argument, which supplies the text that replaces pattern:
x <- "sta320"
str_replace(string = x, pattern = "sta", replacement = "statistics")
## [1] "statistics320"
Like all stringr functions, this is vectorized, meaning that we can do this on a vector of character strings.
# A vector of hypothetical ids
ids <- c("M289", "M432", "F201", "M365")
ids <- str_replace(string = ids, pattern = "M", replacement = "MALE")
ids
## [1] "MALE289" "MALE432" "F201" "MALE365"
ids <- str_replace(ids, "F", "FEMALE")
ids
## [1] "MALE289" "MALE432" "FEMALE201" "MALE365"
Oddly (and this also applies to most stringr functions), these operations typically only happen on the first instance of a pattern match. So, for example, if a particular pattern shows up multiple times in a string, the replacement will only occur on the first one:
dog <- "dog_dog_dog_dog_dog"
str_replace(dog, "dog", "GOD")
## [1] "GOD_dog_dog_dog_dog"
We can specify that we want this to apply to all instances of the pattern with str_replace_all:
str_replace_all(dog, "dog", "GOD")
## [1] "GOD_GOD_GOD_GOD_GOD"
Personally, I really like using this function to remove characters I don’t want by replacing a pattern with "" or "_":
x <- "123-456-789"
str_replace_all(x, "-", "")
## [1] "123456789"
There is also a collection of functions that act as housekeeping tools, allowing us to standardize or clean up sloppy data entry. For example, consider the following:
s <- "tHe quIcK brOWn fOX juMPeD oVeR tHe LAzY dOG"
# Make all lower
str_to_lower(s)
## [1] "the quick brown fox jumped over the lazy dog"
# Make all upper
str_to_upper(s)
## [1] "THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG"
# Make a title
str_to_title(s)
## [1] "The Quick Brown Fox Jumped Over The Lazy Dog"
# Make a sentence
str_to_sentence(s)
## [1] "The quick brown fox jumped over the lazy dog"
Clearly, some of these are more useful than others.
It’s also worth mentioning a number of functions that handle extra spaces, which can sometimes occur with manual data entry. str_trim will trim off all leading and trailing white space, while str_squish will do that and also reduce repeated spaces inside the string to a single space:
x <- "  oops   too  much space  "
str_trim(x)
## [1] "oops   too  much space"
str_squish(x)
## [1] "oops too much space"
Question 1 Using the functions we have learned so far, modify the string below to produce the string "United_States"
s <- c(" UNITED STATES ")
Next we have joining and splitting which, as the heading suggests, involves joining strings together or splitting them apart. Joining strings is especially useful when we want to present summary information based on statistics that have yet to be computed. str_c takes a comma-separated collection of values, which it will then paste together to form a single string:
val <- 1:10
str_c("The mean value of val is: ", mean(val))
## [1] "The mean value of val is: 5.5"
str_c can also be used to turn a character vector into a single string by using the collapse argument:
s <- c("a", "b", "c")
s
## [1] "a" "b" "c"
str_c(s, collapse = "")
## [1] "abc"
str_c(s, collapse = ", ")
## [1] "a, b, c"
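Two further behaviors of str_c are worth a quick look: it is vectorized over its arguments, and it has a sep argument that is distinct from collapse. The sketch below uses made-up values purely for illustration.

```r
library(stringr)

# str_c is vectorized: the length-1 string is recycled across the longer vector
str_c("sample_", 1:3)
## [1] "sample_1" "sample_2" "sample_3"

# `sep` is placed between the arguments of each result
str_c("x", c("a", "b"), sep = "-")
## [1] "x-a" "x-b"

# `collapse` then joins those results into a single string
str_c("x", c("a", "b"), sep = "-", collapse = "+")
## [1] "x-a+x-b"
```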
We have already seen a form of string splitting with the separate function in tidyr. For simple strings, the equivalent is str_split. Also as before, . is a special character that needs to be escaped with \\
s <- "cat.dog"
str_split(s, "\\.")
## [[1]]
## [1] "cat" "dog"
This one is a bit strange in that it returns a list object instead of a character vector. This is because, unlike the other functions we have seen so far, str_split will split at every instance of the pattern, so different strings can produce different numbers of pieces. Returning a list allows the result for each element of the character vector to be a different length:
s <- c("oneword", "two.words", "this.has.three")
str_split(s, "\\.")
## [[1]]
## [1] "oneword"
##
## [[2]]
## [1] "two" "words"
##
## [[3]]
## [1] "this" "has" "three"
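Working with the returned list takes a little extra care. As a minimal sketch (the name pieces is only illustrative), individual results can be pulled out with double brackets, and the simplify argument of str_split offers a matrix instead of a list:

```r
library(stringr)

s <- c("oneword", "two.words", "this.has.three")
pieces <- str_split(s, "\\.")

# Double brackets pull out the pieces belonging to a single string
pieces[[3]]
## [1] "this"  "has"   "three"

# Number of pieces produced by each string
lengths(pieces)
## [1] 1 2 3

# simplify = TRUE returns a character matrix instead,
# padding the shorter rows with empty strings
m <- str_split(s, "\\.", simplify = TRUE)
dim(m)
## [1] 3 3
```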
The next collection of functions deals with subsetting and extraction. We already saw str_sub in a few contexts, so we will not repeat that here. The two primary functions here are str_extract and str_subset. The first of these, str_extract, extracts from each string the first piece of text that matches pattern, or returns NA if there is no match:
x <- c("apple", "banana", "melon")
str_extract(x, "a")
## [1] "a" "a" NA
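Since str_extract stops at the first match, "banana" contributes only one "a" above. A short sketch of the companion function str_extract_all, which returns every match as a list (the name all_a is only illustrative):

```r
library(stringr)

x <- c("apple", "banana", "melon")

# One list element per string, holding every match found in it
all_a <- str_extract_all(x, "a")

# "banana" has three matches; "melon" has none
lengths(all_a)
## [1] 1 3 0
all_a[[2]]
## [1] "a" "a" "a"
```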
There is also str_subset, which will instead retain each element of a character vector matching a pattern, discarding everything else:
x <- c("apple", "banana", "melon")
str_subset(x, "a")
## [1] "apple" "banana"
These two functions seem a bit silly here, but we will return to them after we introduce regular expressions.
Finally, we have pattern detection. This again is much more useful with regular expressions, but we will briefly introduce the primary functions here.
First we have str_detect, which returns a logical vector indicating if a pattern was found in a string:
s <- c("dog", "cat", "parrot", "hamster")
str_detect(s, pattern = "o")
## [1] TRUE FALSE TRUE FALSE
This can be used similarly to str_subset above, where the logical vector indicates which elements we would keep:
s <- c("dog", "cat", "parrot", "hamster")
idx <- str_detect(s, pattern = "o")
s[idx] # keep the elements where idx is TRUE
## [1] "dog" "parrot"
str_subset(s, "o")
## [1] "dog" "parrot"
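Because str_detect returns a logical vector, it also fits naturally inside dplyr's filter when the strings live in a data frame column. The following is a small sketch; the data frame pets and its columns are made up for illustration.

```r
library(stringr)
library(dplyr)

# A small made-up data frame
pets <- data.frame(
  name = c("dog", "cat", "parrot", "hamster"),
  legs = c(4, 4, 2, 4)
)

# Keep only the rows whose name contains an "o"
with_o <- pets %>% filter(str_detect(name, "o"))
with_o$name
## [1] "dog"    "parrot"
```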
Also useful is str_count, which counts the number of times a pattern appears in a string:
x <- "mississippi"
str_count(x, "s")
## [1] 4
Up until this point, the stringr functions we have introduced may seem a bit limited in their usefulness. However, each of the above examples is intended to provide an unambiguous illustration of how each of these functions works. In this section, we combine our stringr functions with patterns produced by regular expressions, greatly expanding their utility.
Regular expressions (regex) are an incredibly powerful tool for specifying patterns in a text; a short collection of examples is provided on the second page of the cheatsheet, which you will want to reference frequently for the remaining portion of this lab.
While we will not include it here in the lab for brevity’s sake, there is a very useful function called str_view which allows you to see visually the pattern in a string by enclosing each match in angle brackets. As our regular expressions get more sophisticated, this will help visualize what exactly your pattern is finding, which is especially useful when you are not getting the results you want:
## Find all the vowels
str_view("apple", "[aeiou]")
## [1] │ <a>ppl<e>
Meta-characters are characters that have a particular meaning in regex language, meaning that they cannot be used to literally express a string value. We have seen one of these already in R, with the period . which has to be escaped. In regex, the period is used to indicate a “wildcard” character, meaning it will match any character. By escaping a meta-character with \\, we are able to tell R to interpret it literally. Here, we count the number of periods in a string:
Here, we count the number of periods in a string
x <- "hello.world"
# As a wildcard, this counts every single character
str_count(x, ".")
## [1] 11
# We must "escape" the period if we want to count periods
str_count(x, "\\.")
## [1] 1
We won’t list out all of the meta-characters here. Instead, understand that in the next sections we will introduce a number of expressions that use special characters. To use those characters literally, we will need to escape them in strings.
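When an entire pattern should be read as plain text, an alternative to escaping each meta-character is to wrap the pattern in stringr's fixed() helper. A minimal sketch, with example strings chosen only for illustration:

```r
library(stringr)

x <- "hello.world"

# fixed() tells stringr to treat the pattern as literal text, not as regex
str_count(x, fixed("."))
## [1] 1

# The same idea works with the other stringr functions
str_replace_all("3.14.15", fixed("."), "")
## [1] "31415"
```

Escaping remains necessary whenever a literal character has to be mixed with regex features in the same pattern.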
Question 2 Return to the dataset we introduced at the beginning of the lab
dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 Missing 60
## 7 7 115 40
## 8 8 <NA> 35
## 9 9 105 40
## 10 10 100 50
Using the tools we have introduced thus far, correct this dataset so that the missing values in messy_x are recorded as NA and the annotated values in messy_y are modified so that the column can be correctly converted to numeric.
Anchoring simply means that we wish to indicate the start or the end of the string. For example, if instead of finding all instances in which a string contains the letter “a” we want to find those that start with the letter “a”, we can “anchor” our expression with "^", which indicates the beginning of a string:
s <- c("apple", "pineapple", "banana")
str_subset(s, "a")
## [1] "apple" "pineapple" "banana"
str_subset(s, "^a")
## [1] "apple"
Likewise, we use "$" to indicate the end of a string:
s <- c("apple", "pineapple", "banana")
str_subset(s, "a")
## [1] "apple" "pineapple" "banana"
str_subset(s, "a$")
## [1] "banana"
Of course, if we actually wanted to find “^” or “$”, we would need to escape them
s <- c("h^t", "hat", "hut", "hit")
str_subset(s, "\\^")
## [1] "h^t"
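The two anchors can also be used together, which forces the pattern to account for the whole string rather than just part of it. A small sketch with a made-up vector:

```r
library(stringr)

s <- c("apple", "pineapple", "apple pie")

# Unanchored: matches anywhere in the string
str_subset(s, "apple")
## [1] "apple"     "pineapple" "apple pie"

# Anchored at both ends: the entire string must be exactly "apple"
str_subset(s, "^apple$")
## [1] "apple"
```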
Question 3 Use str_extract to extract the first letter of each string in this vector. Then combine those letters with str_c to create a single character string
s <- c("howdy", "otter", "tragedy", "danger", "octopus", "grapes")
Not such a silly function now, is it?
While “regular expressions” does refer to this entire process itself, there are also sets of patterns that we refer to as regular expressions. A pretty decent list is included on the cheat sheet. Here, we will finally introduce the period we have seen so frequently in its natural environment. In a regular expression, . simply means any character.
s <- c("X25YW", "X91YW", "R30WY", "X11YW")
# We want all X, followed by 2 of any character, then YW
str_subset(s, "X..YW")
## [1] "X25YW" "X91YW" "X11YW"
Some other immediately useful ones are "\\d" and "\\b", which refer to digits and word boundaries, respectively (written in R strings with a doubled backslash). Regular expressions can be combined in powerful ways:
s <- c("some words no numbers", "see you 2nite", "you are L8")
# Find strings that have a digit at the start of a word
str_subset(s, "\\b\\d")
## [1] "see you 2nite"
# Find strings that have digits at end of word
str_subset(s, "\\d\\b")
## [1] "you are L8"
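The doubled backslash in these patterns exists because R itself treats a backslash as an escape inside a string, so "\\d" is how we write the two-character regex \d. A short sketch that makes this visible; the raw-string syntax shown at the end assumes R 4.0.0 or later, and the example text is made up.

```r
library(stringr)

# The R string "\\d" holds two characters: a backslash and the letter d
writeLines("\\d")
## \d
nchar("\\d")
## [1] 2

# That two-character regex is what stringr receives
str_extract("room 42", "\\d")
## [1] "4"

# A raw string (R >= 4.0.0) avoids the doubled backslash
str_extract("room 42", r"(\d)")
## [1] "4"
```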
Other common ones include "[:alpha:]", "[:punct:]", "[:lower:]", etc. This is especially useful with punctuation when you don’t want to escape a bunch of characters (though see the cheat sheet, as not all punctuation falls under "[:punct:]"):
s <- c("dogs$(@!", "^&*are", ";[co(]!ol")
# This removes "punctuation"
s <- str_replace_all(s, pattern = "[:punct:]", "")
s
## [1] "dogs$" "^are" "cool"
# This removes "symbols"
s <- str_replace_all(s, pattern = "[:symbol:]", "")
s
## [1] "dogs" "are" "cool"
Question 4 A common problem in data processing involves the inclusion of “free response” questions in which respondents are able to enter data without restriction. In the following string, respondents were asked to give their phone numbers, but no standardization was enforced when doing so:
phone_strings <- c("Home: (507)-645-5489",
"Cell: 219.917.9871",
"My work phone is 507-202-2332",
"I don't have a phone")
Using functions from this lab, extract the phone numbers from the character vector and modify the strings so that all of the numbers are of the form XXX-XXX-XXXX. If a number is not included, it can be left as an empty string (""). Hint: A pattern like "\\d{3}[-.]" will identify any three digits followed by a hyphen or period. Regular expressions can be compounded. str_replace can get rid of pesky things we don’t need.
Sometimes we want to express patterns as a collection of possible matches, alternatives, or even exclude patterns. The most common of these is the square brackets, [], which indicate that we would like to match any one of the characters in the brackets.
s <- c("a1", "b2", "c3", "d4")
# Only match those with 1 or 3
str_subset(s, pattern = "[13]")
## [1] "a1" "c3"
Using "^" as the first character inside square brackets, we can also request anything but the characters in brackets. We need to be careful, though, because this may match other characters in a string, returning results that you do not intend. Here, we might expect that “a1” and “c3” are not included, but since both “a” and “c” are not 1 or 3, the pattern still flags them:
s <- c("a1", "b2", "c3", "d4")
# Anything BUT 1 or 3
str_subset(s, pattern = "[^13]")
## [1] "a1" "b2" "c3" "d4"
str_view(s, pattern = "[^13]")
## [1] │ <a>1
## [2] │ <b><2>
## [3] │ <c>3
## [4] │ <d><4>
By playing around a bit, we can modify expressions to get what we want
s <- c("a1", "b2", "c3", "d4")
# Does not END in 1 or 3
str_subset(s, pattern = "[^13]$")
## [1] "b2" "d4"
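# Inside brackets, "^" only negates when it comes first; here it is a literal "^"
# (both strings match regardless, since each contains a "d" and a "c")
s <- c("dog^cat", "dogacat")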
str_detect(s, "[d^c]")
## [1] TRUE TRUE
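Square brackets also accept ranges, written with a hyphen, which saves listing every character individually. A brief sketch reusing the vector from the examples above:

```r
library(stringr)

s <- c("a1", "b2", "c3", "d4")

# Any letter from a through b
str_subset(s, "[a-b]")
## [1] "a1" "b2"

# Any digit from 3 through 9
str_subset(s, "[3-9]")
## [1] "c3" "d4"

# Bracket expressions can be placed side by side:
# a letter from b to d followed by a digit from 2 to 3
str_extract(s, "[b-d][2-3]")
## [1] NA   "b2" "c3" NA
```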
Question 5 The following code chunk takes the rownames from the USArrests dataset and uses them to create a data.frame with a single column containing all of the US states
states <- data.frame(states = rownames(USArrests))
head(states)
## states
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
Using the dplyr package along with stringr functions introduced in this lab, create a column that counts how many vowels are included in each state name, and then provide a ratio of the number of vowels to the number of total letters in each state name. Arrange the data set so that states with the highest ratio of vowels to letters are on top. Hint: don’t forget to deal with uppercase/lowercase letters
Quantifiers are special characters in regular expressions that allow us to quantify patterns in particular ways. In all cases, quantifiers modify the pattern that immediately precedes them. The first of these we will look at is "*", which indicates “zero or more” matches. Here, we use it to find strings that contain the character “a” followed by zero or more “p” (note that “Apple” is excluded only because its “A” is uppercase):
s <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
str_subset(s, "ap*")
## [1] "Pineapple" "Pear" "Orange" "Peach" "Banana"
This seems a little silly, but its utility becomes more apparent in the following example
s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")
# We want everything that starts with "x", may or may not have "y", ends with "z"
str_detect(s, "xy*z")
## [1] TRUE TRUE TRUE TRUE TRUE
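For contrast, the related quantifier "?" means “zero or one” of the preceding pattern, so it is stricter than "*". A minimal sketch on the same kind of strings:

```r
library(stringr)

s <- c("xz", "xyz", "xyyz")

# "*" allows any number of "y", so all three match
str_detect(s, "xy*z")
## [1] TRUE TRUE TRUE

# "?" allows at most one "y", so "xyyz" no longer matches
str_detect(s, "xy?z")
## [1]  TRUE  TRUE FALSE
```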
Quantifiers also allow us to further define patterns. For this next example, we will use str_count to count the number of vowels in a string:
s <- "iowa"
str_count(s, "[aeiou]")
## [1] 3
str_view(s, "[aeiou]")
## [1] │ <i><o>w<a>
However, we may be interested in knowing how many runs of consecutive vowels there are. In this case, we have “io” and “a”. We can specify that the pattern we are looking for will include one or more of something with “+”
s <- "iowa"
# Counts a run of consecutive vowels as a single instance
str_count(s, "[aeiou]+")
## [1] 2
str_view(s, "[aeiou]+")
## [1] │ <io>w<a>
We can be more precise with our quantifiers using curly braces, "{}", to indicate exactly what quantity constitutes a pattern. For example, suppose we want to know how many instances of exactly two vowels occur in a string:
s <- "iowa"
str_count(s, "[aeiou]{2}")
## [1] 1
str_view(s, "[aeiou]{2}")
## [1] │ <io>wa
Of course, as with other regular expressions, we must be very precise in what we are asking, for as we ask, thus we shall receive.
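Curly braces also accept a range: {n,} means “n or more” and {n,m} means “between n and m”. A short sketch using a vector like the one from the "*" example:

```r
library(stringr)

s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")

# Two or more "y" between the "x" and the "z"
str_detect(s, "xy{2,}z")
## [1] FALSE FALSE  TRUE  TRUE  TRUE

# Between one and three "y" between the "x" and the "z"
str_detect(s, "xy{1,3}z")
## [1] FALSE  TRUE  TRUE  TRUE FALSE
```

These patterns behave as expected here because the surrounding "x" and "z" pin down where the run of "y" must start and stop.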
Question 6 Consider the following character vector and the str_subset call that is intended to return all of the strings with exactly 2 digits. What does it actually return? Why do you think this is? Use str_view to see what patterns are being matched
s <- c("1one", "22two", "333three", "4444four")
# Find strings with exactly 2 digits (intended)
str_subset(s, pattern = "\\d{2}")
## [1] "22two" "333three" "4444four"
Now modify the regular expression so that it only pulls out "22two". Hint: what is in front of the digits and what is behind them? What regular expressions can you combine to get what you want?