The focus of this lab is strings and string manipulation. We will
do this with the stringr package, the most commonly used
string manipulation package in R.
# install.packages("stringr")
library(stringr)
library(dplyr)
What is a string? Most simply, a string is a sequence of characters
in R enclosed between single or double quotes. It’s
analogous to a number (or numeric) in that a string, however long, is a
single element in R. Also like numbers, multiple strings
can be collected together to make a vector.
# This is a string
a <- "abc"
# This is also a string
b <- "I think statistics is really super duper neat!"
# Strings can be combined into a character vector with `c`
vec <- c(a, b)
vec
## [1] "abc"
## [2] "I think statistics is really super duper neat!"
Strings have a number of unique properties that will be helpful to keep in mind; one of these is the concept of length. When we refer to the length of a string, we are referring to the number of characters that the string contains. This is distinct from the length of a vector, which indicates the number of elements in the vector:
# This is both a vector (of length one) and a string (of length 4)
y <- "four"
# The length of the string is four
str_length(y)
## [1] 4
# The length of the vector is still one
length(y)
## [1] 1
When we have a vector of strings, str_length will give
us length information for each string. This will be a common phenomenon
in this lab: functions that apply to strings will also apply to each
element of a character vector. This is a process known as
vectorization, which we will discuss in more detail later.
y <- c("apple", "banana", "pears")
str_length(y)
## [1] 5 6 5
As a corollary of the fact that strings have length, their characters also have associated indices, meaning that it is possible to request, say, the second letter of a string:
y <- "rstudio"
str_sub(y, start = 2, end = 2)
## [1] "s"
str_sub stands for “substring” rather than “subset”; subsetting is
handled by a separate (and more useful) function, str_subset.
str_sub has a somewhat unique property relative to other
functions we have seen so far in class in that it relies on positional
indexing to select which parts of a string we wish to extract. Functions
like this are useful when working with data where all entries can be
assumed to be structured the same. For example, different account codes
could be structured so that the first 3 letters indicate an industry
while the last 4 numbers indicate a billing code.
## Fake billing codes (quoted so that leading zeros are not dropped)
x <- c(paste0("ANC", c("1234", "3456", "6789")), paste0("BAC", c("0246", "9135", "4680")))
x
## [1] "ANC1234" "ANC3456" "ANC6789" "BAC0246" "BAC9135" "BAC4680"
## Extract only the industry code, the first 3 letters
str_sub(x, end = 3)
## [1] "ANC" "ANC" "ANC" "BAC" "BAC" "BAC"
## Extract only the account numbers, the last 4
str_sub(x, start = 4)
## [1] "1234" "3456" "6789" "0246" "9135" "4680"
## Negative numbers also work, counting back from the end of the string
str_sub(x, start = -4)
## [1] "1234" "3456" "6789" "0246" "9135" "4680"
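As a small sketch of how positional indexing might be used in practice, the block below splits a couple of made-up account codes into two columns of a data.frame. The vector `codes` and the column names are hypothetical and are not part of the lab data.

```r
library(stringr)

# Hypothetical account codes: 3 letters of industry + 4 digits of billing code
codes <- c("ANC1234", "BAC0246")

# Positive positions count from the front, negative positions from the back
parts <- data.frame(
  industry = str_sub(codes, start = 1, end = 3),
  billing  = str_sub(codes, start = -4),
  stringsAsFactors = FALSE
)
parts
```

This only works because every code is assumed to have the same layout; a code of a different length would be cut in the wrong place.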
As an aside, this is a good place to observe that all stringr
functions begin with str_.
We discussed coercion briefly in our lecture, though it’s worth investigating in a little more detail here.
Vectors in R are all expected to be of the same type:
logical, numeric, character, etc. In the case of mixed types,
R coerces all of the elements to be the same.
There is a hierarchy in coercion: all logicals can be represented
as numerics, and all numerics can be represented as characters.
x <- c(1, 2, 3)
x
## [1] 1 2 3
class(x)
## [1] "numeric"
x <- c(1, 2, 3, "A")
x
## [1] "1" "2" "3" "A"
class(x) # the "A" forces everything else to be a character
## [1] "character"
We can force coercion in the other direction, but mismatches will
result in missing (NA) values.
as.numeric(x)
## [1] 1 2 3 NA
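The coercion hierarchy described above can be checked directly. This is a minimal base R sketch; the vector names are made up for illustration.

```r
# Logical mixed with numeric: TRUE/FALSE become 1/0
lgl_num <- c(TRUE, FALSE, 3)
lgl_num
## [1] 1 0 3

# Adding a single string pushes everything up to character
lgl_num_chr <- c(TRUE, 3, "A")
lgl_num_chr
## [1] "TRUE" "3"    "A"

# Because logicals can stand in for numbers, summing them counts the TRUEs
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
## [1] 2
```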
Often, it’s the presence of a single character that can cause issues, contaminating a dataset. Consider this data.frame, for example, in which both of our intended numeric columns are corrupted:
dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 Missing 60
## 7 7 115 40
## 8 8 <NA> 35
## 9 9 105 40
## 10 10 100 50
str(dat)
## 'data.frame': 10 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10
## $ messy_x: chr "100" "90" "85" "90" ...
## $ messy_y: chr "50" "40" "55" "45" ...
In the first case in messy_x, we see that the cause of
the issue is that missing data was entered as "Missing"
rather than NA. Issues like this can generally be resolved
pretty quickly by simply coercing it back to the correct type.
dat$messy_x <- as.numeric(dat$messy_x)
## Warning: NAs introduced by coercion
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 NA 60
## 7 7 115 40
## 8 8 NA 35
## 9 9 105 40
## 10 10 100 50
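Before coercing a column, it can be useful to see which entries would be lost. The sketch below uses a made-up vector `raw` resembling the messy columns above and flags values that are not missing before coercion but become NA afterwards.

```r
# Hypothetical values resembling the messy columns above
raw <- c("100", "Missing", NA, "55*", "40")

# Coerce quietly, then flag entries that were present before but are NA after
converted <- suppressWarnings(as.numeric(raw))
lost <- is.na(converted) & !is.na(raw)
raw[lost]
## [1] "Missing" "55*"
```

Entries flagged this way are the ones that need cleaning with string tools rather than plain coercion.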
In the second case, we see that one of the entries is annotated as
55*. As we want to remove the annotation without losing the
entry, we will need some more tools. Problems such as this are the
primary motivation for this lab.
This lab will cover some of the most commonly used elements of the
stringr package, as well as introduce regular
expressions.
A cheat sheet for both of these together is provided here: CHEATSHEET
The stringr package, as the name implies, was created to
assist with the manipulation of strings in R. The
motivation for string manipulation arises from a number of avenues.
Often, there are issues with data entry, making it difficult to use and
manipulate data as intended; this is what we saw in the
messy_x column above. More commonly, however, string
manipulation can be used to assist us directly in data processing
tasks.
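The most common tasks we will have include modifying strings, joining and splitting them, subsetting and extracting from them, and detecting patterns, each of which is covered below.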
To that end, here are some of the more common functions we will be using, though this list is by no means comprehensive:
| Function | Description |
|---|---|
| str_sub() | Extract substring from a given start to end position |
| str_replace() | Replace the first instance of a substring with another |
| str_replace_all() | Replace all instances of a substring with another |
| str_c() | Concatenate or combine strings |
| str_subset() | Subset character vector with strings matching a pattern |
| str_detect() | Similar to str_subset, returning a logical vector |
| str_extract() | Extract matching pattern from a string |
| str_count() | Count instances of pattern in a string |
Most of the functions in the stringr package take at
least two arguments: a string and a pattern. The string is
the character string that we are manipulating, while the pattern is
the sequence of characters that the function searches for within it. This will make
more sense once we consider some examples.
The first set of functions we will consider will be those that modify
strings directly. We already saw above that str_sub can
extract a portion of a string; it can also be used to modify specific
portions of a string:
# position 1 to 3 is "sta"
x <- "sta230"
str_sub(x, 1, 3)
## [1] "sta"
# replace "sta" with "statistics"
str_sub(x, 1, 3) <- "statistics"
x
## [1] "statistics230"
Using str_sub is a bit awkward, however, as it requires
positional indices. More commonly, we use the str_replace
function, which takes the two arguments we mentioned above, string and
pattern. In this case, string is the string we are
manipulating and pattern indicates what we wish to replace.
Finally, there is a replacement argument which specifies the text to
put in place of the matched pattern:
x <- "sta320"
str_replace(string = x, pattern = "sta", replacement = "statistics")
## [1] "statistics320"
Like all stringr functions, this is vectorized, meaning
that we can do this on a vector of character strings.
# A vector of hypothetical ids
ids <- c("M289", "M432", "F201", "M365")
ids <- str_replace(string = ids, pattern = "M", replacement = "MALE")
ids
## [1] "MALE289" "MALE432" "F201" "MALE365"
ids <- str_replace(ids, "F", "FEMALE")
ids
## [1] "MALE289" "MALE432" "FEMALE201" "MALE365"
Oddly (and this also applies to several other stringr functions,
such as str_extract), these operations only happen on the first instance of
a pattern match. So, for example, if a particular pattern shows up
multiple times in a string, the replacement will only occur on the
first:
dog <- "dog_dog_dog_dog_dog"
str_replace(dog, "dog", "GOD")
## [1] "GOD_dog_dog_dog_dog"
We can specify that we want this to apply to all instances of the
pattern with str_replace_all
str_replace_all(dog, "dog", "GOD")
## [1] "GOD_GOD_GOD_GOD_GOD"
Personally, I really like using this function to remove characters I
don’t want by replacing a pattern with "" or
"_"
x <- "123-456-789"
str_replace_all(x, "-", "")
## [1] "123456789"
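One place this trick comes up is in numbers that were typed with separators. As a small sketch (the vector `amounts` is made up for illustration), removing the commas first lets the values be coerced to numeric without producing NA:

```r
library(stringr)

# Hypothetical values entered with thousands separators
amounts <- c("1,234", "56,789,000", "12")

# Strip every comma, then coerce what is left to numeric
cleaned <- as.numeric(str_replace_all(amounts, ",", ""))
cleaned
```

Calling as.numeric(amounts) directly would instead return NA for the first two entries.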
There is also a collection of mutating functions that serve as housekeeping tools, allowing us to standardize or clean up sloppy data entry. For example, consider the following:
s <- "tHe quIcK brOWn fOX juMPeD oVeR tHe LAzY dOG"
# Make all lower
str_to_lower(s)
## [1] "the quick brown fox jumped over the lazy dog"
# Make all upper
str_to_upper(s)
## [1] "THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG"
# Make a title
str_to_title(s)
## [1] "The Quick Brown Fox Jumped Over The Lazy Dog"
# Make a sentence
str_to_sentence(s)
## [1] "The quick brown fox jumped over the lazy dog"
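A typical use of these functions is making values comparable before grouping or counting. In this sketch, `state` is a made-up vector of responses that differ only in capitalization:

```r
library(stringr)

# Hypothetical survey responses that all refer to the same state
state <- c("Iowa", "IOWA", "iowa", "ioWa")

# As entered, R treats these as four different values
n_before <- length(unique(state))

# After standardizing the case there is only one
after <- unique(str_to_lower(state))
after
## [1] "iowa"
```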
It’s worth mentioning also a number of functions that handle extra
spaces which can sometimes occur with manual data entries.
str_trim will trim off all leading and trailing white
space, while str_squish will do that and remove
repeated spaces in the string
x <- "  oops   too much   space  "
str_trim(x)
## [1] "oops   too much   space"
str_squish(x)
## [1] "oops too much space"
Next we have joining and splitting which, as the heading suggests,
involves joining strings together or splitting them apart. Joining strings is
especially useful when we want to present summary information based on
statistics that have yet to be computed. str_c takes a
comma-separated collection of values, which it then pastes together
to form a single string:
val <- 1:10
str_c("The mean value of val is: ", mean(val))
## [1] "The mean value of val is: 5.5"
str_c can also be used to turn a character vector into a
single string by using the collapse argument
s <- c("a", "b", "c")
s
## [1] "a" "b" "c"
str_c(s, collapse = "")
## [1] "abc"
str_c(s, collapse = ", ")
## [1] "a, b, c"
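Two further behaviors of str_c are worth a quick sketch: it is vectorized over its inputs (with an optional sep argument placed between the pieces), and a missing value in any piece makes the result missing. The names below are made up for illustration.

```r
library(stringr)

# Vectorized: "subject" is recycled against 1:3, with "_" between pieces
labels <- str_c("subject", 1:3, sep = "_")
labels
## [1] "subject_1" "subject_2" "subject_3"

# A missing piece makes the whole result missing
with_na <- str_c("value: ", NA)
with_na
## [1] NA
```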
We have already seen a form of string splitting with the
separate function in tidyr. For simple
strings, the equivalent is str_split. Also as before, . is
a special character that needs to be escaped with \\
s <- "cat.dog"
str_split(s, "\\.")
## [[1]]
## [1] "cat" "dog"
This one is a bit strange in that it returns a list
object instead of a character vector. This is because, unlike most of the
functions we have seen so far, str_split splits at every
instance of the pattern, so the number of pieces can differ from string
to string. Returning a list allows the result for each element of the
character vector to be a different length:
s <- c("oneword", "two.words", "this.has.three")
str_split(s, "\\.")
## [[1]]
## [1] "oneword"
##
## [[2]]
## [1] "two" "words"
##
## [[3]]
## [1] "this" "has" "three"
Lists will be a topic of their own later in the semester. For now, it is sufficient to think of them as a more general version of a vector.
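Until lists are covered properly, the following sketch shows two base R ways of getting information back out of the result of str_split: lengths() reports how many pieces each string produced, and double square brackets pull out the pieces for one string.

```r
library(stringr)

s <- c("oneword", "two.words", "this.has.three")
pieces <- str_split(s, "\\.")

# How many pieces did each string produce?
n_pieces <- lengths(pieces)
n_pieces
## [1] 1 2 3

# The second piece of the third string
second_of_third <- pieces[[3]][2]
second_of_third
## [1] "has"
```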
The next collection of functions deals with subsetting and
extraction. We already saw str_sub in a few contexts, so we
will not repeat that here. The two primary functions here are
str_extract and str_subset. The first of
these, str_extract, extracts from each string the first piece of
text that matches pattern, returning NA if there is no
match:
x <- c("apple", "banana", "melon")
str_extract(x, "a")
## [1] "a" "a" NA
There is also str_subset which will instead retain each
element of a character vector matching a pattern, discarding everything
else
x <- c("apple", "banana", "melon")
str_subset(x, "a")
## [1] "apple" "banana"
These two functions seem a bit silly here, as of course searching for an “a” will return an “a”. However, we will soon introduce regular expressions, giving a far more powerful tool for identifying patterns in strings.
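Since str_extract returns only the first match in each string, stringr also provides str_extract_all, which returns every match. As a brief sketch, note that the result is a list, for the same reason that str_split returns one:

```r
library(stringr)

x <- c("apple", "banana", "melon")

# One list element per string; "melon" has no match, so its element is empty
all_a <- str_extract_all(x, "a")
all_a
```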
Finally, we have pattern detection. Like str_subset(),
these functions are vastly more useful when used in conjunction with
regular expressions. For now, we will introduce them in a simpler
context.
First we have str_detect, which returns a logical vector
indicating if a pattern was found in a string
s <- c("dog", "cat", "parrot", "hamster")
str_detect(s, pattern = "o")
## [1] TRUE FALSE TRUE FALSE
This can be used similarly to str_subset above where the
logical vector indicates which we would keep
s <- c("dog", "cat", "parrot", "hamster")
idx <- str_detect(s, pattern = "o")
s[idx] # true and false values
## [1] "dog" "parrot"
str_subset(s, "o")
## [1] "dog" "parrot"
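Because str_detect returns a logical vector, it combines naturally with sum(), mean(), and dplyr's filter(). The data.frame `pets` below is made up for illustration; this is a sketch of one common usage rather than the only way to do it.

```r
library(stringr)
library(dplyr)

# Hypothetical data.frame with one character column
pets <- data.frame(name = c("dog", "cat", "parrot", "hamster"),
                   stringsAsFactors = FALSE)

# TRUE counts as 1, so sum() counts matches and mean() gives the proportion
n_match <- sum(str_detect(pets$name, "o"))
p_match <- mean(str_detect(pets$name, "o"))

# Keep only the rows whose name contains an "o"
kept <- filter(pets, str_detect(name, "o"))
kept
```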
Also useful is str_count which counts the number of
times a pattern appears in a string
x <- "mississippi"
str_count(x, "s")
## [1] 4
Up until this point, the stringr functions we have
introduced may seem a bit limited in their usefulness. However, each of
the above examples is intended to provide an unambiguous illustration of
how each of these functions works. In this current section, we combine
our stringr functions with patterns produced by regular
expressions, greatly expanding their utility.
Regular expressions (regex) are an incredibly powerful tool for specifying patterns in a text; a short collection of examples is provided on the second page of the cheatsheet, which you will want to reference frequently for the remaining portion of this lab.
Although we will not rely on it heavily in this lab for brevity’s sake,
there is a very useful function called str_view which
allows you to see visually what a pattern matches in a string by enclosing
each match in angle brackets. As our regular expressions get more sophisticated,
this will help visualize what exactly your pattern is finding, which is
especially useful when you are not getting the results you want. In the
example below, we take the string "apple" and
identify via regex any characters contained within the
set aeiou. The result of str_view() indicates
that both "a" and "e" are identified with that
regex.
## Find all the vowels
str_view("apple", "[aeiou]")
## [1] │ <a>ppl<e>
Meta-characters are characters that have a particular meaning in
regex language, meaning that they cannot be used directly to represent
themselves. We have seen one of these already in R, with
the period . which has to be escaped. In regex,
the period is used to indicate a “wildcard” character, meaning it will
match with anything. By escaping a meta-character with \\,
we are able to tell R to interpret the character literally.
Here, we count the number of periods in a string:
x <- "hello.world"
# As a wildcard, this counts every single character
str_count(x, ".")
## [1] 11
# We must "escape" the period if we want to count periods
str_count(x, "\\.")
## [1] 1
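An alternative to escaping, sketched below, is to wrap the pattern in stringr's fixed() helper, which asks for the pattern to be matched as literal text rather than interpreted as a regular expression:

```r
library(stringr)

x <- "hello.world"

# fixed() treats "." as a literal period, so no escaping is needed
n_periods <- str_count(x, fixed("."))
n_periods
## [1] 1

# The same helper works in the other pattern-based functions
dashed <- str_replace_all("a.b.c", fixed("."), "-")
dashed
## [1] "a-b-c"
```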
We won’t list out all of the meta-characters here. Instead,
understand that in the next sections we will introduce a number of
expressions that use special characters. To use those characters
literally, we will need to escape them in strings using
\\.
Anchoring simply means that we wish to indicate the start or the end
of the string. For example, if instead of finding all instances
in which a string contains the letter “a” we want to find those that
start with the letter “a”, we can “anchor” our expression with
"^" which indicates the beginning of a string
s <- c("apple", "pineapple", "banana")
## Subset all strings that contain the letter "a"
str_subset(s, "a")
## [1] "apple" "pineapple" "banana"
## Subset all strings that begin with the letter "a"
str_subset(s, "^a")
## [1] "apple"
Likewise, we use "$" to indicate the end of a string
## Subset all strings that end with the letter "a"
str_subset(s, "a$")
## [1] "banana"
Of course, if we actually wanted to find “^” or “$”, we would need to escape them
s <- c("h^t", "hat", "hut", "hit")
str_subset(s, "\\^")
## [1] "h^t"
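The two anchors can also be used together, in which case the pattern has to account for the entire string. A short sketch with a made-up vector:

```r
library(stringr)

s <- c("cat", "concat", "cats", "cat food")

# Anchored at both ends: the whole string must be exactly "cat"
exact <- str_subset(s, "^cat$")
exact
## [1] "cat"

# Anchored only at the start: anything beginning with "cat"
starts <- str_subset(s, "^cat")
starts
## [1] "cat"      "cats"     "cat food"
```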
While the term “regular expressions” does refer to the entire process
of specifying patterns, we also refer to the specific patterns
themselves as regular expressions. Regular expressions constitute an
excellent example in which complicated string processing tasks can be
simplified with the use of an LLM; however, there are a few that are
common enough that they are worth exploring in some detail. A great
place to start will be the period that we have so frequently
encountered. In a regular expression, . is a placeholder
that is intended to represent any character.
s <- c("X25YW", "X91YW", "R30WY", "X11YW")
# We want all X, followed by 2 of any character, then YW
str_subset(s, "X..YW")
## [1] "X25YW" "X91YW" "X11YW"
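Some other immediately useful ones are "\\d" and
"\\b", which refer to digits and word boundaries,
respectively. (The regex itself is \d and \b; the backslash is doubled because it must also be escaped inside an R string.) Regular expressions can be combined in powerful ways: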
s <- c("some words no numbers", "see you 2nite", "you are L8")
# Find strings that have a digit at the start of a word
str_subset(s, "\\b\\d")
## [1] "see you 2nite"
# Find strings that have a digit at the end of a word
str_subset(s, "\\d\\b")
## [1] "you are L8"
(We read “\\b\\d” as follows: “\\b” is a word boundary and, immediately following the boundary, we find a digit “\\d”. Likewise, “\\d\\b” indicates a digit and then a boundary.)
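The digit pattern works in any of the pattern-based functions introduced earlier, not only str_subset. As a brief sketch with made-up strings:

```r
library(stringr)

s <- c("see you 2nite", "route 66", "no numbers")

# Count the digits in each string
n_digits <- str_count(s, "\\d")
n_digits
## [1] 1 2 0

# Mask every digit with "#"
masked <- str_replace_all("call 555 now", "\\d", "#")
masked
## [1] "call ### now"
```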
Other common ones include "[:alpha:]",
"[:punct:]", "[:lower:]", etc. These are
especially useful with punctuation when you don’t want to escape a bunch
of characters (though see the cheat sheet, as not all punctuation falls
under "[:punct:]"):
s <- c("dogs$(@!", "^&*are", ";[co(]!ol")
# This removes "punctuation"
s <- str_replace_all(s, pattern = "[:punct:]", "")
s
## [1] "dogs$" "^are" "cool"
# This removes "symbols"
s <- str_replace_all(s, pattern = "[:symbol:]", "")
s
## [1] "dogs" "are" "cool"
Sometimes we want to express patterns as a collection of possible
matches, alternatives, or even exclude patterns. The most common of
these is the square brackets, [], which indicate that we
would like to match any one of the characters in the brackets.
s <- c("a1", "b2", "c3", "d4")
# Only match those with 1 or 3
str_subset(s, pattern = "[13]")
## [1] "a1" "c3"
Using "^" combined with square brackets, we can also
request anything but the characters in brackets. We need to be
careful, though, because this may match other characters in a string,
returning results that you do not intend. Here, we might expect that
“a1” and “c3” are not included, but since both “a” and “c” are not 1 or
3, the pattern still matches them:
s <- c("a1", "b2", "c3", "d4")
# Anything BUT 1 or 3
str_subset(s, pattern = "[^13]")
## [1] "a1" "b2" "c3" "d4"
## str_view confirms that each element of s matches this pattern somewhere
str_view(s, pattern = "[^13]")
## [1] │ <a>1
## [2] │ <b><2>
## [3] │ <c>3
## [4] │ <d><4>
By playing around a bit, we can modify expressions to get what we want. In this case, we use “$” to indicate that we don’t want the pattern 1 or 3 occurring at the end of a string (recall from above that “$” is used to indicate the end of a string).
s <- c("a1", "b2", "c3", "d4")
# Does not END in 1 or 3
str_subset(s, pattern = "[^13]$")
## [1] "b2" "d4"
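Square brackets also accept ranges such as a-z or 0-9, which saves typing out every character. The sketch below uses a made-up vector; note that the ranges are case sensitive.

```r
library(stringr)

s <- c("a1", "B2", "c3", "4d")

# Starts with a lowercase letter
lower_start <- str_subset(s, "^[a-z]")
lower_start
## [1] "a1" "c3"

# Exactly one letter (either case) followed by exactly one digit
letter_digit <- str_subset(s, "^[A-Za-z][0-9]$")
letter_digit
## [1] "a1" "B2" "c3"
```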
Quantifiers are special characters in regular expressions that allow
us to quantify patterns in particular ways. In all cases, quantifiers
modify the pattern that immediately precedes them. The first of these we
will look at is "*", which indicates “zero or more”
matches. Here, we use it to find strings that contain the character “a”
followed by zero or more “p”
s <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
str_subset(s, "ap*")
## [1] "Pineapple" "Pear" "Orange" "Peach" "Banana"
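Notice that the “a” in “Pear” is not followed by any “p”; this is still OK because “*” indicates zero or more. Notice also that “Apple” is not returned: patterns are case sensitive, and “Apple” contains no lowercase “a”.
This seems a little silly, but its utility becomes more apparent in the following example: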
s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")
# We want everything that starts with "x", may or may not have "y", ends with "z"
str_detect(s, "xy*z")
## [1] TRUE TRUE TRUE TRUE TRUE
Quantifiers also allow us to further define patterns. For this next
example, we will use str_count to count the number of
vowels in a string
s <- "iowa"
str_count(s, "[aeiou]")
## [1] 3
str_view(s, "[aeiou]")
## [1] │ <i><o>w<a>
However, we may be interested in knowing how many runs of consecutive vowels there are, separated by consonants. In this case, we have “io” and “a” with a non-vowel in between. We can specify that the pattern we are looking for will include one or more of something with “+”:
s <- "iowa"
# Counts a run of consecutive vowels as a single instance
str_count(s, "[aeiou]+")
## [1] 2
str_view(s, "[aeiou]+")
## [1] │ <io>w<a>
We can be more precise with our quantifiers using curly braces,
"{}" to indicate exactly what quantity constitutes a
pattern. For example, suppose we want to know how many instances of
exactly two vowels occur in a string:
s <- "iowa"
str_count(s, "[aeiou]{2}")
## [1] 1
str_view(s, "[aeiou]{2}")
## [1] │ <io>wa
Of course, as with other regular expressions, we must be very precise in what we are asking, for as we ask, thus we shall receive.
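Two more quantifier forms are sketched here with made-up strings: "?" means “zero or one” of the preceding item, and curly braces can take a range, {n,m}, or an open-ended minimum, {n,}. Quantifiers match as much as they are allowed to.

```r
library(stringr)

# "u?" makes the "u" optional, but allows at most one
optional_u <- str_detect(c("color", "colour", "colouur"), "colou?r")
optional_u
## [1]  TRUE  TRUE FALSE

# Between 2 and 3 "y": takes 3 of the 4 available
two_to_three <- str_extract("xyyyyz", "y{2,3}")
two_to_three
## [1] "yyy"

# 2 or more "y": takes all 4
two_or_more <- str_extract("xyyyyz", "y{2,}")
two_or_more
## [1] "yyyy"
```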
Question 1 Using the functions we have learned so
far, modify the string below to produce the string
"United_States"
s <- c(" UNITED STATES ")
Question 2 Return to the dataset we introduced at the beginning of the lab
dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 Missing 60
## 7 7 115 40
## 8 8 <NA> 35
## 9 9 105 40
## 10 10 100 50
Using the tools we have introduced thus far, correct this dataset so
that the missing values in messy_x are recorded as
NA and the annotated values in messy_y are
modified so that it can be correctly converted to numeric.
Question 3 Use str_extract to extract
the first letter of each string in this vector. Then combine those
letters with str_c to create a single character string
s <- c("howdy", "otter", "tragedy", "danger", "octopus", "grapes")
Not such a silly function now, is it?
Question 4 A common problem in data processing involves the inclusion of “free response” questions in which respondents are able to enter data without restriction. In the following character vector, respondents were asked to give their phone numbers, but no standardization was enforced when doing so:
phone_strings <- c("Home: (507)-645-5489",
"Cell: 219.917.9871",
"My work phone is 507-202-2332",
"I don't have a phone")
Using functions from this lab, extract the phone numbers
from the character vector and modify the strings so that all of the
numbers are of the form XXX-XXX-XXXX. If a number is not
included, it can be left as an empty string ("").
Hints:
- "\\d{3}[-.]" will identify any three digits followed by a hyphen or period
- str_replace can get rid of pesky things we don’t need
Question 5 The following code chunk takes the
rownames from the USArrests dataset and uses them to create
a data.frame with a single column containing all of the US states
states <- data.frame(states = rownames(USArrests))
head(states)
## states
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
Using the dplyr package along with stringr
functions introduced in this lab, create a column that counts
how many vowels are included in each state name, and then provide a
ratio of the number of vowels to the number of total letters in each
state name. Arrange the data set so that states with the highest ratio
of vowels to letters are on top. Hint: don’t forget to deal with
uppercase/lowercase letters
Question 6 Consider the following character vector
and the str_subset function that is intending to return all
of the strings with exactly 2 digits. What does it actually return? Why
do you think this is? Use str_view to see what patterns are
being matched
s <- c("1one", "22two", "333three", "4444four")
# Find
str_subset(s, pattern = "\\d{2}")
## [1] "22two" "333three" "4444four"
Now modify the regular expression so that it only pulls out
"22two". Hint: what is in front of the digits and what is
behind them? What regular expressions can you combine to get what you
want?
Question 7: Below is a collection of IDs pulled from a database. A valid ID will consist of a single capital letter, followed by three digits:
ids <- data.frame(ID = c("ID: Q781","no id","ID: B847","ID: Z12","ID: M120",
"ID: A123", NA,"ID: H064","ID: AA123","ID: R006","ID: D555",
"ID: K410","ID: G777", "ID: C019","ID: L999", "NA", "ID: E902","ID: J888",
"ID: P333","ID: F301","ID: N450", "ID is R569"))
For this problem, you should
NA
Question 8: A collaborator has just sent you a data.frame like the one below recording name and demographic data on a list of subjects:
df <- data.frame(Subject = c(
"name: Alice, sex: F, age: 22",
"name: Bob, sex: Male, age: 9",
"name: Carol Smith, sex: Female, age: 31",
"name: David Crockett, sex: M, age: 45",
"name: Emily Johnson, sex: F, age: 18",
"name: Frank, sex: Male, age: 27",
"name: Grace Lee, sex: Female, age: 54",
"name: Henry, sex: M, age: 6",
"name: Isabella Martinez, sex: Female, age: 39",
"name: Jack, sex: M, age: 72"
))
Clean this data frame so that there are three columns: name, sex, and age. Then, apply the appropriate data summaries to find the average age for each sex.
Question 9: Below is a list of emails that were collected from an identity harvesting site. Before we engage in white-collar crime, we need to verify that the email addresses we have are valid. A valid email will have the following properties:
emails <- data.frame(email = c(
"katie35@gmail.com",
"hilary@grinnell.edu",
"abbi12@site.org",
"kendall7@iastate.gov",
"brianna99@gmail.com",
"alex@site.org",
"megan88@grinnell.edu",
"madison@iastate.gov",
"jenny4@gmail.com",
"kelly@site.org",
"hilaryduff@gmail.com",
"abby@grinnell.education",
"kendall@iastate.gov",
"brianna@gmail.com",
"alex12@site.org",
"megan@grinnell.edu",
"madison_7@gmail.com",
"jenny!@site.org",
"noatsymbolgmail.com",
"kelly3@iastate.g0v"
))
For this problem you should: