Preamble

The focus of this lab is going to be on strings and string manipulation. We will be fascilitating this with the stringr package

# install.packages("stringr")
library(stringr)
library(dplyr)

Strings

What is a string? Mostly simply, a string is a sequence of characters in R enclosed between single or double quotes. It’s analogous to a number (or numeric) in that a string, however long, is a single element in R. Also like numbers, multiple strings can be collected together to make a vector

# This is a string
a <- "abc"

# This is also a string
b <- "I think statistics is really super duper neat!"

# Strings can be combined into a character vector with `c`
vec <- c(a, b)
vec
## [1] "abc"                                           
## [2] "I think statistics is really super duper neat!"

Note: As a general comment on naming variables and best practices, we should avoid things like c <- "blah blah" because c is a function commonly used in R. This also applies to naming things with other common function names such as mean, sd, and summary.

Strings have a number of unique properties that will be helpful to keep in mind; one of these is the concept of length. When we refer to the length of a string, we are referring to the number of characters that the string contains:

# This is both a string and a vector of length 1
y <- "four"

# The length of the string is four
str_length(y)
## [1] 4
# The length of the vector is still one
length(y)
## [1] 1

When we have a vector of strings, str_length will give us length information for each string. This will be a common phenomenon in this lab – funtions that apply to strings will also apply to each element of a character vector

y <- c("apple", "banana", "pears")
str_length(y)
## [1] 5 6 5

As a corollary of the fact that strings have length, they also have associated indices meaning that it is possible for me to specify that I want, say, the second letter of a string

y <- "rstudio"
str_sub(y, start = 2, end = 2)
## [1] "s"

str_sub stands for “substring” rather than “subset”, which exists in the more useful function str_subset (observe that all stringr functions begin with str_). While this is one of the less useful functions, it is helpful in illustrating a few concepts. Using str_sub, we can also specify that we want all of the characters up to a certain point, all of the characters past a certain point, or all of the characters between two points

y <- "firefighter"

# Give me everything starting at position 5
str_sub(y, start = 5)
## [1] "fighter"
# Give me everything up to position 4
str_sub(y, end = 4)
## [1] "fire"
# Give me everything between 3 and 9
str_sub(y, start = 3, end = 9)
## [1] "refight"

str_sub also (conveniently) works with negative numbers. In this case, they work backwards from the end of a string

str_sub(y, start = -7, end = -3)
## [1] "fight"

Coercion

We discussed coercian briefly in the second lab, but it is worth reviewing briefly here as this is a very common problem in data analysis.

Vectors in R are all expected to be of the same type: logicals, numerics, characters, etc.,. In the case of a mixed type, R coerces all of the elements to be the same. There is a hierarchy in coercion in that all logicals can be represented by numerics, and all numerics can be represented as characters, etc.,.

x <- c(1, 2, 3)
x
## [1] 1 2 3
class(x)
## [1] "numeric"
x <- c(1, 2, 3, "A")
x
## [1] "1" "2" "3" "A"
class(x)
## [1] "character"

We can force coercion in the other direction, but mismatches will result in missing (NA) values.

as.numeric(x)
## [1]  1  2  3 NA

Often, it’s the presence of a single character that can cause issues, contaminating a dataset. Consider this data.frame, for example, in which both of our intended numeric columns are corrupted:

dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6 Missing      60
## 7   7     115      40
## 8   8    <NA>      35
## 9   9     105      40
## 10 10     100      50
str(dat)
## 'data.frame':    10 obs. of  3 variables:
##  $ ID     : int  1 2 3 4 5 6 7 8 9 10
##  $ messy_x: chr  "100" "90" "85" "90" ...
##  $ messy_y: chr  "50" "40" "55" "45" ...

In the first case in messy_x, we see that the cause of the issue is that missing data was entered as "Missing" rather than NA. Issues like this can generally be resolved pretty quickly by simply coercing it back to the correct type.

dat$messy_x <- as.numeric(dat$messy_x)
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6      NA      60
## 7   7     115      40
## 8   8      NA      35
## 9   9     105      40
## 10 10     100      50

In the second case, see see that one of the entries in annotated as 55*. As we want to remove the annotation without losing the entry, we will need some more tools. Problems such as this are the primary motivation for this lab.

Lab

This lab will cover some of the most commonly used elements of the stringr package, as well as introduce regular expressions

A cheat sheet for both of these together is provided here: CHEATSHEET

stringr

The stringr package, as the name implies, was created to assist with the manipulation of strings in R. The motivation for string manipulation arises from a number of avenues. Often, there are issues with data entry, making it difficult to use and manipulate data as intended; this is what we saw in the messy_x column above. More commonly, however, string manipulation can be used to assist us directly in data processing tasks.

The most common tasks we will have include

  1. Mutating strings
  2. Joining and splitting strings
  3. Subsetting and extracting strings
  4. Detecting patterns in strings

To that end, here are some of the more common functions we will be using, though it is by no means comprehensive

Function Description
str_sub() Extract substring from a given start to end position
str_replace() Replace the first instance of a substring with another
str_replace_all() Replace all instances of a substring with another
str_c() Concatenate or combine strings
str_subset() Subset character vector with strings matching a pattern
str_detect() Similar to str_subset, returning a logical vector
str_extract() Extract matching pattern from a string
str_count() Count instances of pattern in a string

Most of the functions in the stringr package take at least two arguments: string and pattern. The string indicates the character string that we are manipulating, while pattern indicates the sequence of characters that modulate our function. This will make more sense once we consider some examples.

Mutating Strings

Replace strings

The first set of functions we will consider will be those that modify strings directly. We already saw above that str_sub can extract a subset of a string; it can also be used to modify specific portions of a string

# position 1 to 3 is "sta"
x <- "sta230"
str_sub(x, 1, 3)
## [1] "sta"
# replace "sta" with "statistics"
str_sub(x, 1, 3) <- "statistics"
x
## [1] "statistics230"

Using str_sub is a bit awkward, however, as it requires positional indices. More commonly, there is the str_replace function which takes the two arguments we mentioned above, string and pattern. In this case, string is the string we are manipulating and pattern indicates what we wish to replace. Finally, there is a replacement argument which we use to replace pattern

x <- "sta320"
str_replace(string = x, pattern = "sta", replacement = "statistics")
## [1] "statistics320"

Like all stringr functions, this is vectorized, meaning that we can do this on a vector of character strings.

# A vector of hypothetical ids
ids <- c("M289", "M432", "F201", "M365")

ids <- str_replace(string = ids, pattern = "M", replacement = "MALE")
ids
## [1] "MALE289" "MALE432" "F201"    "MALE365"
ids <- str_replace(ids, "F", "FEMALE")
ids
## [1] "MALE289"   "MALE432"   "FEMALE201" "MALE365"

Oddly (and this also applies to most stringr functions), these operations typically only happen on the first instance of a pattern match. So, for example, if a particular pattern shows up multiple times in a string, the replacement will only occur on the first

dog <- "dog_dog_dog_dog_dog"
str_replace(dog, "dog", "GOD")
## [1] "GOD_dog_dog_dog_dog"

We can specify that we want this to apply to all instances of the pattern with str_replace_all

str_replace_all(dog, "dog", "GOD")
## [1] "GOD_GOD_GOD_GOD_GOD"

Personally, I really like using this function to remove characters I don’t want by replacing a pattern with "" or "_"

x <- "123-456-789"
str_replace_all(x, "-", "")
## [1] "123456789"

Misc mutations

There are also a collection of mutating functions that operate as housekeeping functions, allowing us to standardize or clean up sloppy data entry. For example, consider the following:

s <- "tHe quIcK brOWn fOX juMPeD oVeR tHe LAzY dOG"

# Make all lower
str_to_lower(s)
## [1] "the quick brown fox jumped over the lazy dog"
# Make all upper
str_to_upper(s)
## [1] "THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG"
# Make a title
str_to_title(s)
## [1] "The Quick Brown Fox Jumped Over The Lazy Dog"
# Make a sentence
str_to_sentence(s)
## [1] "The quick brown fox jumped over the lazy dog"

Clearly, some of these are more useful than others.

It’s worth mentioning also a number of functions that handle extra spaces which can sometimes occur with manual data entries. str_trim will trim off all leading and trailing white space, while str_squish will do that and remove repeated spaces in the string

x <- "   oops   too much    space "  
str_trim(x)
## [1] "oops   too much    space"
str_squish(x)
## [1] "oops too much space"

Question 1 Using the functions we have learned so far, modify the string below to produce the string "United_States"

s <- c("  UNITED  STATES  ")

Joining and splitting

Next we have joining and splitting which, as the heading suggests, involves joining or splitting strings together. Joining strings is especially useful when we want to present summary information based on statistics that have yet to be computed. str_c takes a comma separated collection of values which it will then paste together to form a single string

val <- 1:10
str_c("The mean value of val is: ", mean(val))
## [1] "The mean value of val is: 5.5"

str_c can also be used to turn a character vector into a single string by using the collapse argument

s <- c("a", "b", "c")
s
## [1] "a" "b" "c"
str_c(s, collapse = "")
## [1] "abc"
str_c(s, collapse = ", ")
## [1] "a, b, c"

We have already seen a form of string splitting with the separate function in dplyr. With simple strings, it is str_split. Also as before, . is a special character that needs to be escaped with \\

s <- "cat.dog"
str_split(s, "\\.")
## [[1]]
## [1] "cat" "dog"

This one is a bit strange in that it returns a list object instead of a character vector. This is because unlike the other functions we have seen so far, str_split will split every instance of the pattern. Returning a list allows each element of the character vector to be a different length

s <- c("oneword", "two.words", "this.has.three")
str_split(s, "\\.")
## [[1]]
## [1] "oneword"
## 
## [[2]]
## [1] "two"   "words"
## 
## [[3]]
## [1] "this"  "has"   "three"

Subsetting and Extracting

The next collection of functions deals with subsetting and extraction. We already saw str_sub in a few contexts, so we will not repeat that here. The two primary functions here are str_extract and str_subset. The first of these, str_extract, simply extracts from the string any pattern that matches pattern or otherwise returns NA

x <- c("apple", "banana", "melon")
str_extract(x, "a")
## [1] "a" "a" NA

There is also str_subset which will instead retain each element of a character vector matching a pattern, discarding everything else

x <- c("apple", "banana", "melon")
str_subset(x, "a")
## [1] "apple"  "banana"

These two functions seem a bit silly here, but we will return to them after we introduce regular expressions.

Detecting patterns

Finally, we have pattern detection. This again is much more useful with regular expressions, but we will briefly introduce the primary functions here.

First we have str_detect, which returns a logical vector indicating if a pattern was found in a string

s <- c("dog", "cat", "parrot", "hamster")
str_detect(s, pattern = "o")
## [1]  TRUE FALSE  TRUE FALSE

This can be used similarly to str_subset above wher the logical vector indicates which we would keep

s <- c("dog", "cat", "parrot", "hamster")
idx <- str_detect(s, pattern = "o")
s[idx] # true and false values
## [1] "dog"    "parrot"
str_subset(s, "o")
## [1] "dog"    "parrot"

Also useful is str_count which counts the number of times a pattern appears in a string

x <- "mississippi"
str_count(x, "s")
## [1] 4

Regular Expressions

Up until this point, the stringr functions we have introduced may seem a bit limited in their usefulness. However, each of the above examples is intended to provide an unambiguous illustration of how each of these functions work. In this current section, we combine our stringr functions with patterns produced by regular expressions, infinitely expanding their utility.

Regular expressions (regex) are an incredibly powerful tool for specifying patterns in a text; a short collection of examples is provided on the second page of the cheatsheet, which you will want to reference frequently for the remaining portion of this lab.

View

While we will not include it here in the lab for brevity’s sake, there is a very useful function called str_view which allows you to see visually the pattern in a string by enclosing the pattern in brackets. As our regular expressions get more sophisticated, this will help visualize what exactly your pattern is finding, especially useful when you are not getting the results you want:

## Find all the vowels
str_view("apple", "[aeiou]")
## [1] │ <a>ppl<e>

Meta-characters

Meta-characters are characters that have a particular meaning in regex language, meaning that they cannot be used to literally express a string value. We have seen one of these already in R, with the period . which has to be escaped. In regex, the period is used to indicate a “wildcard” character, meaning it will match with anything. By escaping a meta-character with \\, we are able to tell R to interpret the string literally. Here, we count the number of periods in a string

x <- "hello.world"

# As a wildcard, this counts every single character
str_count(x, ".")
## [1] 11
# We must "escape" the period if we want to count periods
str_count(x, "\\.")
## [1] 1

We won’t list out all of the meta-characters here. Instead, understand that in the next sections we will introduce a number of expressions that use special characters. To use those characters literally, we will need to escape them in strings.

Question 2 Return to the dataset we introduced at the beginning of the lab

dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6 Missing      60
## 7   7     115      40
## 8   8    <NA>      35
## 9   9     105      40
## 10 10     100      50

Using the tools we have introduced thus far, correct this dataset so that the missing values in messy_x are recorded as NA and the annotated values in messy_y are modified so that it can be correctly converted to numeric.

Anchoring

Anchoring simply means that we wish to indicate the start or the end of the string. For example, if instead of finding all instances in which a string contains the letter “a” we want to find those that start with the letter “a”, we can “anchor” our expression with "^" which indicates the beginning of a string

s <- c("apple", "pineapple", "banana")
str_subset(s, "a")
## [1] "apple"     "pineapple" "banana"
str_subset(s, "^a")
## [1] "apple"

Likewise, we use "$" to indicate the end of a string

s <- c("apple", "pineapple", "banana")
str_subset(s, "a")
## [1] "apple"     "pineapple" "banana"
str_subset(s, "a$")
## [1] "banana"

Of course, if we actually wanted to find “^” or “$”, we would need to escape them

s <- c("h^t", "hat", "hut", "hit")
str_subset(s, "\\^")
## [1] "h^t"

Question 3 Use str_extract to extract the first letter of each string in this vector. Then combine those letters with str_c to create a single character string

s <- c("howdy", "otter", "tragedy", "danger", "octopus", "grapes")

Not such a silly function now, is it?

Regular Expressions

While regular expressions does refer to this entire process itself, there are also sets of patterns that we refer to as regular expressions. A pretty decent list is included on the cheat sheet. Here, we will finally introduce the period we have seen so frequently in it’s natural environment. In a regular expression, . simply means any character.

s <- c("X25YW", "X91YW", "R30WY", "X11YW")

# We want all X, followed by 2 of any character, then YW
str_subset(s, "X..YW")
## [1] "X25YW" "X91YW" "X11YW"

Some other immediately useful ones are "\d" and "\b" which refer to digits and word boundaries, respectively. Regular expressions can be combined in powerful ways:

s <- c("some words no numbers", "see you 2nite", "you are L8")

# Find strings that have digits that start of a word
str_subset(s, "\\b\\d")
## [1] "see you 2nite"
# Find strings that have digits at end of word
str_subset(s, "\\d\\b")
## [1] "you are L8"

Other common ones includes "[:alpha:]", "[:punct:]", "[:lower:]", etc., . This is especially useful with punctuation when you don’t want to escape a bunch of characters (though, see the cheat sheet as not all punctuation falls under "[:punct:])

s <- c("dogs$(@!", "^&*are", ";[co(]!ol")
# This removes "punctuation"
s <- str_replace_all(s, pattern = "[:punct:]", "")
s
## [1] "dogs$" "^are"  "cool"
# This removes "symbols"
s <- str_replace_all(s, pattern = "[:symbol:]", "")
s
## [1] "dogs" "are"  "cool"

Question 4 A common problem in data processing involves the inclusion of “free response” questions in which respondents are able to enter data without restriction. In the following string, respondents were asked to give their phone numbers, but no standardization was enforced when doing so:

phone_strings <- c("Home: (507)-645-5489", 
                   "Cell: 219.917.9871", 
                   "My work phone is 507-202-2332",
                   "I don't have a phone")

Using functions from this lab, extract the phone numbers from the character vector and modify the strings so that all of the numbers are of the form XXX-XXX-XXXX. If a number is not included, it can be left as an empty string (""). Hint: A pattern like "\\d{3}[-.]" will identify any three digits followed by a hyphen or period. Regular expressions can be compounded. str_replace can get rid of pesky things we don’t need.

Alternates

Sometimes we want to express patterns as a collection of possible matches, alternatives, or even exclude patterns. The most common of these is the square brackets, [] which indicate that we would like to match any one of the the patterns in brackets.

s <- c("a1", "b2", "c3", "d4")

# Only match those with 1 or 3
str_subset(s, pattern = "[13]")
## [1] "a1" "c3"

Using "^" combined with square brackets, we can also request anything but the patterns in brackets. We need to be careful, though, because this may match match other patterns in a string, returning results that you do not intend. Here, we might expect that “a1” and “c3” are not included, but since both “a” and “c” are not 1 or 3, the pattern still flags

s <- c("a1", "b2", "c3", "d4")

# Anything BUT 1 or 3
str_subset(s, pattern = "[^13]")
## [1] "a1" "b2" "c3" "d4"
str_view(s, pattern = "[^13]")
## [1] │ <a>1
## [2] │ <b><2>
## [3] │ <c>3
## [4] │ <d><4>

By playing around a bit, we can modify expressions to get what we want

s <- c("a1", "b2", "c3", "d4")

# Does not END in 1 or 3
str_subset(s, pattern = "[^13]$")
## [1] "b2" "d4"
s <- c("dog^cat", "dogacat")
str_detect(s, "[d^c]")
## [1] TRUE TRUE

Question 5 The following code chunk takes the rownames from the USArrests dataset and uses them to create a data.frame with a single column containing all of the US states

states <- data.frame(states = rownames(USArrests))
head(states)
##       states
## 1    Alabama
## 2     Alaska
## 3    Arizona
## 4   Arkansas
## 5 California
## 6   Colorado

Using the dplyr package along with stringr functions introduced in this lab, create a column that counts how many vowels are included in each state name, and then provide a ratio of the number of vowels to the number of total letters in each state name. Arrange the data set so that states with the highest ratio of vowels to letters are on top. Hint: don’t forget to deal with uppercase/lowercase letters

Quantifiers

Quantifiers are special characters in regular expressions that allow us to quantify patterns in particular ways. In all cases, quantifiers modify the pattern that immediately precedes them. The first of these we will look at is "*", which indicates “zero or more” matches. Here, we use it to find strings that contain the character “a” followed by zero or more “p”

s <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
str_subset(s, "ap*")
## [1] "Pineapple" "Pear"      "Orange"    "Peach"     "Banana"

This seems a little silly, but its utility becomes more apparent in the following example

s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")

# We want everything that starts with "x", may or may not have "y", ends with "z"
str_detect(s, "xy*z")
## [1] TRUE TRUE TRUE TRUE TRUE

Quantifiers also allow us to further define patterns. For this next example, we will use str_count to count the number of vowels in a string

s <- "iowa"
str_count(s, "[aeiou]")
## [1] 3
str_view(s, "[aeiou]")
## [1] │ <i><o>w<a>

However, we may be interested in knowing how many instances of subsequent vowels there are. In this case, we have “io” and “a”. We can specify that the pattern we are looking for will include one or more of something with “+”

s <- "iowa"
# Counts subsequent vowels as a single instance
str_count(s, "[aeiou]+")
## [1] 2
str_view(s, "[aeiou]+")
## [1] │ <io>w<a>

We can be more precise with our quantifiers using curly braces, "{}" to indicate exactly what quantity constitutes a pattern. For example, suppose we want to know how many instances of exactly two vowels occur in a string:

s <- "iowa"
str_count(s, "[aeiou]{2}")
## [1] 1
str_view(s, "[aeiou]{2}")
## [1] │ <io>wa

Of course, as with other regular expressions, we must be very precise in what we are asking, for as we ask, thus we shall receive.

Question 6 Consider the following character vector and the str_subset function that is intending to return all of the strings with exactly 2 digits. What does it actually return? Why do you think this is? Use str_view to see what patterns are being matched

s <- c("1one", "22two", "333three", "4444four")

# Find
str_subset(s, pattern = "\\d{2}")
## [1] "22two"    "333three" "4444four"

Now modify the regular expression so that in only pulls out "22two". Hint: what is in front of the digits and what is behind them? What regular expressions can you combine to get what you want?