Preamble

This lab focuses on strings and string manipulation. We will work with the stringr package, the most common string manipulation package in R.

# install.packages("stringr")
library(stringr)
library(dplyr)

Strings

What is a string? Most simply, a string is a sequence of characters in R enclosed between single or double quotes. It’s analogous to a number (or numeric) in that a string, however long, is a single element in R. Also like numbers, multiple strings can be collected together to make a vector.

# This is a string
a <- "abc"

# This is also a string
b <- "I think statistics is really super duper neat!"

# Strings can be combined into a character vector with `c`
vec <- c(a, b)
vec
## [1] "abc"                                           
## [2] "I think statistics is really super duper neat!"

Strings have a number of unique properties that will be helpful to keep in mind; one of these is the concept of length. When we refer to the length of a string, we are referring to the number of characters that the string contains. This is distinct from the length of a vector, which indicates the number of elements in the vector:

# This is both a vector (of length one) and a string (of length 4)
y <- "four"

# The length of the string is four
str_length(y)
## [1] 4
# The length of the vector is still one
length(y)
## [1] 1

When we have a vector of strings, str_length will give us length information for each string. This will be a common phenomenon in this lab – functions that apply to strings will also apply to each element of a character vector. This is a process known as vectorization which we will discuss in more detail later.

y <- c("apple", "banana", "pears")
str_length(y)
## [1] 5 6 5

Because strings have length, their characters also have positions (indices). This makes it possible to ask for, say, the second letter of a string:

y <- "rstudio"
str_sub(y, start = 2, end = 2)
## [1] "s"

str_sub stands for “substring” rather than “subset”; subsetting is handled by a separate (and more useful) function, str_subset. Unlike most functions we have seen so far in class, str_sub relies on positional indexing to select which part of a string we wish to extract. Functions like this are useful when all entries in the data can be assumed to share the same structure. For example, account codes could be structured so that the first 3 letters indicate an industry while the last 4 digits indicate a billing code.

## Fake billing codes
## (the numbers are written as strings so that the leading zero in "0246" is kept)
x <- c(paste0("ANC", c("1234", "3456", "6789")), paste0("BAC", c("0246", "9135", "4680")))
x
## [1] "ANC1234" "ANC3456" "ANC6789" "BAC0246" "BAC9135" "BAC4680"
## Extract only the industry code, the first 3 letters
str_sub(x, end = 3)
## [1] "ANC" "ANC" "ANC" "BAC" "BAC" "BAC"
## Extract only the account numbers, everything from position 4 onward
str_sub(x, start = 4)
## [1] "1234" "3456" "6789" "0246" "9135" "4680"
## Negative numbers also work, counting back from the end of the string
str_sub(x, start = -4)
## [1] "1234" "3456" "6789" "0246" "9135" "4680"

As an aside, this is a good place to observe that all stringr functions begin with str_.

Coercion

We discussed coercion briefly in our lecture, though it’s worth investigating in a little more detail here.

Vectors in R are all expected to be of the same type: logical, numeric, character, etc. When types are mixed, R coerces all of the elements to a single type. Coercion follows a hierarchy: every logical can be represented as a numeric, every numeric can be represented as a character, and so on.

x <- c(1, 2, 3)
x
## [1] 1 2 3
class(x)
## [1] "numeric"
x <- c(1, 2, 3, "A")
x
## [1] "1" "2" "3" "A"
class(x) # the "A" forces everything else to be a character
## [1] "character"
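
The lower rungs of the hierarchy behave the same way. As a small sketch (the variable names lgl_num and lgl_chr are made up for illustration), mixing logicals with numbers turns TRUE and FALSE into 1 and 0, while mixing logicals with characters turns them into text:

```r
# Logical mixed with numeric: TRUE becomes 1 and FALSE becomes 0
lgl_num <- c(TRUE, FALSE, 5)
lgl_num
## [1] 1 0 5
class(lgl_num)
## [1] "numeric"

# Logical mixed with character: TRUE is stored as the text "TRUE"
lgl_chr <- c(TRUE, "A")
lgl_chr
## [1] "TRUE" "A"
class(lgl_chr)
## [1] "character"
```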

We can force coercion in the other direction, but mismatches will result in missing (NA) values.

as.numeric(x)
## [1]  1  2  3 NA

Often, it’s the presence of a single character that can cause issues, contaminating a dataset. Consider this data.frame, for example, in which both of our intended numeric columns are corrupted:

dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6 Missing      60
## 7   7     115      40
## 8   8    <NA>      35
## 9   9     105      40
## 10 10     100      50
str(dat)
## 'data.frame':    10 obs. of  3 variables:
##  $ ID     : int  1 2 3 4 5 6 7 8 9 10
##  $ messy_x: chr  "100" "90" "85" "90" ...
##  $ messy_y: chr  "50" "40" "55" "45" ...

In the first case, messy_x, we see that the cause of the issue is that missing data was entered as "Missing" rather than NA. Issues like this can generally be resolved quickly by coercing the column back to the correct type.

dat$messy_x <- as.numeric(dat$messy_x)
## Warning: NAs introduced by coercion
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6      NA      60
## 7   7     115      40
## 8   8      NA      35
## 9   9     105      40
## 10 10     100      50

In the second case, we see that one of the entries is annotated as 55*. As we want to remove the annotation without losing the entry, we will need some more tools. Problems such as this are the primary motivation for this lab.

Lab

This lab will cover some of the most commonly used elements of the stringr package, as well as introduce regular expressions.

A cheat sheet for both of these together is provided here: CHEATSHEET

stringr

The stringr package, as the name implies, was created to assist with the manipulation of strings in R. The motivation for string manipulation arises from a number of avenues. Often, there are issues with data entry, making it difficult to use and manipulate data as intended; this is what we saw in the messy_x column above. More commonly, however, string manipulation can be used to assist us directly in data processing tasks.

The most common tasks we will have include

  1. Mutating strings
  2. Joining and splitting strings
  3. Subsetting and extracting strings
  4. Detecting patterns in strings

To that end, here are some of the more common functions we will be using, though this list is by no means comprehensive:

Function            Description
str_sub()           Extract substring from a given start to end position
str_replace()       Replace the first instance of a substring with another
str_replace_all()   Replace all instances of a substring with another
str_c()             Concatenate or combine strings
str_subset()        Subset character vector with strings matching a pattern
str_detect()        Similar to str_subset, returning a logical vector
str_extract()       Extract matching pattern from a string
str_count()         Count instances of pattern in a string

Most of the functions in the stringr package take at least two arguments: a string and a pattern. The string indicates the character string that we are manipulating, while pattern indicates the sequence of characters that modulate our function. This will make more sense once we consider some examples.

Mutating Strings

Replace strings

The first set of functions we will consider are those that modify strings directly. We already saw above that str_sub can extract a substring; it can also be used to modify specific portions of a string:

# position 1 to 3 is "sta"
x <- "sta230"
str_sub(x, 1, 3)
## [1] "sta"
# replace "sta" with "statistics"
str_sub(x, 1, 3) <- "statistics"
x
## [1] "statistics230"

Using str_sub is a bit awkward, however, as it requires positional indices. More commonly, we use the str_replace function, which takes the two arguments we mentioned above, string and pattern, along with a third, replacement. Here string is the string we are manipulating, pattern indicates what we wish to replace, and replacement is the text that takes its place.

x <- "sta320"
str_replace(string = x, pattern = "sta", replacement = "statistics")
## [1] "statistics320"

Like all stringr functions, this is vectorized, meaning that we can do this on a vector of character strings.

# A vector of hypothetical ids
ids <- c("M289", "M432", "F201", "M365")

ids <- str_replace(string = ids, pattern = "M", replacement = "MALE")
ids
## [1] "MALE289" "MALE432" "F201"    "MALE365"
ids <- str_replace(ids, "F", "FEMALE")
ids
## [1] "MALE289"   "MALE432"   "FEMALE201" "MALE365"

Perhaps surprisingly (and this also applies to many stringr functions), these operations typically act only on the first instance of a pattern match. So, for example, if a pattern shows up multiple times in a string, only the first occurrence is replaced:

dog <- "dog_dog_dog_dog_dog"
str_replace(dog, "dog", "GOD")
## [1] "GOD_dog_dog_dog_dog"

We can specify that we want this to apply to all instances of the pattern with str_replace_all

str_replace_all(dog, "dog", "GOD")
## [1] "GOD_GOD_GOD_GOD_GOD"

Personally, I really like using this function to remove characters I don’t want by replacing a pattern with "" or "_"

x <- "123-456-789"
str_replace_all(x, "-", "")
## [1] "123456789"
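
The same call handles the "_" case mentioned above. A minimal sketch, using a made-up variable name (with_underscores):

```r
library(stringr)

# Swap every hyphen for an underscore instead of deleting it
x <- "123-456-789"
with_underscores <- str_replace_all(x, "-", "_")
with_underscores
## [1] "123_456_789"
```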

Miscellaneous mutations

There are also a collection of mutating functions that operate as housekeeping functions, allowing us to standardize or clean up sloppy data entry. For example, consider the following:

s <- "tHe quIcK brOWn fOX juMPeD oVeR tHe LAzY dOG"

# Make all lower
str_to_lower(s)
## [1] "the quick brown fox jumped over the lazy dog"
# Make all upper
str_to_upper(s)
## [1] "THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG"
# Make a title
str_to_title(s)
## [1] "The Quick Brown Fox Jumped Over The Lazy Dog"
# Make a sentence
str_to_sentence(s)
## [1] "The quick brown fox jumped over the lazy dog"

It’s worth mentioning also a number of functions that handle extra spaces which can sometimes occur with manual data entries. str_trim will trim off all leading and trailing white space, while str_squish will do that and remove repeated spaces in the string

x <- "   oops   too much    space "  
str_trim(x)
## [1] "oops   too much    space"
str_squish(x)
## [1] "oops too much space"
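
If only one end of the string should be cleaned, str_trim also accepts a side argument. The sketch below reuses the string from above; the names left_only and right_only are illustrative only:

```r
library(stringr)

x <- "   oops   too much    space "

# Remove leading white space only; the trailing space is kept
left_only <- str_trim(x, side = "left")
left_only
## [1] "oops   too much    space "

# Remove trailing white space only; the leading spaces are kept
right_only <- str_trim(x, side = "right")
right_only
## [1] "   oops   too much    space"
```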

Joining and splitting

Next we have joining and splitting which, as the heading suggests, involves joining strings together or splitting them apart. Joining strings is especially useful when we want to present summary information based on statistics that have yet to be computed. str_c takes a comma-separated collection of values which it then pastes together to form a single string:

val <- 1:10
str_c("The mean value of val is: ", mean(val))
## [1] "The mean value of val is: 5.5"

str_c can also be used to turn a character vector into a single string by using the collapse argument

s <- c("a", "b", "c")
s
## [1] "a" "b" "c"
str_c(s, collapse = "")
## [1] "abc"
str_c(s, collapse = ", ")
## [1] "a, b, c"
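
Two further behaviors of str_c are worth a quick sketch: it is vectorized over its inputs, and it has a sep argument that is placed between the pieces being joined (as opposed to collapse, which joins the elements of the result). The variable names here are invented for illustration:

```r
library(stringr)

# Vectorized: the single string "x" is paired with each number in turn
labels <- str_c("x", 1:3)
labels
## [1] "x1" "x2" "x3"

# sep goes between the arguments being joined
dashed <- str_c("a", "b", "c", sep = "-")
dashed
## [1] "a-b-c"

# sep and collapse can be combined: join the pieces, then join the results
formula_text <- str_c("x", 1:3, sep = "_", collapse = " + ")
formula_text
## [1] "x_1 + x_2 + x_3"
```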

We have already seen a form of string splitting with the separate function in tidyr. For simple strings, the corresponding function is str_split. As before, . is a special character that needs to be escaped with \\

s <- "cat.dog"
str_split(s, "\\.")
## [[1]]
## [1] "cat" "dog"

This one is a bit strange in that it returns a list object instead of a character vector. This is because, unlike the other functions we have seen so far, str_split splits at every instance of the pattern, so different strings can produce different numbers of pieces. Returning a list allows the result for each element of the character vector to have a different length:

s <- c("oneword", "two.words", "this.has.three")
str_split(s, "\\.")
## [[1]]
## [1] "oneword"
## 
## [[2]]
## [1] "two"   "words"
## 
## [[3]]
## [1] "this"  "has"   "three"

Lists will be covered in more detail later in the semester. For now, it is sufficient to think of them as a more general version of a vector.
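
In the meantime, a short sketch of how to get pieces back out of the list may help. Double square brackets pull out one element of a list, and the base R function lengths() reports how many pieces each string produced (the name pieces is made up for this example):

```r
library(stringr)

s <- c("oneword", "two.words", "this.has.three")
pieces <- str_split(s, "\\.")

# The third list element holds the pieces of the third string
pieces[[3]]
## [1] "this"  "has"   "three"

# The first piece of the second string
pieces[[2]][1]
## [1] "two"

# Number of pieces produced by each string
lengths(pieces)
## [1] 1 2 3
```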

Subsetting and Extracting

The next collection of functions deals with subsetting and extraction. We already saw str_sub in a few contexts, so we will not repeat that here. The two primary functions here are str_extract and str_subset. The first of these, str_extract, extracts from each string the first piece of text that matches pattern, or returns NA if there is no match:

x <- c("apple", "banana", "melon")
str_extract(x, "a")
## [1] "a" "a" NA
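
Note that "banana" returned only a single "a" even though it contains three. When every match is wanted, stringr also provides str_extract_all, which returns a list for the same reason str_split does. A brief sketch (all_a is an illustrative name):

```r
library(stringr)

x <- c("apple", "banana", "melon")
all_a <- str_extract_all(x, "a")

# Every "a" found in "banana"
all_a[[2]]
## [1] "a" "a" "a"

# Number of matches per string; "melon" has none
lengths(all_a)
## [1] 1 3 0
```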

There is also str_subset which will instead retain each element of a character vector matching a pattern, discarding everything else

x <- c("apple", "banana", "melon")
str_subset(x, "a")
## [1] "apple"  "banana"

These two functions seem a bit silly here, as of course searching for an “a” will return an “a”. However, we will soon introduce regular expressions, giving a far more powerful tool for identifying patterns in strings.

Detecting patterns

Finally, we have pattern detection. Like str_subset(), these functions are vastly more useful when used in conjunction with regular expressions. For now, we will introduce them in a simpler context.

First we have str_detect, which returns a logical vector indicating if a pattern was found in a string

s <- c("dog", "cat", "parrot", "hamster")
str_detect(s, pattern = "o")
## [1]  TRUE FALSE  TRUE FALSE

This can be used similarly to str_subset above, where the logical vector indicates which elements we would keep:

s <- c("dog", "cat", "parrot", "hamster")
idx <- str_detect(s, pattern = "o")
s[idx] # true and false values
## [1] "dog"    "parrot"
str_subset(s, "o")
## [1] "dog"    "parrot"
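
Because str_detect returns a logical vector, it also fits naturally inside dplyr's filter when the strings live in a data frame column. The sketch below uses a small made-up data frame (pets, with a hypothetical legs column) purely for illustration:

```r
library(stringr)
library(dplyr)

# A hypothetical data frame with one character column
pets <- data.frame(
  name = c("dog", "cat", "parrot", "hamster"),
  legs = c(4, 4, 2, 4)
)

# Keep only the rows whose name contains the letter "o"
with_o <- filter(pets, str_detect(name, "o"))
with_o$name
## [1] "dog"    "parrot"
```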

Also useful is str_count which counts the number of times a pattern appears in a string

x <- "mississippi"
str_count(x, "s")
## [1] 4

Regular Expressions

Up until this point, the stringr functions we have introduced may seem a bit limited in their usefulness. However, each of the above examples is intended to provide an unambiguous illustration of how each of these functions works. In this section, we combine our stringr functions with patterns produced by regular expressions, greatly expanding their utility.

Regular expressions (regex) are an incredibly powerful tool for specifying patterns in a text; a short collection of examples is provided on the second page of the cheatsheet, which you will want to reference frequently for the remaining portion of this lab.

View

Although we will use it only sparingly in this lab for brevity’s sake, there is a very useful function called str_view which lets you see a pattern visually by enclosing each match in angle brackets, <>. As our regular expressions get more sophisticated, this will help visualize what exactly your pattern is finding, which is especially useful when you are not getting the results you want. In the example below, we take the string "apple" and use regex to match any character contained within the set aeiou. The result of str_view() indicates that both "a" and "e" are matched by that regex.

## Find all the vowels
str_view("apple", "[aeiou]")
## [1] │ <a>ppl<e>

Meta-characters

Meta-characters are characters that have a particular meaning in regex language, meaning that they cannot be used to literally express a string value. We have seen one of these already in R, with the period . which has to be escaped. In regex, the period is used to indicate a “wildcard” character, meaning it will match with anything. By escaping a meta-character with \\, we are able to tell R to interpret the string literally. Here, we count the number of periods in a string

x <- "hello.world"

# As a wildcard, this counts every single character
str_count(x, ".")
## [1] 11
# We must "escape" the period if we want to count periods
str_count(x, "\\.")
## [1] 1
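
As an alternative to escaping, stringr provides the helper fixed(), which asks for the pattern to be matched as literal text rather than interpreted as a regular expression. A minimal sketch (the variable names are illustrative):

```r
library(stringr)

x <- "hello.world"

# fixed() treats "." as a literal period, so no escaping is needed
n_periods <- str_count(x, fixed("."))
n_periods
## [1] 1

# The same idea works with the other pattern-based functions
dashed <- str_replace_all("a.b.c", fixed("."), "-")
dashed
## [1] "a-b-c"
```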

We won’t list out all of the meta-characters here. Instead, understand that in the next sections we will introduce a number of expressions that use special characters. To use those characters literally, we will need to escape them in strings using \\.

Anchoring

Anchoring simply means that we wish to indicate the start or the end of the string. For example, if instead of finding all instances in which a string contains the letter “a” we want to find those that start with the letter “a”, we can “anchor” our expression with "^" which indicates the beginning of a string

s <- c("apple", "pineapple", "banana")

## Subset all strings that contain the letter "a"
str_subset(s, "a")
## [1] "apple"     "pineapple" "banana"
## Subset all strings that begin with the letter "a"
str_subset(s, "^a")
## [1] "apple"

Likewise, we use "$" to indicate the end of a string

## Subset all strings that end with the letter "a"
str_subset(s, "a$")
## [1] "banana"

Of course, if we actually wanted to find “^” or “$”, we would need to escape them

s <- c("h^t", "hat", "hut", "hit")
str_subset(s, "\\^")
## [1] "h^t"
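
The two anchors can also be used together, which requires the pattern to account for the entire string. A short sketch with a made-up vector:

```r
library(stringr)

words <- c("hat", "hats", "that")

# Without anchors, "hat" is found somewhere in every string
contains_hat <- str_subset(words, "hat")
contains_hat
## [1] "hat"  "hats" "that"

# With both anchors, only the string that is exactly "hat" matches
exactly_hat <- str_subset(words, "^hat$")
exactly_hat
## [1] "hat"
```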

Regular Expressions

While the term “regular expressions” does refer to the entire process of specifying patterns, we also refer to the specific patterns themselves as regular expressions. Regular expressions constitute an excellent example in which complicated string processing tasks can be simplified with the use of an LLM; however, there are a few that are common enough that they are worth exploring in some detail. A great place to start will be the period that we have so frequently encountered. In a regular expression, . is a placeholder that is intended to represent any character.

s <- c("X25YW", "X91YW", "R30WY", "X11YW")

# We want all X, followed by 2 of any character, then YW
str_subset(s, "X..YW")
## [1] "X25YW" "X91YW" "X11YW"

Some other immediately useful ones are "\d" and "\b" which refer to digits and word boundaries, respectively. Regular expressions can be combined in powerful ways:

s <- c("some words no numbers", "see you 2nite", "you are L8")

# Find strings that have a digit at the start of a word
str_subset(s, "\\b\\d")
## [1] "see you 2nite"
# Find strings that have digits at end of word
str_subset(s, "\\d\\b")
## [1] "you are L8"

(We read “\b\d” as “\b” is the boundary and, immediately following the boundary, we find a digit “\d”. Likewise, “\d\b” indicates digit then boundary).
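
The doubled backslashes in the code can be confusing at first: the regex itself is \d, but inside an R string the backslash must itself be escaped, so we type "\\d". The sketch below uses writeLines to show what the string actually contains, and then pulls out the first digit of each string with str_extract (first_digit is an illustrative name):

```r
library(stringr)

# The R string "\\b\\d" holds the four characters \b\d
writeLines("\\b\\d")
## \b\d
nchar("\\d")
## [1] 2

s <- c("some words no numbers", "see you 2nite", "you are L8")

# Extract the first digit in each string, or NA if there is none
first_digit <- str_extract(s, "\\d")
first_digit
## [1] NA  "2" "8"
```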

Other common ones include "[:alpha:]", "[:punct:]", "[:lower:]", etc. This is especially useful with punctuation when you don’t want to escape a bunch of characters (though see the cheat sheet, as not all punctuation falls under "[:punct:]"):

s <- c("dogs$(@!", "^&*are", ";[co(]!ol")
# This removes "punctuation"
s <- str_replace_all(s, pattern = "[:punct:]", "")
s
## [1] "dogs$" "^are"  "cool"
# This removes "symbols"
s <- str_replace_all(s, pattern = "[:symbol:]", "")
s
## [1] "dogs" "are"  "cool"

Alternates

Sometimes we want to express patterns as a collection of possible matches or alternatives, or even to exclude patterns. The most common tool for this is square brackets, [], which indicate that we would like to match any one of the characters inside the brackets.

s <- c("a1", "b2", "c3", "d4")

# Only match those with 1 or 3
str_subset(s, pattern = "[13]")
## [1] "a1" "c3"
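
Square brackets also accept ranges written with a hyphen, which saves listing every character. A brief sketch on the same vector (the result names are invented for illustration):

```r
library(stringr)

s <- c("a1", "b2", "c3", "d4")

# Any string containing a letter from a through b
letters_a_to_b <- str_subset(s, "[a-b]")
letters_a_to_b
## [1] "a1" "b2"

# Any string containing a digit from 2 through 4
digits_2_to_4 <- str_subset(s, "[2-4]")
digits_2_to_4
## [1] "b2" "c3" "d4"
```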

Using "^" inside square brackets, we can also request anything but the characters in brackets. We need to be careful, though, because this may match other characters in a string, returning results that you do not intend. Here, we might expect that “a1” and “c3” are not included, but since “a” and “c” are themselves neither 1 nor 3, the pattern still matches those strings.

s <- c("a1", "b2", "c3", "d4")

# Anything BUT 1 or 3
str_subset(s, pattern = "[^13]")
## [1] "a1" "b2" "c3" "d4"
## str_view confirms that each element of s matches this pattern somewhere
str_view(s, pattern = "[^13]")
## [1] │ <a>1
## [2] │ <b><2>
## [3] │ <c>3
## [4] │ <d><4>

By playing around a bit, we can modify expressions to get what we want. In this case, we use “$” to indicate that we don’t want the pattern 1 or 3 occurring at the end of a string (recall from above that “$” is used to indicate the end of a string).

s <- c("a1", "b2", "c3", "d4")

# Does not END in 1 or 3
str_subset(s, pattern = "[^13]$")
## [1] "b2" "d4"

Quantifiers

Quantifiers are special characters in regular expressions that allow us to quantify patterns in particular ways. In all cases, quantifiers modify the pattern that immediately precedes them. The first of these we will look at is "*", which indicates “zero or more” matches. Here, we use it to find strings that contain the character “a” followed by zero or more “p”

s <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
str_subset(s, "ap*")
## [1] "Pineapple" "Pear"      "Orange"    "Peach"     "Banana"

Notice that the “a” in “Pear” is not followed by any “p”; this is still OK because “*” indicates zero or more

This seems a little silly, but its utility becomes more apparent in the following example

s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")

# We want everything that starts with "x", may or may not have "y", ends with "z"
str_detect(s, "xy*z")
## [1] TRUE TRUE TRUE TRUE TRUE
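
Since "*" accepted every string here, it helps to contrast it with two related quantifiers from the cheat sheet: "+" (one or more) and "?" (zero or one). This is a small sketch on the same vector, with illustrative variable names:

```r
library(stringr)

s <- c("xz", "xyz", "xyyz", "xyyyz", "xyyyyz")

# "+" requires at least one "y", so "xz" no longer matches
one_or_more <- str_detect(s, "xy+z")
one_or_more
## [1] FALSE  TRUE  TRUE  TRUE  TRUE

# "?" allows at most one "y", so only "xz" and "xyz" match
zero_or_one <- str_detect(s, "xy?z")
zero_or_one
## [1]  TRUE  TRUE FALSE FALSE FALSE
```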

Quantifiers also allow us to further define patterns. For this next example, we will use str_count to count the number of vowels in a string

s <- "iowa"
str_count(s, "[aeiou]")
## [1] 3
str_view(s, "[aeiou]")
## [1] │ <i><o>w<a>

However, we may be interested in knowing how many instances of uninterrupted vowels there are separated by consonants. In this case, we have “io” and “a” with a non-vowel in between. We can specify that the pattern we are looking for will include one or more of something with “+”

s <- "iowa"
# Counts consecutive vowels as a single instance
str_count(s, "[aeiou]+")
## [1] 2
str_view(s, "[aeiou]+")
## [1] │ <io>w<a>

We can be more precise with our quantifiers using curly braces, "{}" to indicate exactly what quantity constitutes a pattern. For example, suppose we want to know how many instances of exactly two vowels occur in a string:

s <- "iowa"
str_count(s, "[aeiou]{2}")
## [1] 1
str_view(s, "[aeiou]{2}")
## [1] │ <io>wa
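
Curly braces can also express ranges: {n,} means "n or more" and {n,m} means "between n and m". The sketch below uses the word "queueing", chosen here only because it contains five vowels in a row, to show how the count changes with the quantifier:

```r
library(stringr)

w <- "queueing"

# Exactly two at a time: "ue" and "ue" match; the leftover "i" does not
exactly_two <- str_count(w, "[aeiou]{2}")
exactly_two
## [1] 2

# Two or more: the whole run "ueuei" is a single match
two_or_more <- str_count(w, "[aeiou]{2,}")
two_or_more
## [1] 1

# One or two at a time: "ue", "ue", and "i"
one_or_two <- str_count(w, "[aeiou]{1,2}")
one_or_two
## [1] 3
```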

Of course, as with other regular expressions, we must be very precise in what we are asking, for as we ask, thus we shall receive.

Practice Problems

Question 1 Using the functions we have learned so far, modify the string below to produce the string "United_States"

s <- c("  UNITED  STATES  ")

Question 2 Return to the dataset we introduced at the beginning of the lab

dat <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
dat
##    ID messy_x messy_y
## 1   1     100      50
## 2   2      90      40
## 3   3      85      55
## 4   4      90      45
## 5   5     110     55*
## 6   6 Missing      60
## 7   7     115      40
## 8   8    <NA>      35
## 9   9     105      40
## 10 10     100      50

Using the tools we have introduced thus far, correct this dataset so that the missing values in messy_x are recorded as NA and the annotated values in messy_y are modified so that it can be correctly converted to numeric.


Question 3 Use str_extract to extract the first letter of each string in this vector. Then combine those letters with str_c to create a single character string

s <- c("howdy", "otter", "tragedy", "danger", "octopus", "grapes")

Not such a silly function now, is it?


Question 4 A common problem in data processing involves the inclusion of “free response” questions in which respondents are able to enter data without restriction. In the following character vector, respondents were asked to give their phone numbers, but no standardization was enforced when doing so:

phone_strings <- c("Home: (507)-645-5489", 
                   "Cell: 219.917.9871", 
                   "My work phone is 507-202-2332",
                   "I don't have a phone")

Using functions from this lab, extract the phone numbers from the character vector and modify the strings so that all of the numbers are of the form XXX-XXX-XXXX. If a number is not included, it can be left as an empty string ("").

Hints:

  • A pattern like "\\d{3}[-.]" will identify any three digits followed by a hyphen or period
  • Regular expressions can be compounded
  • str_replace can get rid of pesky things we don’t need

Question 5 The following code chunk takes the rownames from the USArrests dataset and uses them to create a data.frame with a single column containing all of the US states

states <- data.frame(states = rownames(USArrests))
head(states)
##       states
## 1    Alabama
## 2     Alaska
## 3    Arizona
## 4   Arkansas
## 5 California
## 6   Colorado

Using the dplyr package along with stringr functions introduced in this lab, create a column that counts how many vowels are included in each state name, and then provide a ratio of the number of vowels to the number of total letters in each state name. Arrange the data set so that states with the highest ratio of vowels to letters are on top. Hint: don’t forget to deal with uppercase/lowercase letters


Question 6 Consider the following character vector and the str_subset call that is intended to return all of the strings with exactly 2 digits. What does it actually return? Why do you think this is? Use str_view to see what patterns are being matched

s <- c("1one", "22two", "333three", "4444four")

# Find
str_subset(s, pattern = "\\d{2}")
## [1] "22two"    "333three" "4444four"

Now modify the regular expression so that it only pulls out "22two". Hint: what is in front of the digits and what is behind them? What regular expressions can you combine to get what you want?


Question 7: Below is a collection of IDs pulled from a database. A valid ID will consist of a single capital letter, followed by three digits

ids <- data.frame(ID = c("ID: Q781","no id","ID: B847","ID: Z12","ID: M120",
            "ID: A123", NA,"ID: H064","ID: AA123","ID: R006","ID: D555",
            "ID: K410","ID: G777", "ID: C019","ID: L999", "NA", "ID: E902","ID: J888",
            "ID: P333","ID: F301","ID: N450", "ID is R569"))

For this problem, you should

  • Identify which strings contain a valid ID
  • Modify the entries so that only the ID number is retained
  • Replace invalid or missing IDs with NA

Question 8: A collaborator has just sent you a data.frame like the one below recording name and demographic data on a list of subjects:

df <- data.frame(Subject = c(
    "name: Alice, sex: F, age: 22",
    "name: Bob, sex: Male, age: 9",
    "name: Carol Smith, sex: Female, age: 31",
    "name: David Crockett, sex: M, age: 45",
    "name: Emily Johnson, sex: F, age: 18",
    "name: Frank, sex: Male, age: 27",
    "name: Grace Lee, sex: Female, age: 54",
    "name: Henry, sex: M, age: 6",
    "name: Isabella Martinez, sex: Female, age: 39",
    "name: Jack, sex: M, age: 72"
  ))

Clean this data frame so that there are three columns: name, sex, and age. Then, apply the appropriate data summaries to find the average age for each sex.


Question 9: Below is a list of emails that were collected from an identity harvesting site. Before we engage in white-collar crime, we need to verify that the email addresses we have are valid. A valid email will have the following properties:

  • The username will contain only alphanumeric characters, dashes, or underscores
  • It will contain an @ symbol
  • The domain will only contain alphanumeric characters
  • Domain extensions can only be 3 lowercase letters

emails <- data.frame(email = c(
  "katie35@gmail.com",
  "hilary@grinnell.edu",
  "abbi12@site.org",
  "kendall7@iastate.gov",
  "brianna99@gmail.com",
  "alex@site.org",
  "megan88@grinnell.edu",
  "madison@iastate.gov",
  "jenny4@gmail.com",
  "kelly@site.org",
  "hilaryduff@gmail.com",     
  "abby@grinnell.education",  
  "kendall@iastate.gov",      
  "brianna@gmail.com",       
  "alex12@site.org",          
  "megan@grinnell.edu",       
  "madison_7@gmail.com",      
  "jenny!@site.org",           
  "noatsymbolgmail.com",      
  "kelly3@iastate.g0v"        
))

For this problem you should:

  1. Retain only valid email addresses (you do not need to correct invalid ones)
  2. Create a column of usernames and a column of domains
  3. Create a table to determine if there are more .gov, .com, or .edu addresses