Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
tweets <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
Rather than a traditional data set, the data here is a character vector of
length 5000, which will be used for the problems related to Question 1. To
reduce the risk of traumatic brain injury, it is advised that you refrain
from reading any of the tweets directly.
Part A: Using the stringr
package, write code to clean these data by removing the Unicode values
(strings like <U+00A0>, which are typically used to
represent non-alphanumeric characters, emojis, or formatting information
such as newlines and tabs). To do this, you may assume that anything
appearing between the characters < and
> can be removed (including < and
> themselves).
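As a sketch of this cleaning step, here is one possible pattern applied to a made-up string (the example string and the pattern `"<[^>]*>"` are illustrative choices, not the required solution):

```r
library(stringr)

# Made-up example string containing Unicode placeholders (not from the real data)
x <- "Who ya gonna call?<U+00A0><U+1F47B> Ghostbusters!"

# "<[^>]*>" matches a "<", any run of characters that are not ">", then ">",
# so str_remove_all() deletes each bracketed Unicode value in one pass
cleaned <- str_remove_all(x, "<[^>]*>")
cleaned
#> [1] "Who ya gonna call? Ghostbusters!"
```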
Part B: On Twitter, a user may echo another
user’s tweet to share it with their own followers by “retweeting”. In
this dataset, every retweet begins with the letters “RT” followed
by “@” and the original user’s Twitter name. For this question, write
code that stores the retweets in a separate object, then use the
length function to find the number of retweets in this
dataset. Alternatively, if you end up with a logical vector, you may use
sum(), which totals the number of elements registered as
TRUE (since TRUE is equivalent to
1).
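Both counting approaches can be sketched on a few made-up tweets (these strings are invented for illustration; `"^RT @"` anchors the match to the start of the string):

```r
library(stringr)

# Made-up tweets (not from the real data)
toy <- c("RT @someone: great movie", "Saw it last night", "RT @critic: meh")

# Retweets begin with "RT @"; "^" anchors the pattern to the string's start
is_rt    <- str_detect(toy, "^RT @")
retweets <- toy[is_rt]

length(retweets)  # [1] 2
sum(is_rt)        # [1] 2 -- each TRUE counts as 1
```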
Part C: Excluding retweets from this problem, find the number of tweets in which “hate” or “hated” (of any capitalization) appears, and the number of tweets in which “love”, “loved”, “looved”, “loooooved”, etc. (i.e., all variants with any number of “o”s and any capitalization) appears.
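One way to express these patterns is with a quantifier (`o+` for “one or more o’s”), an optional character (`d?`), and `regex(ignore_case = TRUE)`. A sketch on made-up strings (whether you also want word boundaries like `\\b` depends on your reading of the question):

```r
library(stringr)

# Made-up tweets (not from the real data)
toy <- c("I LOVED it", "i loooooved it", "hated it", "hat trick", "no opinion")

# "hated?" matches "hate" or "hated"; \\b prevents matching inside other words
n_hate <- sum(str_detect(toy, regex("\\bhated?\\b", ignore_case = TRUE)))

# "lo+ved?" matches "love", "loved", "looved", "loooooved", ...
n_love <- sum(str_detect(toy, regex("\\blo+ved?\\b", ignore_case = TRUE)))

n_hate  # [1] 1
n_love  # [1] 2
```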
Part D: When an individual embeds an image in
their tweet, retweets, or links to another webpage, the resulting object
is rendered in the text as "https://t.co/" followed by a
string of 10 additional characters (alphanumerics only)
identifying the original tweet, image, or link, which we will call the
identifier. For example, the first tweet in the tweets data
is a retweet which references the original as
"https://t.co/mgkk9E7CKR"; here, the identifier is
"mgkk9E7CKR". In this problem, we are going to do an
informal analysis to see whether the assignment of the identifier characters
appears to be random. More specifically, we are going to see if letters
and numbers occur at the same expected rate.
To do this, begin by collecting all of the identifiers that appear in the dataset. Then, across all of the identifiers, determine the total frequency of numeric and alphabetical characters. To determine the rate at which these appear, divide each total frequency by the number of possible characters in that category. For example, there are the numeric digits 0-9, so we should divide the total number of numerics by 10; the resulting value gives the rate at which numerics appear. Having done this, do the same thing for alphabetical characters. For reference, the rates for each of these categories should have values between 500 and 610.
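The extraction-and-tally steps can be sketched on made-up links. Note the assumptions here: the example strings are invented, and the alphabetical rate divides by 52 on the assumption that upper- and lowercase letters count as distinct characters:

```r
library(stringr)

# Made-up tweets containing t.co links (not from the real data)
toy <- c("cool pic https://t.co/mgkk9E7CKR",
         "two links https://t.co/abcDE12345 https://t.co/ZZ99xx00Qq")

# Extract each link, then strip the prefix to keep only the 10-character identifier
ids <- unlist(str_extract_all(toy, "https://t\\.co/[A-Za-z0-9]{10}"))
ids <- str_remove(ids, "https://t\\.co/")

# Split every identifier into single characters and tally digits vs. letters
chars     <- unlist(str_split(ids, ""))
n_digits  <- sum(str_detect(chars, "[0-9]"))
n_letters <- sum(str_detect(chars, "[A-Za-z]"))

# Rates: total frequency divided by the number of possible characters
rate_digits  <- n_digits / 10   # ten digits: 0-9
rate_letters <- n_letters / 52  # assumes 26 lowercase + 26 uppercase letters
```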
Hint: Using str_extract_all() will return a list. We can
turn this back into a vector using unlist(). This would look
something like this:
# This creates a list of each occurrence of the pattern
(ll <- str_extract_all("aeiou", "[aeiou]"))
## [[1]]
## [1] "a" "e" "i" "o" "u"
# This returns it to a character vector
unlist(ll)
## [1] "a" "e" "i" "o" "u"
\(~\)
The following datasets, sourced from Northwestern University’s Storybench, contain six months of news article titles, dates, and teasers from two news organizations: ABC 7 New York and KCRA in California.
ny_stories <- read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories <- read.csv("https://storybench.org/reinventingtv/kcra.csv")
combined <- rbind(data.frame(ny_stories, location = "NY"), data.frame(ca_stories, location = "CA"))
The data frame “combined” contains stories from both organizations with an additional column, “location”, indicating the source.
Using dplyr
and stringr functions, create a summary of these data that
shows how many headlines contain the word “China”, how many contain
“Russia”, and how many contain “Germany” for each of the California and New
York affiliates. (If any entries do not parse
correctly, you may use na.omit to remove them.) Either print out the data frame or use
knitr::kable(df) to create a table in R Markdown that looks
like the one below (of course, yours should have numbers in it). Describe any patterns that you find.
| location | China | Russia | Germany |
|---|---|---|---|
| CA | X | X | X |
| NY | X | X | X |
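The grouped-count pattern can be sketched on a toy data frame (the headlines below are invented; only the column names `headline` and `location` are assumptions standing in for whatever the real data uses):

```r
library(dplyr)
library(stringr)

# Toy stand-in for "combined" (made-up headlines, not the real data)
toy <- data.frame(
  headline = c("China trade talks resume", "Russia probe update",
               "Local news roundup", "China and Russia meet"),
  location = c("CA", "CA", "NY", "NY")
)

# One row per location; each column counts headlines matching a country name
res <- toy %>%
  group_by(location) %>%
  summarise(China   = sum(str_detect(headline, "China")),
            Russia  = sum(str_detect(headline, "Russia")),
            Germany = sum(str_detect(headline, "Germany")))
res
```

Passing `res` to `knitr::kable()` would render it as a table like the one above.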