Please document your answers to all homework questions using R Markdown, submitting your compiled .html output in a zipped folder.
Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search of the word “ghostbusters”.
```r
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
tweets <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
```
Rather than a traditional data set, the data here is a character vector of length 5000, which will be used for the problems related to Question 1. To reduce the risk of traumatic brain injury, it is advised that you refrain from reading any of the tweets directly.
For Part A, using the `stringr` package, write code to clean these data by removing the Unicode values (strings like `<U+00A0>`, which are typically used to represent non-alphanumeric characters, emojis, or formatting information such as newlines and tabs). To do this, you should assume that anything appearing inside of the characters `<` and `>` can be removed (including `<` and `>` themselves).
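As a starting point, here is a minimal sketch of one way to do this with `stringr`; the regular expression shown is one reasonable choice, not the only one:

```r
library(stringr)

## Remove anything of the form <...>, including the angle brackets themselves:
## "<" then any run of characters that are not ">", then ">"
tweets_clean <- str_remove_all(tweets, "<[^>]*>")
```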
On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In this dataset, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For this question, write code that stores retweets into a separate data set, then use the `length` function to find the number of tweets in this dataset. Or, if you end up with a logical, you may use `sum()`, which totals the number registered as `TRUE` (since `TRUE` is equivalent to `1`).
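One possible approach, sketched below, uses `str_detect()` with a pattern anchored at the start of each tweet; the exact pattern is an assumption based on the “RT” followed by “@” description above:

```r
## TRUE for tweets that begin with "RT" followed by a space and "@"
is_retweet <- str_detect(tweets_clean, "^RT @")

retweets <- tweets_clean[is_retweet]
length(retweets)  # number of retweets via length()
sum(is_retweet)   # the same count via sum() on the logical vector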
Excluding retweets from this problem, find the number of tweets where “hate” or “hated” (of any capitalization) appear, and the number of tweets where “love”, “loved”, “looved”, or “loooooved” (i.e., all variants with any number of “o”s or other capitalization) appear.
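A sketch of one way to match these patterns with `stringr`; the regular expressions shown are one reasonable reading of the requirements:

```r
non_rt <- tweets_clean[!is_retweet]

## "hate" or "hated", any capitalization ("hate" also matches inside "hated")
sum(str_detect(non_rt, regex("hate", ignore_case = TRUE)))

## "love", "loved", "looved", "loooooved", ...: "l", one or more "o"s, "ve", optional "d"
sum(str_detect(non_rt, regex("lo+ved?", ignore_case = TRUE)))
```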
When an individual embeds an image in their tweet, retweets, or links to another webpage, the resulting object is rendered in the text as `"https://t.co/"` followed by a string of 10 additional characters (alphanumerics only) identifying the original tweet, image, or link, which we will call the identifier. For example, the first tweet in the `tweets` data is a retweet which references the original as `"https://t.co/mgkk9E7CKR"`. Here, the identifier is `"mgkk9E7CKR"`. In this problem, we are going to do an informal analysis to see if the assignment of the identifier digits appears to be random. More specifically, we are going to see if letters and numbers occur at the same expected rate.
To do this, we are going to begin by considering all of the identifiers that appear in the dataset. Then, across all of the identifiers, we should determine the total frequency of numeric and alphabetical values. To determine the rate at which these appear, we should then divide each total frequency by the number of possible values in that category. For example, there are ten numeric digits (0-9), so we should divide the total number of numerics by 10. The resulting value gives us the rate at which numerics appear. Having done this, we should do the same thing for alphabetical values. For reference, the rates for each of these categories should have values between 500-610.
Hint: Using `str_extract_all()` will return a list. We can turn this back into a vector using `unlist()`. This would look something like this:
```r
# This creates a list of each occurrence of the pattern
(ll <- str_extract_all("aeiou", "[aeiou]"))
## [[1]]
## [1] "a" "e" "i" "o" "u"

# This returns it to a character vector
unlist(ll)
## [1] "a" "e" "i" "o" "u"
```
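Building on the hint, here is a sketch of one way to carry out the analysis. The regular expression and the divisors (10 possible digits; 52 possible letters, counting upper and lower case separately) are assumptions consistent with the description above:

```r
## Extract every "https://t.co/" link along with its 10-character identifier
links <- unlist(str_extract_all(tweets, "https://t\\.co/[A-Za-z0-9]{10}"))

## Keep only the 10-character identifier at the end of each link
ids <- str_remove(links, "https://t\\.co/")

## Split the identifiers into individual characters
chars <- unlist(str_extract_all(ids, "."))

## Rate for numerics: total count of digits divided by the 10 possible values
sum(str_detect(chars, "[0-9]")) / 10

## Rate for letters: total count divided by the 52 possible values (a-z, A-Z),
## assuming upper- and lowercase letters are treated as distinct
sum(str_detect(chars, "[A-Za-z]")) / 52
```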
The following datasets, sourced from Northwestern University’s Storybench, are six months of news article titles, dates, and teasers for two news organizations: ABC 7 New York and KCRA in California.
```r
ny_stories <- read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories <- read.csv("https://storybench.org/reinventingtv/kcra.csv")

combined <- rbind(data.frame(ny_stories, location = "NY"),
                  data.frame(ca_stories, location = "CA"))
```
The data frame `combined` contains stories from both organizations with an additional column, `location`, indicating the source.
Find the total number of headlines published on each day of the week by each of the news organizations, split by whether the headline was published before or after noon. Then recreate the bar chart given below. (Use `na.omit` to remove the entries that don’t parse correctly.)
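A sketch of one approach using `dplyr` and `lubridate` is below. The column name `datetime` and the datetime format are assumptions; check `names(combined)` and adjust the parsing call to match the actual data:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

headline_counts <- combined %>%
  ## Assumed column name and format; adjust mdy_hm() (or use parse_date_time())
  ## to match how dates actually appear in the file
  mutate(parsed      = mdy_hm(datetime),
         day         = wday(parsed, label = TRUE),
         time_of_day = if_else(hour(parsed) < 12, "Before noon", "After noon")) %>%
  na.omit() %>%   # drop the entries that don't parse correctly
  count(location, day, time_of_day)

## One way to draw a comparable bar chart
ggplot(headline_counts, aes(x = day, y = n, fill = time_of_day)) +
  geom_col(position = "dodge") +
  facet_wrap(~ location)
```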
Using the appropriate `dplyr` and `stringr` functions, create a summary of this data that shows how many headlines contain the word “China”, how many contain “Russia”, and how many contain “Germany” for each of the California and New York affiliates. Either print out the data frame or use `knitr::kable(df)` to create a table in R Markdown that looks like the one below (of course, yours should have numbers in it). Describe any patterns that you find.
| location | China | Russia | Germany |
|----------|-------|--------|---------|
| CA       | X     | X      | X       |
| NY       | X     | X      | X       |
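One way to build this summary is sketched below, assuming the headline column is named `headline` (an assumption; adjust if the actual column name differs):

```r
country_counts <- combined %>%
  group_by(location) %>%
  summarize(China   = sum(str_detect(headline, "China"),   na.rm = TRUE),
            Russia  = sum(str_detect(headline, "Russia"),  na.rm = TRUE),
            Germany = sum(str_detect(headline, "Germany"), na.rm = TRUE))

knitr::kable(country_counts)
```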
Manipulate the data to determine the total number of capital letters included in the teasers each month. Which month had the most capital letters? Which had the fewest?
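A sketch of one approach, again assuming the columns are named `datetime` and `teaser` (both assumptions; adjust to match the data):

```r
capitals_by_month <- combined %>%
  mutate(month  = month(mdy_hm(datetime), label = TRUE),  # assumed date format
         n_caps = str_count(teaser, "[A-Z]")) %>%         # capital letters per teaser
  group_by(month) %>%
  summarize(total_caps = sum(n_caps, na.rm = TRUE)) %>%
  arrange(desc(total_caps))

capitals_by_month
```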