Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
tweets <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
Rather than a traditional data set, the data here is a character vector of
length 5000, which will be used for the problems related to Question 1. To
reduce the risk of traumatic brain injury, it is advised that you refrain
from reading any of the tweets directly.
Part A: Using the stringr
package, write code to clean these data by removing the Unicode values
(strings like <U+00A0>, which are typically used to
represent non-alphanumeric characters, emojis, or formatting information
such as newlines and tabs). To do this, you may assume that anything
appearing between the characters < and
> can be removed (including < and
> themselves).
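As a sketch of this cleaning step, here is one possible pattern applied to a made-up string (the example string and the pattern `"<[^>]*>"` are illustrative choices, not the required solution):

```r
library(stringr)

# Made-up example string containing Unicode placeholders (not from the real data)
x <- "Who ya gonna call?<U+00A0><U+1F47B> Ghostbusters!"

# "<[^>]*>" matches a "<", any run of characters that are not ">", then ">",
# so str_remove_all() deletes each bracketed Unicode value in one pass
cleaned <- str_remove_all(x, "<[^>]*>")
cleaned
#> [1] "Who ya gonna call? Ghostbusters!"
```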
Part B: On Twitter, a user may echo another
user’s tweet to share it with their own followers by “retweeting”. In
this dataset, every retweet begins with the letters “RT” followed
by “@” and the original user’s Twitter name. For this question, write
code that stores the retweets in a separate object, then use the
length function to find the number of retweets in this
dataset. Alternatively, if you end up with a logical vector, you may use
sum(), which totals the number of elements registered as
TRUE (since TRUE is equivalent to
1).
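Both counting approaches can be sketched on a few made-up tweets (these strings are invented for illustration; `"^RT @"` anchors the match to the start of the string):

```r
library(stringr)

# Made-up tweets (not from the real data)
toy <- c("RT @someone: great movie", "Saw it last night", "RT @critic: meh")

# Retweets begin with "RT @"; "^" anchors the pattern to the string's start
is_rt    <- str_detect(toy, "^RT @")
retweets <- toy[is_rt]

length(retweets)  # [1] 2
sum(is_rt)        # [1] 2 -- each TRUE counts as 1
```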
Part C: Excluding retweets from this problem, find the number of tweets in which “hate” or “hated” (of any capitalization) appears, and the number of tweets in which “love”, “loved”, “looved”, “loooooved”, etc. (i.e., all variants with any number of “o”s and any capitalization) appears.
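One way to express these patterns is with a quantifier (`o+` for “one or more o’s”), an optional character (`d?`), and `regex(ignore_case = TRUE)`. A sketch on made-up strings (whether you also want word boundaries like `\\b` depends on your reading of the question):

```r
library(stringr)

# Made-up tweets (not from the real data)
toy <- c("I LOVED it", "i loooooved it", "hated it", "hat trick", "no opinion")

# "hated?" matches "hate" or "hated"; \\b prevents matching inside other words
n_hate <- sum(str_detect(toy, regex("\\bhated?\\b", ignore_case = TRUE)))

# "lo+ved?" matches "love", "loved", "looved", "loooooved", ...
n_love <- sum(str_detect(toy, regex("\\blo+ved?\\b", ignore_case = TRUE)))

n_hate  # [1] 1
n_love  # [1] 2
```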
Part D: When an individual embeds an image in
their tweet, retweets, or links to another webpage, the resulting object
is rendered in the text as "https://t.co/" followed by a
string of 10 additional characters (alphanumerics only)
identifying the original tweet, image, or link, which we will call the
identifier. For example, the first tweet in the tweets data
is a retweet which references the original as
"https://t.co/mgkk9E7CKR"; here, the identifier is
"mgkk9E7CKR". In this problem, we are going to do an
informal analysis to see whether the assignment of the identifier characters
appears to be random. More specifically, we are going to see if letters
and numbers occur at the same expected rate.
To do this, begin by collecting all of the identifiers that appear in the dataset. Then, across all of the identifiers, determine the total frequency of numeric and alphabetical characters. To determine the rate at which these appear, divide each total frequency by the number of possible characters in that category. For example, there are the numeric digits 0-9, so we should divide the total number of numerics by 10; the resulting value gives the rate at which numerics appear. Having done this, do the same thing for alphabetical characters. For reference, the rates for each of these categories should have values between 500 and 610.
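The extraction-and-tally steps can be sketched on made-up links. Note the assumptions here: the example strings are invented, and the alphabetical rate divides by 52 on the assumption that upper- and lowercase letters count as distinct characters:

```r
library(stringr)

# Made-up tweets containing t.co links (not from the real data)
toy <- c("cool pic https://t.co/mgkk9E7CKR",
         "two links https://t.co/abcDE12345 https://t.co/ZZ99xx00Qq")

# Extract each link, then strip the prefix to keep only the 10-character identifier
ids <- unlist(str_extract_all(toy, "https://t\\.co/[A-Za-z0-9]{10}"))
ids <- str_remove(ids, "https://t\\.co/")

# Split every identifier into single characters and tally digits vs. letters
chars     <- unlist(str_split(ids, ""))
n_digits  <- sum(str_detect(chars, "[0-9]"))
n_letters <- sum(str_detect(chars, "[A-Za-z]"))

# Rates: total frequency divided by the number of possible characters
rate_digits  <- n_digits / 10   # ten digits: 0-9
rate_letters <- n_letters / 52  # assumes 26 lowercase + 26 uppercase letters
```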
Hint: Using str_extract_all() will return a list. We can
turn this back into a vector using unlist(). This would look
something like this:
# This creates a list of each occurrence of the pattern
(ll <- str_extract_all("aeiou", "[aeiou]"))
## [[1]]
## [1] "a" "e" "i" "o" "u"
# This returns it to a character vector
unlist(ll)
## [1] "a" "e" "i" "o" "u"
\(~\)
The following datasets, sourced from Northwestern University’s Storybench, contain six months of news article titles, dates, and teasers from two news organizations: ABC 7 New York and KCRA in California.
ny_stories <- read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories <- read.csv("https://storybench.org/reinventingtv/kcra.csv")
combined <- rbind(data.frame(ny_stories, location = "NY"), data.frame(ca_stories, location = "CA"))
The data frame “combined” contains stories from both organizations with an additional column, “location”, indicating the source.
Using dplyr
and stringr functions, create a summary of these data that
shows how many headlines contain the word “China”, how many contain
“Russia”, and how many contain “Germany” for each of the California and New
York affiliates. (If any entries do not parse
correctly, you may use na.omit to remove them.) Either print out the data frame or use
knitr::kable(df) to create a table in R Markdown that looks
like the one below (of course, yours should have numbers in it). Describe any patterns that you find.
| location | China | Russia | Germany |
|---|---|---|---|
| CA | X | X | X |
| NY | X | X | X |
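The grouped-count pattern can be sketched on a toy data frame (the headlines below are invented; only the column names `headline` and `location` are assumptions standing in for whatever the real data uses):

```r
library(dplyr)
library(stringr)

# Toy stand-in for "combined" (made-up headlines, not the real data)
toy <- data.frame(
  headline = c("China trade talks resume", "Russia probe update",
               "Local news roundup", "China and Russia meet"),
  location = c("CA", "CA", "NY", "NY")
)

# One row per location; each column counts headlines matching a country name
res <- toy %>%
  group_by(location) %>%
  summarise(China   = sum(str_detect(headline, "China")),
            Russia  = sum(str_detect(headline, "Russia")),
            Germany = sum(str_detect(headline, "Germany")))
res
```

Passing `res` to `knitr::kable()` would render it as a table like the one above.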