Question 1

Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.

## Because the format of twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
tweets <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")

Rather than a traditional data set, the data here are a character vector of length 5000, which will be used for all problems in Question 1. To reduce the risk of traumatic brain injury, it is advised that you refrain from reading any of the tweets directly.

We will begin by considering all of the identifiers that appear in the dataset. Across all of the identifiers, determine the total frequency of numeric and alphabetical values. To determine the rate at which these appear, divide the total frequency by the number of possible values. For example, there are ten numeric digits (0-9), so divide the total number of numerics by 10. The resulting value is the rate at which numerics appear. Then do the same for alphabetical values. For reference, the rate for each of these categories should fall between 500 and 610.

  1. Does the rate at which numbers appear seem similar to the rate at which alphabetical values appear?
  2. Does the rate seem the same for upper and lower case letters?
  3. What should be the rate across all alphanumeric values? How do the rates of numbers, upper case letters, and lower case letters compare?
  4. (Extra Credit) See if you can create a bar chart demonstrating the frequency of appearance for each of the numeric digits. Does anything seem over-represented?
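
As a starting point for the extra credit, here is a minimal sketch of a digit-frequency bar chart. It runs on a small made-up vector called `toy` (hypothetical, standing in for `tweets`) and assumes the stringr package is installed. Converting to a factor with levels 0 through 9 is one way to keep digits that never occur visible in the chart with a count of zero.

```r
library(stringr)

# Hypothetical toy vector standing in for `tweets` (not the real data)
toy <- c("Call 555-2368 NOW", "Who you gonna call in 2016?")

# Pull out every digit across all elements and flatten to one vector
digits <- unlist(str_extract_all(toy, "[0-9]"))

# Tabulate with explicit levels so unused digits still get a bar of height 0
digit_counts <- table(factor(digits, levels = as.character(0:9)))

# Base R bar chart of the frequency of each digit
barplot(digit_counts, xlab = "Digit", ylab = "Frequency",
        main = "Digit frequency (toy data)")
```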

Hint: Using str_extract_all() will return a list. We can turn this back to a vector using unlist(). This would look something like this:

# str_extract_all() comes from the stringr package
library(stringr)
# This creates a list of each occurrence of the pattern
(ll <- str_extract_all("aeiou", "[aeiou]"))
## [[1]]
## [1] "a" "e" "i" "o" "u"
# This returns it to a character vector
unlist(ll)
## [1] "a" "e" "i" "o" "u"
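
The rate calculation described earlier can be sketched in the same way. This is only an illustration on a made-up vector `toy` (a hypothetical stand-in for `tweets`), and the variable names are assumptions rather than part of the assignment. The divisor is the number of possible values in each category: 10 for digits and 26 for each letter case.

```r
library(stringr)

# Hypothetical toy vector standing in for `tweets` (not the real data)
toy <- c("Call 555-2368 NOW", "Who you gonna call in 2016?")

# Total number of matches for each character class across all elements
n_digits <- length(unlist(str_extract_all(toy, "[0-9]")))
n_lower  <- length(unlist(str_extract_all(toy, "[a-z]")))
n_upper  <- length(unlist(str_extract_all(toy, "[A-Z]")))

# Rate = total frequency / number of possible values in the category
digit_rate <- n_digits / 10
lower_rate <- n_lower / 26
upper_rate <- n_upper / 26

c(digits = digit_rate, lower = lower_rate, upper = upper_rate)
```

On the real data the same steps would be applied to `tweets` instead of `toy`.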


Question 2

The following datasets, sourced from Northeastern University’s Storybench, contain six months of news article titles, dates, and teasers for two news organizations: ABC 7 New York and KCRA in California.

ny_stories <- read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories <- read.csv("https://storybench.org/reinventingtv/kcra.csv")
combined <- rbind(data.frame(ny_stories, location = "NY"), data.frame(ca_stories, location = "CA"))

The data frame “combined” contains stories from both organizations with an additional column, “location”, indicating the source.

Describe any patterns that you find.

location   China   Russia   Germany
CA         X       X        X
NY         X       X        X
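
One way to fill in a table of this shape is to count, for each location, how many stories mention each country. The sketch below is an assumption about the intended approach, not a confirmed solution. It uses a small made-up data frame `toy_stories`, and the column name `headline` is hypothetical, so check `names(combined)` for the actual column names before adapting it.

```r
# Hypothetical toy data; the `headline` column name is an assumption
toy_stories <- data.frame(
  headline = c("China trade talks resume", "Russia and China sign deal",
               "Local weather update", "Germany wins match"),
  location = c("NY", "CA", "NY", "CA"),
  stringsAsFactors = FALSE
)

countries <- c("China", "Russia", "Germany")

# For each country, count the headlines per location that mention it
counts <- sapply(countries, function(cn) {
  mentions <- grepl(cn, toy_stories$headline, fixed = TRUE)
  tapply(mentions, toy_stories$location, sum)
})

counts  # rows are locations, columns are countries
```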