Homework 3

Question 1

The babynames package contains a dataset called babynames documenting the number and frequency of all names that appear at least 5 times within a given year as recorded by the US Social Security Administration.

The code below loads the package that contains the dataset. You will need to install the package before it can be used.

# install.packages("babynames")
library(babynames)

Part A

Create a subset of babynames that contains information on the names "Ryan", "Jeff", "Shonda", "Jonathan", "Collin", and "Anna". Next, use ggplot and geom_line(), colored by name, to create a line chart of each name’s frequency by year. Why does this graph look so chaotic? Look at the data frame and explain the issue in 1-2 sentences.

Part B

Create a new graph that fixes the problem you identified in part A and appropriately displays the frequency of each name over time. Don’t forget the logical operators covered in Lab 1.
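As a reminder of how logical operators combine conditions when subsetting, here is a minimal sketch on a made-up data frame. The pets data and its columns are hypothetical and unrelated to babynames; the point is only how %in% and & work together.

```r
# Hypothetical toy data, not part of the assignment
pets <- data.frame(
  name    = c("Rex", "Rex", "Milo", "Luna"),
  species = c("dog", "cat", "cat", "dog"),
  age     = c(3, 5, 2, 7),
  stringsAsFactors = FALSE
)

# %in% tests membership in a set; & requires both conditions to hold
kept <- pets[pets$name %in% c("Rex", "Luna") & pets$species == "dog", ]
print(kept)
```

Only rows that satisfy both conditions are kept, so a name that appears under more than one category is reduced to a single series.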

Question 2

The data frame economics is included in the ggplot2 package and contains US economic data provided by the US Federal Reserve.

library(ggplot2)
data(economics)

Part A

Write code that transforms the economics data from wide to long format, so that each row contains one economic outcome (pce, psavert, uempmed, and unemploy) and its associated value for that date.
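To illustrate what a wide-to-long reshape does, here is a small sketch on made-up data. It assumes the tidyr package and its pivot_longer() function, which is one common option; the assignment does not prescribe a particular function, and the wide data frame and its columns are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one row per date, one column per measure
wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01")),
  temp = c(10.5, 12.0),
  rain = c(80, 65)
)

# Stack every column except date into name/value pairs
long <- pivot_longer(wide, cols = -date,
                     names_to = "measure", values_to = "value")
print(long)
```

Each original row becomes one row per measure, so the two wide rows above turn into four long rows.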

Part B

Using the data you created in Part A, create a line graph in ggplot that places uempmed and psavert on the y-axis and date on the x-axis. Use the color aesthetic to differentiate each of the variables. In the space below your code, briefly describe whether these variables appear to be related and what that relationship might be. (Hint: See ?economics for a description of the variables)

Question 3

For Question 3, we need to install the Lahman package, which is a database of Major League Baseball statistics collected by Sean Lahman from the 1871-2016 seasons. The database contains several tables (data frames) which can be loaded into our environment using the data() function.

# install.packages("Lahman")
library(Lahman)
data("Teams")
data("People")
data("Batting")

Part A

For Part A, use the group_by() and summarize() functions to find the total number of home runs for each player in the Batting data frame. Then, store the top 30 players (with the most career home runs) in a separate data frame. Hint: While there are a number of ways to select the top 30 players, the dplyr function slice_head() might be useful (?slice_head).
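The group, summarize, and slice pattern can be sketched on made-up data as follows. The sales data frame and its columns are hypothetical; note that slice_head() simply takes the first rows, so this sketch sorts with arrange() first, which is one of several possible approaches.

```r
library(dplyr)

# Hypothetical toy data: several rows per seller
sales <- data.frame(
  seller = c("a", "b", "a", "c", "b", "c"),
  units  = c(5, 2, 7, 1, 4, 10),
  stringsAsFactors = FALSE
)

top2 <- sales %>%
  group_by(seller) %>%
  summarize(total = sum(units)) %>%  # one row per seller
  arrange(desc(total)) %>%           # largest totals first
  slice_head(n = 2)                  # keep the first two rows
print(top2)
```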

Part B

It has been hypothesized in several sports that an athlete’s birth month is related to future success in sports. For this final part, include birth month information from the People data frame into the data frame you created in Part A (containing the top 30 home run hitters). Then, create a data visualization exploring whether birth months appear to be uniformly distributed among the players.
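Bringing a column from one table into another is a join on a shared key. The sketch below uses dplyr's left_join() on two made-up tables; the table names, the id key, and the month column are all hypothetical, so check the actual column names in your own data frames.

```r
library(dplyr)

# Hypothetical toy tables sharing an "id" key
scores <- data.frame(id = c("x1", "x2", "x3"),
                     score = c(90, 85, 70),
                     stringsAsFactors = FALSE)
info   <- data.frame(id = c("x1", "x2", "x3", "x4"),
                     month = c(1, 8, 8, 12),
                     stringsAsFactors = FALSE)

# left_join keeps every row of scores and adds matching columns from info
joined <- left_join(scores, info, by = "id")
print(joined)

# Count rows per month as a first look at the distribution
month_counts <- count(joined, month)
print(month_counts)
```

A table of counts like month_counts is a natural input for a bar chart when checking whether a distribution looks uniform.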

Question 4

Using the Teams data frame in the Lahman package, display the top ten teams in terms of “slugging percentage” (SLG) since 1969.

SLG is computed as the team’s total bases divided by the total “at bats” (AB in the data set). To find the total number of bases, you should assign a value of 1 for singles, 2 for doubles, 3 for triples, and 4 for home runs (that is, the sum of all of these will give you the total number of bases).

Hint: The variables X2B, X3B, and HR represent doubles, triples, and home runs, respectively. There is no variable for singles, but one can be computed using the variable H which represents the total number of hits. If we subtract the total number of doubles, triples, and home runs from the hits, we will be left with the total number of singles.
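The arithmetic described above can be checked by hand on a single made-up team. All numbers below are hypothetical and chosen only to make the steps easy to follow; the variable names mirror the columns mentioned in the hint.

```r
# Hypothetical season totals for one made-up team
H   <- 1400  # total hits
X2B <- 280   # doubles
X3B <- 30    # triples
HR  <- 200   # home runs
AB  <- 5500  # at bats

# Singles are the hits that are not doubles, triples, or home runs
singles <- H - X2B - X3B - HR

# Weight each kind of hit by the number of bases it is worth
total_bases <- 1 * singles + 2 * X2B + 3 * X3B + 4 * HR

slg <- total_bases / AB
print(slg)
```

Here singles works out to 890 and total_bases to 2340, so slg is 2340 / 5500.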

Sample output of only the first three teams is printed below to help validate your own solutions:

##   yearID teamID       SLG
## 1   2019    HOU 0.4954570
## 2   2019    MIN 0.4940684
## 3   2003    BOS 0.4908996

2023-09-18
