Due on Gradescope Thursday, Feb 12 at 10pm.

Book Exercises

Chapter 4

Write a short sentence indicating that you read the chapter and have internalized as a critical part of your core identity the importance of maintaining a legible style when writing code.

Chapter 9

Bonus Exercises

Question 1

The data frame economics is included in the ggplot2 package and contains US economic data provided by the US Federal Reserve.

library(ggplot2)

data(economics)

Using the economics dataset, do the following:

  • Retain only the variables date, psavert (personal savings rate), and uempmed (median duration of unemployment)
  • Pivot the data so that each row contains one economic outcome
  • Create a line graph in ggplot using color to differentiate the metrics. Briefly, what relationship do you see between median duration of unemployment and personal savings rate?
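As a reminder of what "one outcome per row" looks like, here is a minimal sketch of a wide-to-long pivot. It assumes the tidyr function pivot_longer() is the tool being used, and it runs on a small made-up data frame (wide, temp, and humidity are hypothetical names, not part of the economics data).

```r
library(tidyr)

# Hypothetical toy data: one row per day, one column per measurement
wide <- data.frame(
  day      = 1:2,
  temp     = c(20, 22),
  humidity = c(0.4, 0.5)
)

# Pivot so that each row holds a single measurement:
# the old column names go into "metric", the cell values into "value"
long <- pivot_longer(
  wide,
  cols      = c(temp, humidity),
  names_to  = "metric",
  values_to = "value"
)

long
```

The long form has one row per day and metric combination, which is the shape ggplot expects when mapping a column such as metric to color.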

Question 2

For this question, we need to install the Lahman package, which is a database of Major League Baseball statistics collected by Sean Lahman from the 1871 season onward. The database contains several data frames, which can be loaded into our environment using the data() function.

# install.packages("Lahman")
library(Lahman)
data("Teams")
data("People")
data("Batting")

Part A: Use the group_by() and summarize() functions to find the total number of home runs for each player in the Batting data frame. Then, store the top 30 players (those with the most career home runs) in a separate data frame. Hint: While there are a number of ways to select the top 30 players, the dplyr function slice_head() might be useful (?slice_head).
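The general group, summarize, and slice pattern mentioned in the hint can be sketched on a small made-up data frame (scores, player, and points are hypothetical names, not Lahman variables). Note that slice_head() simply keeps the first n rows, so the data needs to be sorted first.

```r
library(dplyr)

# Hypothetical toy data: several rows per player
scores <- tibble(
  player = c("a", "a", "b", "c", "c", "c"),
  points = c(3, 4, 10, 1, 1, 1)
)

top2 <- scores %>%
  group_by(player) %>%
  summarize(total = sum(points)) %>%  # one row per player
  arrange(desc(total)) %>%            # largest totals first
  slice_head(n = 2)                   # keep the first two rows

top2
```

Other approaches, such as slice_max(), would also work; this is only one way to take the top rows.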

Part B: It has been hypothesized in several sports that an athlete’s birth month is related to future success in sports. Using your data from Part A, join the birth month information from the People data frame. Then create a data visualization exploring whether birth months appear to be uniformly distributed among the players. Be sure that every month is represented on your axis, even if no players have a birthday in that month.

Question 3

Using the Teams data frame in the Lahman package, display the top ten teams in terms of “slugging percentage” (SLG) since 1969.

SLG is computed as the team’s total bases divided by the total “at bats” (AB in the data set). To find the total number of bases, you should assign a value of 1 for singles, 2 for doubles, 3 for triples, and 4 for home runs (that is, the sum of all of these will give you the total number of bases).

Hint: The variables X2B, X3B, and HR represent doubles, triples, and home runs, respectively. There is no variable for singles, but one can be computed using the variable H which represents the total number of hits. If we subtract the total number of doubles, triples, and home runs from the hits, we will be left with the total number of singles.
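Written as formulas, the definition and the hint above amount to the following, where 1B is shorthand introduced here for the number of singles (it is not a column in Teams):

```latex
\[
\mathrm{1B} = \mathrm{H} - \mathrm{X2B} - \mathrm{X3B} - \mathrm{HR}
\]
\[
\mathrm{SLG} = \frac{1\cdot\mathrm{1B} + 2\cdot\mathrm{X2B} + 3\cdot\mathrm{X3B} + 4\cdot\mathrm{HR}}{\mathrm{AB}}
\]
```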

Sample output of only the first three teams is printed below to help validate your own solutions:

##   yearID teamID     SLG
## 1   2023    ATL 0.50080
## 2   2019    HOU 0.49546
## 3   2019    MIN 0.49407