The babynames package contains a dataset called babynames documenting the number and frequency of all names that appear at least 5 times within a given year as recorded by the US Social Security Administration.
The code below loads the package that contains the dataset. You will need to install the package before it can be used.
# install.packages("babynames")
library(babynames)
Create a subset of babynames that contains information on the names "Ryan", "Jeff", "Shonda", "Jonathan", "Collin", "Anna".
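As a hedged illustration of the kind of subsetting involved, the sketch below filters a small made-up data frame with the %in% operator. The toy_names data and the wanted vector are hypothetical stand-ins; only the column names are assumed to mirror babynames.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (column names assumed to match)
toy_names <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  sex  = c("M", "F", "M", "F"),
  name = c("Ryan", "Anna", "Jeff", "Mary"),
  n    = c(100, 80, 60, 90)
)

# Names to keep (illustrative only)
wanted <- c("Ryan", "Jeff", "Anna")

# %in% is TRUE for rows whose name matches any element of `wanted`
toy_subset <- toy_names %>%
  filter(name %in% wanted)
```

The same pattern should carry over to the real dataset by swapping in babynames and the full list of names.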
Next, use ggplot and the geom geom_line(), colored by name, to create a line chart of each name’s frequency by year. What is happening that is making this graph look so chaotic? Look at the data frame and explain the issue in 1-2 sentences.
Create a new graph that fixes the problem you identified in part A and appropriately displays the frequency of each name over time. Don’t forget the logical operators covered in Lab 1.
The data frame economics is included in the ggplot2 package and contains US economic data provided by the US Federal Reserve.
library(ggplot2)
data(economics)
Write code that transforms the economics data from wide to long format, so that each row contains one economic outcome (pce, psavert, uempmed, and unemploy) and its associated value for that date.
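A minimal sketch of a wide-to-long reshape is shown here on a made-up data frame, assuming the tidyr package is installed. The columns a and b and the output names "variable" and "value" are hypothetical choices, not names required by the assignment.

```r
library(tidyr)

# Hypothetical wide data: one row per date, one column per measure
toy_wide <- data.frame(
  date = as.Date(c("2000-01-01", "2000-02-01")),
  a    = c(1, 2),
  b    = c(10, 20)
)

# Stack columns a and b: their names go to `variable`, their values to `value`
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(a, b),
  names_to  = "variable",
  values_to = "value"
)
```

Each date now appears once per measure, so two dates and two measures give four rows.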
Using the data you created in Part A, create a line graph in ggplot that places uempmed and psavert on the y-axis and date on the x-axis.
Use the color aesthetic to differentiate each of the variables. In the space below your code, briefly describe whether these variables appear to be related and what that relationship might be. (Hint: See ?economics for a description of the variables)
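For reference, a line chart drawn from long-format data might look like the sketch below. It uses a small hypothetical data frame (toy_long, with measures a and b) instead of the economics data, and maps the measure name to the color aesthetic.

```r
library(ggplot2)

# Hypothetical long-format data: one row per date and measure
toy_long <- data.frame(
  date     = as.Date(rep(c("2000-01-01", "2000-02-01"), times = 2)),
  variable = rep(c("a", "b"), each = 2),
  value    = c(1, 2, 10, 20)
)

# One line per measure, distinguished by color
toy_plot <- ggplot(toy_long, aes(x = date, y = value, color = variable)) +
  geom_line()
```

Printing toy_plot renders the chart. Because color is mapped inside aes(), ggplot2 draws a separate line for each value of variable.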
For Question 3, we need to install the Lahman package, which is a database of Major League Baseball statistics collected by Sean Lahman from the 1871-2016 seasons. The database contains several tables (data frames), which can be loaded into our environment using the data() function.
# install.packages("Lahman")
library(Lahman)
data("Teams")
data("People")
data("Batting")
For Part A, use the group_by() and summarize() functions to find the total number of home runs for each player in the Batting data frame. Then, store the top 30 players (with the most career home runs) in a separate data frame. Hint: While there are a number of ways to select the top 30 players, the dplyr function slice_head() might be useful (?slice_head).
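The grouping-and-ranking pattern can be sketched on a made-up table as follows. The toy_batting data is hypothetical and only borrows the playerID and HR column names from Batting; the sketch keeps the top 2 players instead of 30.

```r
library(dplyr)

# Hypothetical season-level data: one row per player per season
toy_batting <- data.frame(
  playerID = c("p1", "p1", "p2", "p3", "p3"),
  yearID   = c(2000, 2001, 2000, 2000, 2001),
  HR       = c(10, 20, 5, 30, 15)
)

# Sum home runs per player, then sort from most to fewest
career_hr <- toy_batting %>%
  group_by(playerID) %>%
  summarize(total_HR = sum(HR, na.rm = TRUE)) %>%
  arrange(desc(total_HR))

# slice_head() keeps the first n rows, so the data must be sorted first
top_two <- career_hr %>%
  slice_head(n = 2)
```

Note that slice_head() simply takes rows from the top of the data frame, which is why the arrange() step comes before it.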
It has been hypothesized in several sports that an athlete’s birth month is related to future success in sports. For this final part, add birth month information from the People data frame to the data frame you created in Part A (containing the top 30 home run hitters). Then, create a data visualization exploring whether birth months appear to be uniformly distributed among the players.
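One possible way to bring a column over from another table is a join on the shared key. The sketch below uses two hypothetical data frames, toy_top and toy_people, and assumes playerID is the key and birthMonth is the column of interest.

```r
library(dplyr)

# Hypothetical summary table of top hitters
toy_top <- data.frame(
  playerID = c("p1", "p3"),
  total_HR = c(30, 45)
)

# Hypothetical player table with birth month stored as a number (1-12)
toy_people <- data.frame(
  playerID   = c("p1", "p2", "p3"),
  birthMonth = c(4, 11, 8)
)

# Keep every row of toy_top and attach the matching birthMonth
toy_joined <- toy_top %>%
  left_join(select(toy_people, playerID, birthMonth), by = "playerID")
```

A left join keeps only the players already in the summary table, so unmatched rows in the player table (p2 here) are dropped.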
Using the Teams data frame in the Lahman package, display the top ten teams in terms of “slugging percentage” (SLG) since 1969.
SLG is computed as the team’s total bases divided by the total “at bats” (AB in the data set). To find the total number of bases, you should assign a value of 1 for singles, 2 for doubles, 3 for triples, and 4 for home runs (that is, the sum of all of these will give you the total number of bases).
Hint: The variables X2B, X3B, and HR represent doubles, triples, and home runs, respectively. There is no variable for singles, but one can be computed using the variable H, which represents the total number of hits. If we subtract the total number of doubles, triples, and home runs from the hits, we will be left with the total number of singles.
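To make the arithmetic concrete, here is a small sketch on made-up numbers. The toy_teams data and the helper columns X1B and total_bases are hypothetical; only H, X2B, X3B, HR, and AB follow the variable names described above.

```r
library(dplyr)

# Hypothetical team totals (not real Lahman data)
toy_teams <- data.frame(
  teamID = c("AAA", "BBB"),
  H      = c(10, 20),
  X2B    = c(2, 5),
  X3B    = c(1, 0),
  HR     = c(1, 5),
  AB     = c(30, 50)
)

toy_slg <- toy_teams %>%
  mutate(
    # Singles are the hits that are not doubles, triples, or home runs
    X1B = H - X2B - X3B - HR,
    # Weight each hit type by the number of bases it is worth
    total_bases = X1B + 2 * X2B + 3 * X3B + 4 * HR,
    SLG = total_bases / AB
  )
```

For team AAA this gives 6 singles and 6 + 4 + 3 + 4 = 17 total bases, so SLG = 17 / 30, or about 0.567.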
Sample output of only the first three teams is printed below to help validate your own solutions:
##   yearID teamID       SLG
## 1   2019    HOU 0.4954570
## 2   2019    MIN 0.4940684
## 3   2003    BOS 0.4908996