ggplot2
When creating your markdown file for this homework, remember to format your submission according to these rules:
The data frame diamonds, like mpg, is included in the ggplot2 package. This data records the attributes of several thousand diamonds sold by a wholesale online retailer. For this question, your goal is to recreate the graph shown below as closely as possible. A few hints:
alpha = 0.3 is used in one of the layers to give each point 30% opacity
library(ggplot2)
data("diamonds")
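As a minimal sketch of the alpha hint, the snippet below draws a scatter plot of diamonds with semi-transparent points. The choice of carat and price as axes is an assumption made purely for illustration; it is not necessarily what the target graph shows.

```r
library(ggplot2)
data("diamonds")

# carat and price are illustrative choices, not necessarily the target graph
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3)  # alpha is set outside aes(), so every point gets 30% opacity

p_alpha
```

Because alpha is a fixed value rather than a mapping, it goes directly in the layer call and not inside aes().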
Simpson’s paradox is a phenomenon in statistics in which a trend that exists in a large group disappears or even reverses within subsets of that group. An example of this arises when considering the relationship between body mass and longevity. When all animals are considered together, it is generally true that larger animals have longer life spans than smaller ones (for example, elephants tend to live longer than mice). However, within a particular species, the opposite is true: smaller body sizes tend to correspond with greater longevity (that is, smaller elephants tend to live longer than larger ones). In other words, when considered in aggregate, one trend appears to be true, and when considered in subsets, the opposite appears to be the case.
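The reversal described above can be reproduced numerically. The sketch below uses synthetic data, made up for illustration and unrelated to the homework data: two groups each have a negative trend, but one group sits higher on both variables, so the pooled trend comes out positive.

```r
# Synthetic data (made up for illustration): within each group y falls
# as x rises, but group B sits higher than group A on both x and y.
set.seed(1)
x_a <- runif(50, 0, 5)
x_b <- runif(50, 10, 15)
sim <- data.frame(
  group = rep(c("A", "B"), each = 50),
  x = c(x_a, x_b),
  y = c(10 - x_a, 30 - x_b) + rnorm(100, sd = 0.5)
)

# Slope when all observations are pooled together
slope_all <- coef(lm(y ~ x, data = sim))[["x"]]

# Slopes when each group is fit separately
slope_a <- coef(lm(y ~ x, data = subset(sim, group == "A")))[["x"]]
slope_b <- coef(lm(y ~ x, data = subset(sim, group == "B")))[["x"]]

c(all = slope_all, A = slope_a, B = slope_b)
```

The pooled slope is positive while both within-group slopes are negative, which is the pattern the paradox refers to.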
For this problem, we will use the College Scorecard data from 2019, which records attributes and outcomes for all primarily undergraduate institutions in the United States with at least 400 full-time students. Note: this data was introduced in the bonus lab, but you do not need to have completed that lab to do this problem.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Using the colleges data, create a scatter plot with Enrollment on the x-axis and Net_Tuition on the y-axis. What do you notice about this graph? Create a second scatter plot, this time adding the log2 transformation to the x-axis. Does anything stand out in this transformed plot?
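One way to apply a log2 transformation to an axis is through the scale layer. The sketch below uses a small toy data frame (made up for illustration, not the colleges data) and passes trans = "log2" to scale_x_continuous(). Note that ggplot2 3.5.0 and later prefer the argument name transform; trans still works there but may print a deprecation message.

```r
library(ggplot2)

# Toy data, made up for illustration: x spans several orders of magnitude
toy <- data.frame(
  x = c(500, 1000, 2000, 4000, 8000, 16000, 32000),
  y = c(12, 15, 11, 9, 10, 7, 8)
)

# The data are unchanged; only the x-axis is placed on a log2 scale
p_log2 <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(trans = "log2")

p_log2
```

On this scale, each doubling of x takes up the same horizontal distance, which spreads out points that would otherwise be bunched near zero.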
Using the transformed plot in Part 1, add a smoothing layer with the argument method = "lm" to create a line of best fit. What appears to be the relationship between tuition and enrollment?
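As a rough illustration of what the smoothing layer does, the sketch below adds geom_smooth(method = "lm") to a scatter plot of toy data (made up for illustration) and compares it with a direct call to lm().

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
)

# method = "lm" fits y ~ x by least squares and draws the fitted
# line along with a shaded confidence band
p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")

p_lm

# The drawn line has the same intercept and slope as this fit
coef(lm(y ~ x, data = toy))
```

When an axis is transformed through a scale, the line is fit to the transformed values, so a straight line on a log2 plot describes a relationship with log2(x).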
To the transformed plot in Part 1 (that is, without the smoothing layer), now add color to the scatter plot based on the variable Private. How do the two groups compare?
Create a plot with the following characteristics:
"log2" scale.
Color the points according to Private.
What appears to be the relationship between tuition and enrollment in private and public colleges? How does this differ from the relationship between enrollment and tuition across all of the colleges? Use the plot you created to justify your answer.
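To see how a color mapping interacts with a smoothing layer, consider the sketch below. It uses synthetic two-group data (made up for illustration; the names sim and group are hypothetical) and is only meant to show the mechanics, not the result for the colleges data.

```r
library(ggplot2)

# Synthetic two-group data, made up for illustration
set.seed(1)
sim <- data.frame(
  group = rep(c("A", "B"), each = 30),
  x = c(runif(30, 0, 5), runif(30, 10, 15))
)
sim$y <- ifelse(sim$group == "A", 10, 30) - sim$x + rnorm(60, sd = 0.5)

# color is mapped inside the top-level aes(), so both layers inherit it
# and geom_smooth() fits one line per group instead of one pooled line
p_groups <- ggplot(sim, aes(x = x, y = y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm")

p_groups
```

If the color mapping were placed only inside geom_point(), the smoothing layer would not see the grouping and would fit a single line to all points.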
For this question, we are going to use the HairEyeColor dataset provided in R. As it is stored as a table by default, we will begin by turning it into a data frame:
hair <- as.data.frame(HairEyeColor)
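It can help to inspect what the conversion produces before plotting. The short sketch below looks at the structure of the resulting data frame and checks that no counts were lost.

```r
hair <- as.data.frame(HairEyeColor)

# One row per Hair/Eye/Sex combination, with the cell count stored in Freq
str(hair)
head(hair)

# The counts are preserved: the data frame and the table have the same total
sum(hair$Freq) == sum(HairEyeColor)
```

Since each row is already a count rather than a single individual, any plot of this data needs to use Freq as the bar height.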
Using this data, create a ggplot with the following characteristics:
Sex on the x-axis and Freq on the y-axis, with the groups colored by eye color
Add stat = "identity" to the layer so that it knows that the variable Freq represents a count
Using the plot that you made, answer the following questions:
Within each hair color, does the distribution of eye color appear to be similar or different between sexes?
Are there any trends that you notice in the relationship between hair and eye color? Specifically:
Create a subset of the hair dataset that includes only individuals with brown hair. Create a bar chart similar to Part 1 (sex on the x-axis, frequency on the y-axis, colored by eye color), but this time change the position argument so that within each sex, the bars sum to one. Based on this plot, which sex seems to have a greater proportion of brown eyes? Which appears to have a greater proportion of blue eyes?
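The following is a minimal sketch of how stat = "identity" and the position argument work together, assuming position = "fill" is the intended setting for bars that sum to one. It subsets to black hair, an arbitrary choice for illustration, and uses the fill aesthetic to color the bars.

```r
library(ggplot2)

hair <- as.data.frame(HairEyeColor)

# Keep a single hair color; "Black" is an arbitrary choice for illustration
black_hair <- subset(hair, Hair == "Black")

# stat = "identity" uses Freq as the bar height instead of counting rows;
# position = "fill" rescales the stacked bars so each x value sums to 1
p_fill <- ggplot(black_hair, aes(x = Sex, y = Freq, fill = Eye)) +
  geom_bar(stat = "identity", position = "fill")

p_fill
```

With position = "fill", the y-axis shows proportions within each bar, so the groups can be compared even when their total counts differ.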