This page represents a brief summary of the topics covered thus far in the course and should not be considered comprehensive. Links to the relevant sections in the labs are included when available.
This first section is a review of some of the fundamentals of the
R
language. This includes common data types, subsetting and
indexing data, logicals, and some useful functions that we have covered
thus far.
While there are a tremendous number of data types in R
,
the two that we have worked with most frequently are vectors
and data frames (Lab
1).
Vectors are one dimensional objects, with each element of the vector being of the same type. The types of vectors are:
c("dog", "cat")
1:5, c(1.2, 2.4)
TRUE
or FALSE
valuesggplot2
. More on factors
in Lab
2Data frames are the most common objects we have been working with and consist of a list of vectors, each of which is the same length. Typically, we thing of data frames as having rows as observations and columns as variables
Logicals
typically refer to either a vector of TRUE/FALSE
values or
the set of statements that result in a logical vector.
A natural way of constructing logical vectors is with the use of logical operators, the main set of which is given below:
Operator | Description |
---|---|
== |
equal to |
!= |
not equal to |
< |
less than (strict) |
<= |
less than or equal to |
> |
greater than (strict) |
>= |
greater than or equal to |
& |
and |
| |
or |
! |
not |
A logical operator, when used on a vector, will return a logical vector the same length as the original vector. For example, consider here a vector of length five:
# Vector of length 5
x <- 1:5
# Logical opertator returns a logical vector, also length 5
x < 3
## [1] TRUE TRUE FALSE FALSE FALSE
Logical vectors are most commonly used to subset an object, returning only those values where the logical vector is true
# Only return values of x where x < 3
x[x < 3]
## [1] 1 2
Logical vectors can be combined using |
and
&
. It is best practice to use parenthesis to ensure the
correct order of operations when these are computed
x[(x < 2) | (x >= 4)]
## [1] 1 4 5
We have also learned about the %in%
operator, which
allows us to see which elements of one character vector are included in
another
x <- c("a", "b", "c", "d", "e")
y <- c("b", "d")
x %in% y
## [1] FALSE TRUE FALSE TRUE FALSE
Note that we read this as “which x
is in
y
”. If we were to flip the order, the question (and the
length of the resulting vector) would be different
y %in% x
## [1] TRUE TRUE
Subsetting involves specifying portions of an R
object
we wish to retain; we saw examples of this in the section above with
logical vectors. More commonly, we talk about subsetting in the case of
data frames where we either want to keep some subset of the variables
(columns), some subset of the observations (rows), or both. We will
illustrate these with a small toy data frame
df <- data.frame(x = 1:5,
y = c("a", "b", "c", "d", "e"),
z = 6:10)
df
## x y z
## 1 1 a 6
## 2 2 b 7
## 3 3 c 8
## 4 4 d 9
## 5 5 e 10
When subsetting columns, we can do this either with the
$
for a single variable, or by using [,]
(note
the comma) syntax for one or more variables
# Only get the column 'x'
df$x
## [1] 1 2 3 4 5
## Get y and z
df[, c("y", "z")]
## y z
## 1 a 6
## 2 b 7
## 3 c 8
## 4 d 9
## 5 e 10
We can subset the rows of a data frame by passing a logical vector on
the left side of the comma in [,]
# Only keep those with x < 3
df[df$x < 3, ]
## x y z
## 1 1 a 6
## 2 2 b 7
We also have the subset
function, which takes as
arguments a data frame and a logical vector
subset(df, x > 2 & y %in% "d")
## x y z
## 4 4 d 9
Of course, dplyr
gives us a much simpler way of
performing most of these operations with filter
and
select
.
A collection of useful functions we have seen so far include
str
, dim
, class
,
print
, subset
, mean
,
sd
, max
, min
, range
,
IQR
, quantile
, summary
,
table
, and prop.table
For functions that compute a statistic (i..e, mean
or
quantile
), there is usually the option
na.rm = FALSE
to not include NA
values. You
can learn more about any function in R
using the
documentation with ?
, i.e., ?quantile
.
ggplot2
Every ggplot consists of three parts:
Here, we briefly review a few of the major components that go into successfully creating a ggplot
We will discuss layers and aesthetics together, noting that aesthetics are what are used to define how the layers are rendered.
Aesthetics are mappings from variables in the dataset, and can
include both numerical and categorical variables. These include both the
x and y axes, color, shape, and fill. Aesthetics are added with the
aes()
function
library(ggplot2)
## Use aesthetics to map x and y axes
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point()
The variables used in aesthetics can either be continuous or
discrete, and this will change the choice of aesthetics or layers that
we wish to use. For example, a scatter plot created with
geom_point()
is useful when comparing the relationship of
two continuous variables, while a box plot is useful for a continuous
variable with a discrete (and potentially several groups).
In some cases, a variable being continuous or discrete will change
the way that the aesthetic is mapped. For example, color uses a
continuous scale for continuous variables and a discrete palette when
variables are discrete. This is particularly important to remember as
variables that are nominally categorical may be stored as a numeric in
the data frame. This was the case in the mpg
dataset where
cyl
(number of cylinders) describes a category of vehicles
but is stored as an integer. To treat a numeric variable as discrete, we
must do so explicitly with factor()
.
## Cylinder as a numeric gives a continuous color scale
ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point()
## Explicitly calling factor makes it discrete
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point()
It’s worth noting in this case that neither of these is necessarily correct or incorrect, as both graphs are capable of providing different information. On the left, for example, we see the obvious transition from low number of cylinders to high as displacement increases, while on the right hand graph, we see that they tend to group in clusters. Ultimately, the correct aesthetic to use is the one that most clearly represents the relationships you are trying to show.
Any aesthetics specified in ggplot()
will apply to all
subsequent layers; any aesthetics specified in a layer will apply
only to that layer. This has implications for summary layers
such as geom_smooth()
, for example, which will create
smoothing data based on groupings provided by the aesthetic
## Color applies to all subsequent layers
ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth()
## Color applies *only* to geom_point
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth()
## Color applies *only* to geom_smooth
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(aes(color = drv))
This section is primarily a review of basic theme elements and aesthetic scales
There are a number of pre-built themes that come included in ggplot,
serving as a template from which you can further tailor your graphics.
Here, for example, we consider a black and white theme, generated by
adding theme_bw()
to our constructed plot:
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() +
labs(x = "Displacement", y = "Highway", title = "Snappy title", color = "Cylinders") +
theme_bw() # adds a black and white theme
Other pre-built themes:
theme_bw()
theme_linedraw()
, theme_light()
and
theme_dark()
theme_minimal()
theme_classic()
theme_void()
You can judge the differences in these themes below:
While the pre-bulit theme functions are able to adjust a number of
features all at once, we can also use the theme
function to
modify things one at a time. Things that are common to modify with a
theme function include rotating or adjusting labels or changing the size
or font of plot and legend titles. Often, these include the use of an
element function (such as element_text
) to modify the
details. Finally, we can use labs()
or xlab()
and ylab()
to change the labels on the x or y axis
Here is an example of changing multiple elements at once.
ggplot(mpg, aes(class, hwy)) +
geom_boxplot() +
theme_bw() +
labs(x = "vehicle class", y = "fuel economy (highway)", title = "title") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5),
plot.title = element_text(color = 'steelblue', face = 'italic', size = 30))
Note: if you use both a pre-built theme and the
theme()
function, you will want to use the pre-built theme
first; otherwise, any changes you made in theme()
will be
over-written
## Theme then rotate
ggplot(mpg, aes(class, hwy)) +
geom_boxplot() +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, color = 'red', size = 20)) +
ggtitle("prebuilt then theme")
## If rotate first, theme will override
ggplot(mpg, aes(class, hwy)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, color = 'red', size = 15)) +
theme_bw() +
ggtitle("theme then prebuilt")
Scales are the functions that mediate the relationship between the
data and aesthetics. We can modify these directly with scale funtions
which take the form of scale_(aes name)_(type)
. For
example, to modify a continuous x axis, we would use
scale_x_continuous
. We will briefly cover again scale
functions for the axes as well as those for the other asthetics.
We use the axes scales for primarily two reasons: to transform the
default scale to something else using the trans
argument or
to modify the breaks and labels of the axis. Using a continuous or
discrete scale will depend on the variable type. Here is an example of
both
ggplot(mpg, aes(displ, drv)) +
geom_boxplot() +
scale_x_continuous(trans = "log2", breaks = c(1.5, 2, 2.5, 6)) +
scale_y_discrete(labels = c(`4` = "four", r = "rear", f = "front"))
Just as with the axes, these can be continuous or discrete, though
often with discrete scales we will use scale_color_brewer()
in order to choose from a palette. This also holds for the
fill
aesthetic which is often used for box plot, bar charts
and histograms. Note also that with aesthetics, we can change the title
of the legend with labs()
ggplot(mpg, aes(hwy, fill = class)) + geom_boxplot() +
scale_fill_brewer(palette = "Dark2") +
labs(fill = "Vehicle Class")
For the continuous color scales, we can use either a gradient or the viridis scale
ggplot(mpg, aes(displ, hwy, color = cty)) +
geom_point() +
labs(color = "Fuel\nEconomy") + # \n means add a new line
scale_color_continuous(type = "viridis", option = "D") + ggtitle("Viridis")
dplyr