This page represents a brief summary of the topics covered thus far in the course and should not be considered comprehensive. Links to the relevant sections in the labs are included when available.

Overview of R

This first section is a review of some of the fundamentals of the R language. This includes common data types, subsetting and indexing data, logicals, and some useful functions that we have covered thus far.

Data types and subsetting

While there are a tremendous number of data types in R, the two that we have worked with most frequently are vectors and data frames (Lab 1).

Vectors are one dimensional objects, with each element of the vector being of the same type. The types of vectors are:

  • Character - Useful for storing names of objects, created with single or double quotes, i.e., c("dog", "cat")
  • Numeric and integer - Whole numbers or decimals, i.e., 1:5, c(1.2, 2.4)
  • Logical - Contains only TRUE or FALSE values
  • Factors - Categorical variables that can be ordered, useful when treating numerics as categories in ggplot2. More on factors in Lab 2

Data frames are the most common objects we have been working with and consist of a list of vectors, each of which is the same length. Typically, we thing of data frames as having rows as observations and columns as variables

Logicals

Logicals typically refer to either a vector of TRUE/FALSE values or the set of statements that result in a logical vector.

A natural way of constructing logical vectors is with the use of logical operators, the main set of which is given below:

Operator Description
== equal to
!= not equal to
< less than (strict)
<= less than or equal to
> greater than (strict)
>= greater than or equal to
& and
| or
! not

A logical operator, when used on a vector, will return a logical vector the same length as the original vector. For example, consider here a vector of length five:

# Vector of length 5
x <- 1:5

# Logical opertator returns a logical vector, also length 5
x < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE

Logical vectors are most commonly used to subset an object, returning only those values where the logical vector is true

# Only return values of x where x < 3
x[x < 3]
## [1] 1 2

Logical vectors can be combined using | and &. It is best practice to use parenthesis to ensure the correct order of operations when these are computed

x[(x < 2) | (x >= 4)]
## [1] 1 4 5

We have also learned about the %in% operator, which allows us to see which elements of one character vector are included in another

x <- c("a", "b", "c", "d", "e")
y <- c("b", "d")

x %in% y
## [1] FALSE  TRUE FALSE  TRUE FALSE

Note that we read this as “which x is in y”. If we were to flip the order, the question (and the length of the resulting vector) would be different

y %in% x
## [1] TRUE TRUE

Subsetting

Subsetting involves specifying portions of an R object we wish to retain; we saw examples of this in the section above with logical vectors. More commonly, we talk about subsetting in the case of data frames where we either want to keep some subset of the variables (columns), some subset of the observations (rows), or both. We will illustrate these with a small toy data frame

df <- data.frame(x = 1:5, 
                 y = c("a", "b", "c", "d", "e"), 
                 z = 6:10)
df
##   x y  z
## 1 1 a  6
## 2 2 b  7
## 3 3 c  8
## 4 4 d  9
## 5 5 e 10

When subsetting columns, we can do this either with the $ for a single variable, or by using [,] (note the comma) syntax for one or more variables

# Only get the column 'x'
df$x
## [1] 1 2 3 4 5
## Get y and z
df[, c("y", "z")]
##   y  z
## 1 a  6
## 2 b  7
## 3 c  8
## 4 d  9
## 5 e 10

We can subset the rows of a data frame by passing a logical vector on the left side of the comma in [,]

# Only keep those with x < 3
df[df$x < 3, ]
##   x y z
## 1 1 a 6
## 2 2 b 7

We also have the subset function, which takes as arguments a data frame and a logical vector

subset(df, x > 2 & y %in% "d")
##   x y z
## 4 4 d 9

Of course, dplyr gives us a much simpler way of performing most of these operations with filter and select.

Useful functions

A collection of useful functions we have seen so far include str, dim, class, print, subset, mean, sd, max, min, range, IQR, quantile, summary, table, and prop.table

For functions that compute a statistic (i..e, mean or quantile), there is usually the option na.rm = FALSE to not include NA values. You can learn more about any function in R using the documentation with ?, i.e., ?quantile.

ggplot2

Every ggplot consists of three parts:

  1. Data - the information we wish to visually present
  2. Aesthetics - A relation between the data and visual properties
  3. Layers - How the data is rendered

Here, we briefly review a few of the major components that go into successfully creating a ggplot

Aesthetics (and facets)

We will discuss layers and aesthetics together, noting that aesthetics are what are used to define how the layers are rendered.

Aesthetics are mappings from variables in the dataset, and can include both numerical and categorical variables. These include both the x and y axes, color, shape, and fill. Aesthetics are added with the aes() function

library(ggplot2)

## Use aesthetics to map x and y axes
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

Continuous and Discrete

The variables used in aesthetics can either be continuous or discrete, and this will change the choice of aesthetics or layers that we wish to use. For example, a scatter plot created with geom_point() is useful when comparing the relationship of two continuous variables, while a box plot is useful for a continuous variable with a discrete (and potentially several groups).

In some cases, a variable being continuous or discrete will change the way that the aesthetic is mapped. For example, color uses a continuous scale for continuous variables and a discrete palette when variables are discrete. This is particularly important to remember as variables that are nominally categorical may be stored as a numeric in the data frame. This was the case in the mpg dataset where cyl (number of cylinders) describes a category of vehicles but is stored as an integer. To treat a numeric variable as discrete, we must do so explicitly with factor().

## Cylinder as a numeric gives a continuous color scale
ggplot(mpg, aes(displ, hwy, color = cyl)) + 
  geom_point()

## Explicitly calling factor makes it discrete
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point()

It’s worth noting in this case that neither of these is necessarily correct or incorrect, as both graphs are capable of providing different information. On the left, for example, we see the obvious transition from low number of cylinders to high as displacement increases, while on the right hand graph, we see that they tend to group in clusters. Ultimately, the correct aesthetic to use is the one that most clearly represents the relationships you are trying to show.

Local specification of aesthetics

Any aesthetics specified in ggplot() will apply to all subsequent layers; any aesthetics specified in a layer will apply only to that layer. This has implications for summary layers such as geom_smooth(), for example, which will create smoothing data based on groupings provided by the aesthetic

## Color applies to all subsequent layers
ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() + 
  geom_smooth()
 
## Color applies *only* to geom_point
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) + 
  geom_smooth()

## Color applies *only* to geom_smooth  
ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  geom_smooth(aes(color = drv))

Themes and Scales

This section is primarily a review of basic theme elements and aesthetic scales

Themes

There are a number of pre-built themes that come included in ggplot, serving as a template from which you can further tailor your graphics. Here, for example, we consider a black and white theme, generated by adding theme_bw() to our constructed plot:

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point() + 
  labs(x = "Displacement", y = "Highway", title = "Snappy title", color = "Cylinders") +
  theme_bw() # adds a black and white theme

Other pre-built themes:

  • theme_bw()
  • theme_linedraw(), theme_light() and theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

You can judge the differences in these themes below:

While the pre-bulit theme functions are able to adjust a number of features all at once, we can also use the theme function to modify things one at a time. Things that are common to modify with a theme function include rotating or adjusting labels or changing the size or font of plot and legend titles. Often, these include the use of an element function (such as element_text) to modify the details. Finally, we can use labs() or xlab() and ylab() to change the labels on the x or y axis

Here is an example of changing multiple elements at once.

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme_bw() +
  labs(x = "vehicle class", y = "fuel economy (highway)", title = "title") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5), 
        plot.title = element_text(color = 'steelblue', face = 'italic', size = 30))

Note: if you use both a pre-built theme and the theme() function, you will want to use the pre-built theme first; otherwise, any changes you made in theme() will be over-written

## Theme then rotate
ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, color = 'red', size = 20)) + 
  ggtitle("prebuilt then theme")

## If rotate first, theme will override
ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, color = 'red', size = 15)) + 
  theme_bw() +
  ggtitle("theme then prebuilt")

Scales

Scales are the functions that mediate the relationship between the data and aesthetics. We can modify these directly with scale funtions which take the form of scale_(aes name)_(type). For example, to modify a continuous x axis, we would use scale_x_continuous. We will briefly cover again scale functions for the axes as well as those for the other asthetics.

Axes scales

We use the axes scales for primarily two reasons: to transform the default scale to something else using the trans argument or to modify the breaks and labels of the axis. Using a continuous or discrete scale will depend on the variable type. Here is an example of both

ggplot(mpg, aes(displ, drv)) + 
  geom_boxplot() + 
  scale_x_continuous(trans = "log2", breaks = c(1.5, 2, 2.5, 6)) + 
  scale_y_discrete(labels = c(`4` = "four", r = "rear", f = "front"))

Color scales

Just as with the axes, these can be continuous or discrete, though often with discrete scales we will use scale_color_brewer() in order to choose from a palette. This also holds for the fill aesthetic which is often used for box plot, bar charts and histograms. Note also that with aesthetics, we can change the title of the legend with labs()

ggplot(mpg, aes(hwy, fill = class)) + geom_boxplot() +
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Vehicle Class")

For the continuous color scales, we can use either a gradient or the viridis scale

ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() + 
  labs(color = "Fuel\nEconomy") + # \n means add a new line
  scale_color_continuous(type = "viridis", option = "D") + ggtitle("Viridis")

Bar charts

Here

Tidy data

Pivots (long and wide)

dplyr

Joins and Merges