Introduction to R and R Markdown

R is an interpreted programming language, which means the computer can execute any piece of code without it first having to be compiled. This makes it a great option for interactive data analysis. This lab will introduce both R and RStudio, an “Integrated Development Environment” or IDE that allows us to interface with R while enjoying all of the human comforts associated with advanced civilization. The goals of this lab are threefold:

  1. Introduce R and RStudio
  2. Learn the basics of report generation with R Markdown
  3. Reacquaint ourselves with some of the basics of the R programming language

The Layout of RStudio

After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels:

  1. Your R Source document (Markdown or R Script) (top left)
  2. The Console (bottom left)
  3. Your Environment (top right)
  4. The Files/Plots/Help viewer (bottom right)

An R Script is like a text file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.

The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).

R is made up of a base language, augmented by external or third-party code called packages, which are stored in a repository. Let’s start by installing some of the packages that are going to be most widely used in this course. Copy the lines below into your Console and run them by pressing Enter.

## Install packages to be used in class
install.packages(c("dplyr", "tidyr", "tinytex", "rmarkdown", 
                   "knitr", "ggplot2", "gridExtra"))

## Install local installation of latex for PDF compilation
tinytex::install_tinytex()
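
If some of these packages are already on your machine, you may not want to reinstall them. The sketch below is one possible way to check, using only base R. The helper name missing_packages() is hypothetical (it is not part of R or of this lab), and note that requireNamespace() loads a package's namespace as a side effect of checking for it.

```r
## Hypothetical helper (not part of base R or this lab):
## return the names in `pkgs` that are not currently installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

course_pkgs <- c("dplyr", "tidyr", "tinytex", "rmarkdown",
                 "knitr", "ggplot2", "gridExtra")
to_install <- missing_packages(course_pkgs)
to_install  # prints the packages that still need installing, if any

## Install only what is missing (commented out so nothing installs by accident)
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result (character(0)) would mean every package in the list was found.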

We load a package in R with the library() function, always at the top of our R document.

## Load ggplot2
library(ggplot2)

R Markdown

R code is generally written in either (1) an R Script, a plain text file that stores code while you write it or (2) an R Markdown document, also a plain text file that permits the interweaving of R code alongside text and graphics, utilizing the “Markdown” authoring framework. In essence, this allows you to go from a plain text R file to a PDF or HTML document in moments. An R Markdown file thus permits you to both:

  1. Write and execute R code
  2. Generate high quality, reproducible reports

A major emphasis in this course is going to be the production of high quality reports. As such, it will be worthwhile to acquaint ourselves with how R Markdown uses plain text to create these documents. You’ll note that you are reading this lab as an HTML document through a browser. You can download and review the source that was used to generate this document.

On your own machine, you can create a new R Markdown file by selecting: File -> New File -> R Markdown. You’ll be given a prompt to compile to either PDF or HTML. For this and future assignments, choose PDF. At the top of RStudio is a blue yarn ball that says “Knit”. Pressing this will generate a PDF from the default text included when you opened the document (a process called “knitting”).

Now let’s examine some of the components of this document a little more closely.

Header and Setup

At the top of the document is the header (written in YAML):

  • The section begins and ends with three - characters
  • It contains the title, author, date, etc., that appear at the top of the page
  • It is used to format details of the document, e.g., table of contents, page numbers, etc.

For the title, you should put the name of the assignment (e.g., “Lab 1 – Introduction”), and for author you should include your name. You are welcome to include a date in whichever format you choose, or you are welcome to delete this line altogether. Other than this, I would avoid messing with this section. Sometimes a space is accidentally added before the YAML or somewhere within it, resulting in the document failing to knit.

Just below the YAML, we see the next core feature of an R Markdown document: a code chunk. A code chunk has the following properties:

  • Code chunks are initiated by ```{r} and closed by ```. You can also insert a chunk with Ctrl+Alt+I
  • The ``` wrappers tell R Markdown that what appears inside is code that should be executed
  • You can execute the R code in a chunk by highlighting the relevant code and using Ctrl+Enter, or by pressing the green “Play” button
  • Any plots created in an R code chunk will be present in the final document.

The first thing you’ll see on any new R Markdown document will be a code chunk labeled “setup” along with “include=FALSE”. Inside this block is the following code:

knitr::opts_chunk$set(echo = TRUE)

This allows us to modify chunk options, qualifiers that impact how code chunks are rendered. For our purposes, there are two ways to modify chunk options:

  1. We can modify all chunks at once by including additional arguments in the setup chunk
  2. We can specify changes in each chunk individually

Inspecting the source code for this document, we see the following options are set:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, 
                      fig.align = 'center', fig.width = 5, fig.height = 4)

The echo argument specifies whether the R source code of a chunk should be displayed in our document alongside its output; generally, the answer to this is yes. The next two, warning = FALSE and message = FALSE, are used to suppress warnings and start-up messages, such as those that come from loading packages. For any presentable document, these should always be set to FALSE. The last three arguments control the production of figures and graphs: fig.align centers our plots, while fig.width and fig.height specify their dimensions. Note that these settings are subsequently applied to all code chunks.

In addition to setting global options, we can also change options for an individual chunk. This is done by placing the arguments within the ```{r} notation at the beginning of the chunk. Check the source code to see how the following plots are resized.

## This uses standard settings
plot(1:10)

(This code chunk starts with ```{r, fig.height=2, fig.width=8} )

## This plot has modified width and height
plot(1:10)

In general, we want to use the global options to set a reasonable default for all of our graphics. Sometimes, a particular graph will need more height or width to be seen clearly. In these cases, changing the individual chunk options allows us to do so without changing the output for the rest of the document.


Following the YAML and the setup code chunk is a boilerplate template introducing how R Markdown works. This should all be deleted before submitting any documents.

Question 1: Create a new R Markdown document with the appropriate YAML and setup. Use a single # to create a header titled Question 1. Then, copy each of the code chunks below into your R Markdown document and adjust the individual figure settings so that the graphics look nice on the page when full screen (do not change the figures themselves).

library(ggplot2)
library(gridExtra)

## Plot 1
ggplot(mpg, aes(cty, hwy, color = drv)) + 
  geom_point() + 
  facet_grid(rows = vars(class))
## Plot 2
ggplot(mpg, aes(trans)) + 
  geom_bar()
## Plot 3
p1 <- ggplot(mtcars, aes(mpg)) + 
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p2 <- ggplot(mtcars, aes(disp)) + 
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p3 <- ggplot(mtcars, aes(qsec)) + 
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p4 <- ggplot(mtcars, aes(wt)) + 
  geom_histogram(color = "black", fill = "gray80", bins = 8)
grid.arrange(p1, p2, p3, p4, nrow = 1)

Markdown Basics

Next, we have a number of formatting options that will render the text in markdown:

  • Section headers are constructed using #: the number of # will determine the level (size) of the header. You will need to put a space after # or else the header will not render correctly (e.g., “# Header” will work, “#Header” will not)
  • Enclosing text in tick marks can be used to type non-executed code: `summary()` -> summary()
  • The use of asterisks can make text bold or italics: **bold** -> bold and *italics* -> italics
  • Dashes, hyphens, or numbers can be used to make lists (such as this one). Be sure to leave a blank line before starting the list and a blank line after it ends. New list items should be separated with a newline

Finally, R Markdown allows us to type ordinary text outside of code chunks, creating a streamlined way to integrate written text into the same document as your code and output.

Question 2: Create a header with a single # (you will do this for all questions) titled Question 2. Then do the following:

  • Consult with yourself and two classmates. Create a numbered list of your and your classmates’ names. Put the names in bold
  • Under each name, create a dashed list of your 3 least favorite animals. Put each animal in italics
  • Verify that the list is rendering correctly after you knit

Covering the R Basics

Most likely you will have seen and used R (and R Markdown) previously. In this section, we will briefly review some of the key aspects of R and how it is used as a computing environment. All of the code in this section can be placed in a code chunk and run or, alternatively, copied directly into the console. I encourage you to run and try the code using either method, but it does not need to be (and should not be) included with the assignment for submission.

Operations and Assignment

In many ways, R functions like a calculator, using standard arithmetic notation that follows the order of operations.

4 + 6 - (24/6)
## [1] 6
5^2 + 2*2
## [1] 29

We can assign the results of an operation using the <- operator, allowing us to use them for subsequent operations. Comments are included in R code using a #

## Assign variables for intermediate computation
x <- 2^2
y <- 5 + 4
z <- (x + y)^2
z
## [1] 169

Typing a variable on a line by itself will cause its value to print. You’ll find any variables that you create in the top right section of RStudio under the Environment panel. You can overwrite the value of a variable by simply reassigning it.

## Current value of x
x
## [1] 4
## Change value of x to 5
x <- 5
x
## [1] 5

We can combine multiple values into a single object through the use of vectors. These are created with c() for “combine” or “concatenate”

vec <- c(2, 4, 6, 8, 10)
vec
## [1]  2  4  6  8 10
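
Once values are combined into a vector, the whole vector can be passed to a function or used in arithmetic at once. The short sketch below illustrates this with a few base R functions that are not otherwise covered in this lab (length(), sum(), and square-bracket indexing); treat it as a preview and not as required material.

```r
vec <- c(2, 4, 6, 8, 10)

## How many values does the vector hold?
length(vec)
## [1] 5

## Add up all of the values
sum(vec)
## [1] 30

## Arithmetic is applied to every element
vec * 2
## [1]  4  8 12 16 20

## Square brackets pull out a single element by position
vec[2]
## [1] 4
```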

If you wish to remove variables from your environment, you can either click the broomstick icon, removing all values, or you can remove variables individually with the function rm()

## Remove x from the environment
rm(x)
x
## Error: object 'x' not found
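
If you would like to check whether a variable is still in your environment without triggering an error, base R's exists() function is one option. The sketch below uses two made-up variable names (height_cm and weight_kg) purely for illustration.

```r
## Two illustrative variables (names are made up for this example)
height_cm <- 170
weight_kg <- 65

exists("height_cm")
## [1] TRUE

## Remove only height_cm
rm(height_cm)
exists("height_cm")
## [1] FALSE

## weight_kg is untouched
exists("weight_kg")
## [1] TRUE

## Removing everything at once is similar to clicking the broomstick icon
## (commented out so it does not clear your workspace unexpectedly)
# rm(list = ls())
```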

Functions and Documentation

Often when working with R, we will be interacting with our data through the use of functions, or pre-built units of code that translate zero or more inputs into one or more outputs. For example, we can find the natural log of the value 4 with the log() function

log(4)
## [1] 1.3863

Many times functions will accept multiple inputs, also known as arguments. Here, we find the log value of 4 in base 2

log(x = 4, base = 2)
## [1] 2
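
Arguments can be matched either by name or by position. As a small sketch of the difference, the calls below all use the same log() function; the only assumption beyond what is shown above is that the default base of log() is exp(1), which is what its documentation states.

```r
## Arguments matched by position: the first is x, the second is base
log(4, 2)
## [1] 2

## Arguments matched by name: the order no longer matters
log(base = 2, x = 4)
## [1] 2

## Leaving out base uses the default, exp(1), giving the natural log
all.equal(log(4), log(4, base = exp(1)))
## [1] TRUE
```

Naming arguments takes a little more typing, but it tends to make code easier to read while you are still learning what each argument does.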

Over time you’ll end up memorizing the arguments of common R functions; however, while you’re learning I strongly encourage you to read the help documentation for any R function used in your code. You can access a function’s documentation by typing a ? in front of the function name and submitting to the console.

?log

Typically, we will type this in the console. Leaving ?log in your R Markdown script will sometimes prevent it from knitting correctly.

Question 3: Look at the documentation for the function mean(). What arguments does it take? Describe what is happening in the following code:

x <- c(1.42, 4.7, 3.9, 8.14, 4.19, 2.55, 5.85, 2.96, 3.56, 4.06)
mean(x, trim = 0.1)
## [1] 3.9712

Question 4: Why does the code below return NA? Read the documentation and make the necessary changes to find a median value of 5

x <- c(3, 6, 4, 9, 2, NA, 7)
median(x)
## [1] NA

Loading data and using the environment

While data in R comes in a wide variety of forms, the most frequent objects we will interact with are called data.frames. A data.frame is a data structure consisting of rows and columns, resulting in a rectangular shape. Consider the following data.frame, made up of 3 vectors of equal length:

df <- data.frame(Name = c("Andy", "Betty", "Carl"), 
                 Age = c(24, 36, 48), 
                 Siblings = c(0, 3, 2))
df
##    Name Age Siblings
## 1  Andy  24        0
## 2 Betty  36        3
## 3  Carl  48        2

There are a few terms here to be familiar with:

  1. A variable is a quantity that is measured, such as a name, age, or number of siblings. These make up the columns of a data frame. (Note: the use of “variable” here is distinct from the use of “variable” in an R context, where we are referring to stored objects in our environment pane)
  2. A value is the particular state of a variable, once measured. Here, the values of the variable Age are 24, 36, and 48
  3. An observation is a collection of measurements or values collected on a single object (e.g., a person). Observations consist of several values, each associated with a variable. The observations make up the rows of our data.frame
  4. This data is tabular, table shaped, or rectangular. This means that each row is an observation, each column is a variable, and each cell represents the value of a variable for a particular observation
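
To connect these terms to code, here is a minimal sketch of how a variable, an observation, and a single value might be pulled out of the data.frame above. The $ and square-bracket notation used here is standard base R but has not been introduced in this lab, so consider it a preview.

```r
## Re-create the example data.frame so this block runs on its own
df <- data.frame(Name = c("Andy", "Betty", "Carl"),
                 Age = c(24, 36, 48),
                 Siblings = c(0, 3, 2))

## A variable (column): all values of Age
df$Age
## [1] 24 36 48

## An observation (row): everything recorded for the second person
df[2, ]
##    Name Age Siblings
## 2 Betty  36        3

## A value (cell): the Age of the second observation
df[2, "Age"]
## [1] 36
```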

While data.frames can be created manually, it is far more common for data to be imported into R, either from data locally stored on your computer, data saved in an R package, or data imported from a URL. For example, the R package ggplot2 contains a dataset called mpg stored as a data.frame. We can access this in the following way:

library(ggplot2) # Load the package
data(mpg) # Load data into environment
head(mpg) # head() will show the first few rows of a data.frame
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Once the data is loaded into the environment, you should see it in the Environment pane in the top right. Immediately, we see that the mpg dataset consists of 234 observations (rows) and 11 variables (columns). Clicking the blue arrow to the left of the data.frame will show you the names of the variables included in the dataset. Clicking the white grid on the right of the pane will open the data.frame in View mode. For data that is included in base R or loaded from a package, you can see a description of the data, along with the variables, by checking the documentation.

?mpg # Documentation for mpg dataset
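
The number of rows and columns shown in the Environment pane can also be retrieved with code. The sketch below uses mtcars, a dataset that ships with base R, as a stand-in so that it runs even without ggplot2 installed; the same functions should work on mpg or any other data.frame.

```r
## Rows and columns together: observations first, then variables
dim(mtcars)
## [1] 32 11

## Number of observations (rows)
nrow(mtcars)
## [1] 32

## Number of variables (columns)
ncol(mtcars)
## [1] 11

## Names of the variables
names(mtcars)
```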

Alternatively, we can load data directly from a .csv (comma-separated values) file. To do so, we either specify a path to a local file or provide a URL. We do this with the read.csv() function, assigning the resulting data.frame to a name of our choosing. Here, we load the Happy Planet dataset from a URL. As this is being read directly, there will not be any documentation available.

## Read dataset from URL
planet <- read.csv("https://collinn.github.io/data/HappyPlanet.csv")
head(planet)
##     Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1   Albania      7       5.5           76.2       2.2 41.7 47.91      54
## 2   Algeria      3       5.6           71.7       1.7 40.1 51.23      40
## 3    Angola      4       4.3           41.7       0.9 17.8 26.78     130
## 4 Argentina      1       7.1           74.8       2.5 53.4 58.95      15
## 5   Armenia      7       5.0           71.7       1.4 36.1 48.28      48
## 6 Australia      2       7.9           80.9       7.8 63.7 36.64     102
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15
## 2         7062 0.733      32.85
## 3         2335 0.446      16.10
## 4        14280 0.869      38.75
## 5         4945 0.775       3.02
## 6        31794 0.962      20.40
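
Reading from a local path works the same way as reading from a URL. Since this lab does not come with a local .csv file, the sketch below first writes a small made-up data.frame to a temporary file and then reads it back. The object names (people, csv_path, people_from_file) are illustrative only; on your own machine you would replace csv_path with the path to your file.

```r
## A small made-up dataset to save
people <- data.frame(Name = c("Andy", "Betty", "Carl"),
                     Age = c(24, 36, 48),
                     Siblings = c(0, 3, 2))

## Write it to a temporary .csv file (row.names = FALSE avoids an extra column)
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)

## Read the file back in from its local path
people_from_file <- read.csv(csv_path)
head(people_from_file)

## Clean up the temporary file
unlink(csv_path)
```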

Question 5: Read the planet data into R and answer the following:

  • What constitutes an observation in this dataset? How many total observations are there?
  • How many variables are in this dataset? List two, then give an example of a value in this data (relating a variable to an observation)