R is an interpreted programming language, which means the
computer can execute any piece of code without it first having to
be compiled. This makes it a great option for interactive data
analysis. This lab will introduce both R and RStudio, an “Integrated
Development Environment” or IDE that allows us to interface with R while
enjoying all of the human comforts associated with advanced
civilization. The goals of this lab are threefold:
After you open RStudio, the first thing you’ll want to
do is open a file to work in. You can do this by navigating: File ->
New File -> R Script, which will open a new window in the top left of
the RStudio user interface for you to work in. At this
point you should see four panels:
An R Script is like a text-file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.
The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).
R is made up of a base language, augmented by external or 3rd party code called packages which are stored in a repository. Let’s start by installing some of the packages that are going to be most widely used in this course. Copy the lines below into your Console and run them by pressing Enter.
## Install packages to be used in class
install.packages(c("dplyr", "tidyr", "tinytex", "rmarkdown",
                   "knitr", "ggplot2", "gridExtra"))
## Install local installation of latex for PDF compilation
tinytex::install_tinytex()
We load a package in R with the library() function,
always at the top of our R document.
## Load ggplot2
library(ggplot2)
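Running install.packages() again downloads a package even if it is already installed, so it can be handy to check what is actually missing first. The following is only a sketch: missing_packages() is a hypothetical helper written for this example, not a function from R or from any package used in this course.

```r
## Hypothetical helper (not part of R): return the packages from
## `pkgs` that are not yet installed on this machine
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

## Packages used in this course
course_pkgs <- c("dplyr", "tidyr", "tinytex", "rmarkdown",
                 "knitr", "ggplot2", "gridExtra")

## Show which ones still need to be installed
to_install <- missing_packages(course_pkgs)
to_install
```

If to_install is not empty, passing it to install.packages(to_install) installs only the missing packages.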
R code is generally written in either (1) an R Script, a plain text
file that stores code while you write it or (2) an R Markdown document,
also a plain text file that permits the interweaving of R
code alongside text and graphics, utilizing the “Markdown” authoring
framework. In essence, this allows you to go from a plain text R file to
a PDF or HTML document in moments. An R Markdown file thus permits you
to both:
A major emphasis in this course is going to be the production of high quality reports. As such, it will be worthwhile to acquaint ourselves with how R Markdown uses plain text to create these documents. You’ll note that you are reading this lab as an HTML document through a browser. You can download and review the source that was used to generate this document.
On your own machine, you can create a new R Markdown file by selecting: File -> New File -> R Markdown. You’ll be given a prompt to compile to either PDF or HTML. For this and future assignments, choose PDF. At the top of RStudio is a blue yarn ball that says “Knit”. Pressing this will generate a PDF from the default text included when you opened the document (a process called “knitting”).
Now let’s examine some of the components of this document a little more closely.
At the top of the document is the header (written in YAML):
The header begins and ends with a line of three - characters. For the title, you should put the name of the assignment (e.g., “Lab 1 – Introduction”), and for author you should include your name. You are welcome to include a date in whichever format you choose, or you are welcome to delete this line altogether. Other than this, I would avoid messing with this section. Sometimes a space is accidentally added before the YAML or somewhere within, resulting in the document failing to knit.
Just below the YAML, we see the next core feature of an R Markdown document: a code chunk. A code chunk has the following properties:
You can run R code in a chunk by highlighting
the relevant code and using Ctrl-Enter, or by pressing the
green “Play” button.
By default, an R code chunk and its output will be present in the final document.
The first thing you’ll see on any new R Markdown document will be a code chunk labeled “setup” along with “include=FALSE”. Inside this block is the following code:
knitr::opts_chunk$set(echo = TRUE)
This allows us to modify chunk options, qualifiers that impact how code chunks are rendered. For our purposes, there are two ways to modify chunk options: globally, in the setup chunk, or individually, in the header of a single chunk.
Inspecting the source code for this document, we see the following options are set:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
                      fig.align = 'center', fig.width = 5, fig.height = 4)
The echo argument specifies whether the R source code of a
chunk should be displayed in our document; generally, the answer to this
is yes. The next two, warning = FALSE and message = FALSE, are
used to suppress warnings and startup messages, such as those that come
from loading packages. For any presentable document, these should always
be set to FALSE. The last three arguments govern the production of
figures and graphs: fig.align centers our plots, while
fig.width and fig.height specify their dimensions in inches.
Note that these settings are subsequently applied to all code
chunks.
In addition to setting global options, we can also change options for an individual chunk. This is done by placing the arguments within the ```{r} notation at the beginning of the chunk. Check the source code to see how the following plots are resized.
## This uses standard settings
plot(1:10)
(This code chunk starts with ```{r, fig.height=2, fig.width=8} )
## This plot has modified width and height
plot(1:10)
In general, we want to use the global options to set a reasonable default for all of our graphics. Sometimes, a particular graph will need more height or width to be seen clearly. In these cases, changing the individual chunk options allows us to do so without changing the output for the rest of the document.
Following the YAML and the setup code chunk is a boilerplate template introducing how R Markdown works. This should all be deleted before submitting any documents.
Question 1: Create a new R Markdown document with the appropriate YAML and setup. Use a single # to create a header titled Question 1. Then, for each of the code chunks below, copy them into your R Markdown document and adjust the individual figure settings so that the graphics look nice on the page when full screen (do not change the figures themselves)
library(ggplot2)
library(gridExtra)
## Plot 1
ggplot(mpg, aes(cty, hwy, color = drv)) +
  geom_point() +
  facet_grid(rows = vars(class))
## Plot 2
ggplot(mpg, aes(trans)) +
  geom_bar()
## Plot 3
p1 <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p2 <- ggplot(mtcars, aes(disp)) +
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p3 <- ggplot(mtcars, aes(qsec)) +
  geom_histogram(color = "black", fill = "gray80", bins = 8)
p4 <- ggplot(mtcars, aes(wt)) +
  geom_histogram(color = "black", fill = "gray80", bins = 8)
grid.arrange(p1, p2, p3, p4, nrow = 1)
Next, we have a number of formatting options that will render the text in markdown:
Finally, R Markdown allows us to type ordinary text outside of code chunks, creating a streamlined way to integrate written text into the same document as your code and output.
Question 2: Create a header with a single # (you will do this for all questions) titled Question 2. Then do the following:
Most likely you will have seen and used R (and R Markdown) previously. In this section, we will briefly review some of the key aspects of R and how it is used as a computing environment. All of the code in this section can be placed in a code chunk and run or, alternatively, copied directly into the console. I encourage you to run and try the code using either method, but it does not need to be (and should not be) included with the assignment for submission.
In many ways, R functions like a calculator, using standard arithmetic notation that follows the order of operations.
4 + 6 - (24/6)
## [1] 6
5^2 + 2*2
## [1] 29
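To make the order of operations concrete, here is a small sketch comparing expressions with and without parentheses. The variable names are chosen only for this illustration.

```r
## Multiplication is evaluated before addition
no_parens <- 2 + 3 * 4
no_parens
## [1] 14

## Parentheses are evaluated first
with_parens <- (2 + 3) * 4
with_parens
## [1] 20

## Exponentiation is evaluated before the leading minus sign
neg_after <- -2^2
neg_after
## [1] -4

## Parentheses make the base negative instead
neg_before <- (-2)^2
neg_before
## [1] 4
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.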
We can assign the results of an operation to a variable using the
<- operator, allowing us to use them for subsequent
operations. Comments are included in R code using a #.
## Assign variables for intermediate computation
x <- 2^2
y <- 5 + 4
z <- (x + y)^2
z
## [1] 169
Typing a variable on a line by itself will cause its value to print. You’ll find any variables that you create in the top right section of RStudio under the Environment panel. You can overwrite the value of a variable by simply reassigning it.
## Current value of x
x
## [1] 4
## Change value of x to 5
x <- 5
x
## [1] 5
We can combine multiple values into a single object through the use
of vectors. These are created with c(), short for
“combine” or “concatenate”.
vec <- c(2, 4, 6, 8, 10)
vec
## [1] 2 4 6 8 10
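Once values are stored in a vector, most operations apply to every element at once. The sketch below reuses vec from above; the other variable names are chosen only for this illustration.

```r
vec <- c(2, 4, 6, 8, 10)

## Arithmetic is applied to each element of the vector
doubled <- vec * 2
doubled
## [1]  4  8 12 16 20

## Square brackets extract elements by position (R counts from 1)
third <- vec[3]
third
## [1] 6

## Functions can summarize the whole vector
n <- length(vec)
total <- sum(vec)
n
## [1] 5
total
## [1] 30
```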
If you wish to remove variables from your environment, you can either
click the broomstick icon, which removes all of them, or you can
remove variables individually with the function rm().
## Remove x from the environment
rm(x)
x
## Error: object 'x' not found
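If you would rather check whether a variable is defined without triggering an error, the base R function exists() takes the variable name as a string. A minimal sketch, using a throwaway variable name chosen for this example:

```r
## Create a variable and confirm that R can find it
tmp_value <- 10
before <- exists("tmp_value")
before
## [1] TRUE

## Remove it and check again
rm(tmp_value)
after <- exists("tmp_value")
after
## [1] FALSE
```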
Often when working with R, we will be interacting with our data
through the use of functions, or pre-built units of code that
translate zero or more inputs into one or more
outputs. For example, we can find the natural log of the value 4
with the log() function.
log(4)
## [1] 1.3863
Many times functions will accept multiple inputs, also known as arguments. Here, we find the log of 4 in base 2.
log(x = 4, base = 2)
## [1] 2
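Arguments can be supplied by name or by position, and the distinction matters. The sketch below shows the same log() call written several ways; the variable names are chosen only for this illustration.

```r
## Named arguments: the order does not matter
named_result <- log(x = 4, base = 2)
reordered_result <- log(base = 2, x = 4)

## Positional arguments: matched in the order the function defines them
positional_result <- log(4, 2)

## Swapping positional arguments changes the meaning:
## this is the log of 2 in base 4
swapped_result <- log(2, 4)

named_result
## [1] 2
reordered_result
## [1] 2
positional_result
## [1] 2
swapped_result
## [1] 0.5
```

Naming arguments takes a little more typing, but it protects against mistakes like the last one.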
Over time you’ll end up memorizing the arguments of common
R functions; however, while you’re learning I strongly
encourage you to read the help documentation for any
R function used in your code. You can access a function’s
documentation by typing a ? in front of the function name
and submitting to the console.
?log
Typically, we will type this in the console. Leaving
?log in your R Markdown script will sometimes prevent it
from knitting correctly.
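For a quick reminder of a function's arguments without opening the full help page, base R also provides args(). This is a small sketch rather than a replacement for reading the documentation; the variable name arg_names is chosen only for this example.

```r
## help("log") is equivalent to typing ?log in the console
## help("log")

## Print the argument list of log(), including default values
args(log)

## Extract just the argument names
arg_names <- names(formals(args(log)))
arg_names
## [1] "x"    "base"
```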
Question 3: Look at the documentation for the
function mean(). What arguments does it take? Describe what
is happening in the following code:
x <- c(1.42, 4.7, 3.9, 8.14, 4.19, 2.55, 5.85, 2.96, 3.56, 4.06)
mean(x, trim = 0.1)
## [1] 3.9712
Question 4: Why does the code below return NA? Read the documentation and make the necessary changes to find a median value of 5
x <- c(3, 6, 4, 9, 2, NA, 7)
median(x)
## [1] NA
While data in R comes in a wide variety of forms, the most frequent objects we will interact with are called data.frames. A data.frame is a data structure consisting of rows and columns, resulting in a rectangular shape. Consider the following data.frame, made up of 3 vectors of equal length:
df <- data.frame(Name = c("Andy", "Betty", "Carl"),
                 Age = c(24, 36, 48),
                 Siblings = c(0, 3, 2))
df
##    Name Age Siblings
## 1  Andy  24        0
## 2 Betty  36        3
## 3  Carl  48        2
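The rows and columns of a data.frame can be inspected with a handful of base R functions. The following sketch rebuilds df from above so that it runs on its own; the other variable names are chosen only for this illustration.

```r
df <- data.frame(Name = c("Andy", "Betty", "Carl"),
                 Age = c(24, 36, 48),
                 Siblings = c(0, 3, 2))

## Number of rows and columns
dims <- dim(df)
dims
## [1] 3 3

## Column names
col_names <- names(df)
col_names
## [1] "Name"     "Age"      "Siblings"

## The $ operator extracts a single column as a vector
ages <- df$Age
mean_age <- mean(ages)
mean_age
## [1] 36

## Square brackets select by [row, column]; here, the second row
second_row <- df[2, ]
```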
There are a few terms here to be familiar with:
The values of the variable Age are 24, 36, and 48.
While data.frames can be created manually, it is far more common for
data to be imported into R, either from data locally stored on your
computer, data saved in an R package, or data imported from a URL. For
example, the R package ggplot2 contains a dataset called
mpg stored as a data.frame. We can access this in the
following way:
library(ggplot2) # Load the package
data(mpg) # Load data into environment
head(mpg) # head() will show the first few rows of a data.frame
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
Once the data is loaded into the environment, you should see it in
the Environment pane in the top right. Immediately, we see that the
mpg dataset consists of 234 observations (rows) and 11
variables (columns). Clicking the blue arrow to the left of the data.frame
will show you the names of the variables included in the dataset.
Clicking the white grid on the right of the pane will open the
data.frame in View mode. For data that is included in base R or loaded
from a package, you can see a description of the data, along with the
variables, by checking the documentation.
?mpg # Documentation for mpg dataset
Alternatively, we can load data directly from a .csv
(comma-separated values) file. To do so, we either specify a path to a
local file or provide a URL. We do this with the
read.csv() function, assigning the resulting data.frame to
a variable name of our choosing. Here, we load the Happy Planet dataset
from a URL. As this is being read directly, there will not be any
documentation available.
## Read dataset from URL
planet <- read.csv("https://collinn.github.io/data/HappyPlanet.csv")
head(planet)
##     Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1   Albania      7       5.5           76.2       2.2 41.7 47.91      54
## 2   Algeria      3       5.6           71.7       1.7 40.1 51.23      40
## 3    Angola      4       4.3           41.7       0.9 17.8 26.78     130
## 4 Argentina      1       7.1           74.8       2.5 53.4 58.95      15
## 5   Armenia      7       5.0           71.7       1.4 36.1 48.28      48
## 6 Australia      2       7.9           80.9       7.8 63.7 36.64     102
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15
## 2         7062 0.733      32.85
## 3         2335 0.446      16.10
## 4        14280 0.869      38.75
## 5         4945 0.775       3.02
## 6        31794 0.962      20.40
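Reading from a local file works the same way as reading from a URL: read.csv() is given a path instead of a web address. The sketch below is self-contained, so it first writes a small made-up data.frame to a temporary file; tempfile() stands in for a real path on your computer, and the names people, csv_path, and people_read are chosen only for this example.

```r
## A small made-up data.frame to save to disk
people <- data.frame(Name = c("Andy", "Betty", "Carl"),
                     Age = c(24, 36, 48))

## Write it to a temporary .csv file (a stand-in for a real file path)
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)

## Read the file back into R from its path
people_read <- read.csv(csv_path)
head(people_read)
```

In practice, csv_path would be replaced with the location of a file on your machine, written as a quoted string.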
Question 5: Read the planet data into R and answer the following: