This is an R Markdown document.
At present, you are reading it formatted as an HTML document, but as we have seen, it can also be rendered as a PDF. When working with an .Rmd file, you can choose how it is presented by selecting the arrow next to the Knit ball of yarn and choosing between “Knit to HTML” or “Knit to PDF”. For most of the work in this class, we will be knitting to a PDF to facilitate submissions to Gradescope.
When you Knit an R Markdown file, it will ask you to first save and name your document. I recommend sticking to a consistent format throughout the semester: something like lastname_firstname_[lab/hwk]#.Rmd. For example, I would name this nolte_collin_lab1.Rmd. After you knit, this will also be the name of your PDF.
For this lab, we will be following along with this document. We can begin by downloading the .Rmd file associated with it here; this will allow us to see side-by-side how the .Rmd file looks compared to the finished output.
For example, reading this online we see that this sentence is bold, while in the .Rmd file we see that the sentence is a plain-text sentence surrounded by two asterisks (**) on each side. This is an example of using markdown to format our documents. For this lab, we will be following along with the .Rmd file, filling bits in as prompted. Let’s begin by introducing some of the more common markdown syntax that we will use in this class.
We can use .Rmd files to write text just as we would with a standard word processor. As we saw previously, using two asterisks creates bold text, whereas using single asterisks creates text that is italicized.
Headers can be created using the # symbol. Using one creates a large header, two a smaller header, and so on. You can see examples of this in the text above (this section on “Text Editing” uses two to create a header). Take note that you need to leave a space between the # and the header text – forgetting to do so will render the text exactly as you have written it.
#For example, this will not show up as a header
We can also create sequential lists. To do so, start new lines with sequential numbers.
We can create unordered lists using stars or dashes.
We can also create horizontal dividers with a line of stars (at least 3, but any number greater is ok). This is a good way to separate your answers from a given problem.
Things are clearer if I put my answers between bars, but I need to be sure to leave a blank line between my text and the bottom line of stars.
Question 1:
Create a new R Markdown file and copy the entirety of this question over to the new file (we will do this for all questions in this lab). Then, proceed with the instructions below.
Between the stars below, do the following:
Once you have done this, Knit to PDF by clicking the ball of yarn above and verify that everything looks like it should.
Whereas R is the programming language responsible for doing our relevant computation, our interactions with R will be primarily through R Studio. When you first open R Studio, your screen will be partitioned into 3 panes: Console, Environment, and Files (and often, Editor).
The console makes up the entire left-hand side of the screen, and it is here that we can interact with R directly. Some information will appear on start-up, followed by a prompt (indicated with >). Try copying and pasting the following lines into your console one at a time and pressing enter:
5 + 2
sqrt(25)
pi
(3^2 + 1)*5
x <- 5
The environment pane is located in the top right and includes all of the named variables that we have created in R. If you copied the lines above correctly, you should see a single line here, indicating that the variable x has the value 5. When starting a new R session, this pane will be empty.
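What the Environment pane shows can also be checked from the console. The sketch below assumes the base R functions ls() and exists(), which are not otherwise covered in this lab; treat it as an optional illustration rather than something we will rely on.

```r
# Create a variable, just as in the console example
x <- 5

# ls() returns the names of the objects in the current environment
ls()

# exists() checks whether an object with a given name has been created
exists("x")
```

After running these lines, the name x should appear both in the output of ls() and in the Environment pane.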
In the bottom right is the Output pane which, in addition to giving us information about our file system, also includes tabs for Plots and Help. Copy each of the lines below into the console to see how this pane changes:
?sqrt
plot(1:10)
You can click any of the tabs in this pane to change what is currently being shown.
Finally, we have the editor. If you start a new R Studio session, there will be nothing in the editor, but this will change if we open either a .R or .Rmd file. An R Script is the simplest type of file which can be created from the menus at the top:
File -> New File -> R Script
The editor pane will appear with a blank R file in which you can write and store R code. Try copying the R commands from the Console section into an R Script, click on a line with code, and press “Ctrl+Enter” on your keyboard. R Studio should execute this code and display the results in the Console. You do not have to save this file, and you can close it when you are finished.
While it is true that R scripts (files that end in .R) are used for the vast majority of written R code, we will primarily be using R Markdown files (files that end in .Rmd).
Perhaps the most obvious utility in using R Markdown documents for this course is their ability to interweave regular text with R. We can include and run R code by including it in a code chunk, such as the one shown below:
# This is a code comment, beginning with #
x <- 4
x^2
## [1] 16
This code will run and display its results when we push the Knit button, but we can also run the code without knitting by again clicking the line with our cursor and pressing “Ctrl+Enter”.
For this portion of the lab, we will be introducing some of the basic mechanics in the R language. This will include vectors, similar to the variables we discussed in class, as well as data.frames, the primary data type we will be using for collections of observations.
You are welcome to follow along on the course website or read the R Markdown directly. In either event, I highly encourage you to try running the code that we see, either by copying it from the website and pasting it into the console, or by running the R Markdown code chunks.
Vectors in R are the simplest type of object. Every vector has two attributes that will dictate how we use it: length and type. We have already seen vectors of length one, which are created by assigning a number to a name using the assignment operator <-
# This is a numeric vector of length 1
x <- 4
x
## [1] 4
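The two attributes mentioned above, length and type, can be inspected directly. This is a small sketch using the base R functions length() and class(); the second vector y is made up for illustration.

```r
# A numeric vector of length 1
x <- 4
length(x)   # number of elements: 1
class(x)    # type of the vector: "numeric"

# A longer vector has the same type but a different length
y <- c(2, 4, 6, 8, 10)
length(y)   # 5
class(y)    # "numeric"
```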
We can create vectors of greater than length one using c(), which is short for “combine”:
y <- c(2, 4, 6, 8, 10)
y
## [1]  2  4  6  8 10
Vectors containing a sequence of consecutive whole numbers can also be created with the : operator
z <- 1:5
z
## [1] 1 2 3 4 5
Every element of a vector has a position that we can access directly using square brackets [] in a process called subsetting. Here, we grab the 3rd element of the vector w
w <- c(3, 6, 9, 12)
w[3]
## [1] 9
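Square brackets are not limited to a single position. As a brief sketch (the particular positions chosen here are arbitrary), we can pass a vector of positions to pull out several elements at once:

```r
w <- c(3, 6, 9, 12)

# A single position returns a single element
w[3]

# A vector of positions returns several elements at once
w[c(1, 4)]

# A sequence created with : works as well
w[2:3]
```

The first call returns 9, the second returns 3 and 12, and the third returns 6 and 9.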
Question 2:
Again, copy the entirety of this question into the R Markdown file you created for Question 1.
Let’s practice creating vectors and subsetting with a short exercise.
Create a vector x that has all of the numbers from 11 to 20.
Just as we saw in class that variables can be quantitative or categorical, so it is with vectors in R. There are primarily 3 types of vectors we need to be aware of:
x <- c(1, 2, 3)           # numeric
x <- c("a", "b", "c")     # character
x <- c(TRUE, FALSE, TRUE) # logical
Don’t worry about memorizing everything – we will continue to review them together as they appear in the course. For now, we will be satisfied with simply having made our introductions.
As we mentioned in class, the type of a variable will often dictate what we are able to do with it. For example, it should make sense to take the mean of a numeric variable, but less so with a character vector. We do this using the R function mean():
# This will return a warning and NA (NA = Not Available/Missing Value)
z <- c("a", "b", "c")
mean(z)
## Warning in mean.default(z): argument is not numeric or logical: returning NA
## [1] NA
x <- c(1, 2, 3)
mean(x)
## [1] 2
Some situations can be confusing based on appearances. The class() function will tell us what type a vector is if we are not sure:
# This looks numeric
z <- c("1", "2", "3")
mean(z)
## Warning in mean.default(z): argument is not numeric or logical: returning NA
## [1] NA
# but actually, it is a character
class(z)
## [1] "character"
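To tie class() back to the three vector types introduced earlier, here is a short sketch; the variable names num_vec, chr_vec, and lgl_vec are made up for this illustration.

```r
num_vec <- c(1, 2, 3)
chr_vec <- c("a", "b", "c")
lgl_vec <- c(TRUE, FALSE, TRUE)

class(num_vec)  # "numeric"
class(chr_vec)  # "character"
class(lgl_vec)  # "logical"
```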
Usually in situations like this, we can coerce a vector to be a different type, though I will make you aware in advance when a situation like this will arise:
# coerce the z variable to be a numeric vector
z_num <- as.numeric(z)
class(z_num)
## [1] "numeric"
mean(z_num)
## [1] 2
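Coercion only works where R can make sense of the value. The sketch below uses a made-up vector in which one entry is not a number. It also uses suppressWarnings() and the na.rm argument of mean(), neither of which has been introduced in this lab, so take it as an optional aside.

```r
# Two entries look like numbers; the middle one does not
z <- c("1", "two", "3")

# Coercion succeeds where it can and returns NA elsewhere.
# R would normally print a warning here; suppressWarnings() hides it
# only to keep the output tidy.
z_num <- suppressWarnings(as.numeric(z))
z_num

# mean() returns NA if any value is missing ...
mean(z_num)

# ... unless we ask it to remove missing values first
mean(z_num, na.rm = TRUE)
```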
Deciding what class a vector should have is often context specific. For example, having numeric vectors is ideal if we plan on using them for mathematical operations (e.g., finding the mean, adding them together, taking a square root), which may make sense for things like age, population size, or temperature. Other times, a variable with numbers may make more sense as a character vector to represent categories: this could be things like room numbers, the year an automobile was manufactured, or group assignments.
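As a small illustration of the room number case, with made-up values, R will happily compute a mean of numbers that are really labels. Coercing them with as.character() marks them as categories instead:

```r
# Room numbers are stored as numbers, but they act as labels
rooms <- c(101, 205, 310)
mean(rooms)   # this runs, but an "average room" is not meaningful

# Coercing to character treats them as categories
rooms_chr <- as.character(rooms)
class(rooms_chr)
rooms_chr
```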
One important characteristic of a vector to note: all elements of a vector must be the same type. They are either all numeric, all logical, or all character. R will force this by default:
z_mix <- c("a", 1, TRUE)
## They are all forced to be character
class(z_mix)
## [1] "character"
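The same forcing happens with other mixtures. In this sketch (the vectors a and b are made up for illustration), logical values mixed with numbers become numbers, while anything mixed with a character becomes character:

```r
# Logical mixed with numeric: TRUE becomes 1 and FALSE becomes 0
a <- c(1, TRUE, FALSE)
a
class(a)

# Numeric mixed with character: everything becomes character
b <- c(1, "a")
b
class(b)
```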
Most of the data we will be working with in this class will be stored in objects known as data frames. Put simply, a data frame is a collection of vectors that are all the same length.
# Create two vectors of the same length
x <- c("Alice", "Mary", "Bob")
y <- c(1, 4, 10)
# We can create a data frame with columns named `name` and `value`
df <- data.frame(name = x, value = y)
df
##    name value
## 1 Alice     1
## 2  Mary     4
## 3   Bob    10
As we can see, this has the format of rows being observations and columns being variables. We can select one of the variables from our data.frame using $, which will extract the vector
# Extract the "name" column
df$name
## [1] "Alice" "Mary" "Bob"
# Extract the "value" column and use it to find median
median(df$value)
## [1] 4
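Because each column is an ordinary vector, the ideas from the vector section carry over. The sketch below rebuilds the same small data frame and uses the base R functions nrow() and ncol(), which have not been introduced in this lab, to count observations and variables.

```r
x <- c("Alice", "Mary", "Bob")
y <- c(1, 4, 10)
df <- data.frame(name = x, value = y)

nrow(df)          # number of observations (rows): 3
ncol(df)          # number of variables (columns): 2
length(df$value)  # each column is a vector with one entry per row: 3
mean(df$value)    # functions that accept a vector work on a column: 5
```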
While we can create data frames directly, it is far more likely that we will be reading into R data that has already been created. The most common way to do so is with the read.csv() function (CSV stands for comma separated values). This can be read directly from your computer OR you can pass in a URL
# Use read.csv to pull Happy Planet data
planet <- read.csv("https://collinn.github.io/data/HappyPlanet.csv")
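Reading from a file on your own computer works the same way, with a file path in place of the URL. Since we do not have a shared file to point to, this sketch first writes a small made-up data frame to a temporary file with write.csv() and then reads it back; the names scores and csv_path are invented for the example.

```r
# Build a small data frame to stand in for a real dataset
scores <- data.frame(name = c("Alice", "Mary", "Bob"), value = c(1, 4, 10))

# Write it to a temporary CSV file (tempfile() gives a throwaway path)
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read it back exactly as we would any other CSV on our computer
scores_from_file <- read.csv(csv_path)
scores_from_file
```

In practice, the path would simply be the location of a .csv file you have saved.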
Once a data frame is read in, you will see it show up in your environment. Right away, we get two critical pieces of information: this data frame has 143 observations and 11 variables (recall that rows are observations and columns are variables).
If we click on the blue arrow to the left, we will get a drop down showing us a more detailed description of the variables we do have. Note also the additional information given about the vectors (num = numeric, int = integer, chr = character).
Clicking on the white box pane to the right of planet brings us to the tabular view in R, giving us the opportunity to navigate and explore the data more directly.
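The information shown in the Environment pane can also be requested from the console. This sketch uses a small made-up data frame called small so that it runs without an internet connection; the same base R functions (dim(), str(), and head()) should work on planet as well.

```r
small <- data.frame(name = c("Alice", "Mary", "Bob"), value = c(1, 4, 10))

dim(small)      # number of rows, then number of columns
str(small)      # one line per variable, including its type (chr, num, ...)
head(small, 2)  # the first 2 rows of the data frame
```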
Question 3:
For this question, we will be using the HappyPlanet data that we have just looked at:
Part A Copy the code above to read the Happy Planet data into your own R Markdown file, saving the dataset to a variable called planet.
Part B Looking at the Happy Planet data, explain in one or two sentences what constitutes an observation in this dataset.
Part C Using $ to extract columns from the dataset, find the mean life expectancy of all countries in the dataset. (Hint: what functions have we seen already in this lab?)
Part D Are there any variables in this dataset that are stored as a numeric that would be better suited as a categorical variable? Explain your answer.
This concludes our first introduction to R. Once you have finished with your lab, make sure that you are able to successfully Knit to PDF. Once that is finished, we are ready to submit to Gradescope.