Lab #1 - First Steps in R

This is a short lab intended to introduce a few aspects of R and RStudio while practicing the format of future class meetings.

Directions (Please read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Please record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Please ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Preamble

The “Preamble” section of labs is something we will work on together as an entire class.

The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:

One partner is the driver, who physically writes code and operates the computer
One partner is the navigator, who reviews the actions of the driver and provides feedback and guidance

Partners should switch roles throughout the “Lab” section. For the first few labs, the less experienced coder should spend more time as the driver.

$~$

The Layout of RStudio

After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating to File -> New File -> R Script, which will open a new panel in the top left of the RStudio user interface for you to work in. At this point you should see four panels:

Your R Script (top left)
The Console (bottom left)
Your Environment (top right)
The Files/Plots/Help viewer (bottom right)

An R Script is like a text-file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.

The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).

Question #0: Create a blank R Script. You will use this R Script to record your answers to future questions in this document.

$~$

Using R

R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.

To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.

4 + 6 - (24/6)

## [1] 6

5 ^ 2 + 2 * 2

## [1] 29

The examples above demonstrate how R can be used as a calculator. However, most of the code we will write will rely upon functions, or pre-built units of code that translate one or more inputs into one or more outputs.

log(x = 4, base = 2)

## [1] 2

In the example above we input an “x” value of 4, and a “base” of 2. The labels given to these inputs, “x” and “base”, are the function’s arguments. The function returns the output “2”, which is $\text{log}_2(4)$.
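As a brief sketch of how arguments are matched (assuming the argument order shown in ?log, where x comes first and base second), named arguments can be given in any order, while unnamed arguments are matched by position:

```r
# Named arguments can appear in any order
log(base = 2, x = 4)

# Unnamed arguments are matched by position: x first, then base
log(4, 2)

# Omitting 'base' falls back to the default, the natural logarithm
log(exp(1))
```

The first two calls both return 2, and the last prints 1. Naming arguments is usually the safer habit while you are still learning a function.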

$~$

Help Documentation

Over time you’ll end up memorizing the arguments of common R functions; however, while you’re learning I strongly encourage you to read the help documentation for any R function used in your code. You can access a function’s documentation by typing a ? in front of the function name and submitting to the console.

?log

$~$

Adding Comments

When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

In your R Script, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your R Script, so long comments requiring multiple lines should each begin with their own “#”.

$~$

Lab

The remainder of the lab is to be completed by you and your lab partner. You should work at a pace that ensures both of you thoroughly understand the lab’s contents and examples.

$~$

Loading Data

An important part of data science is reproducibility, or the ability for two people to independently replicate a task using the same code.

To ensure reproducibility, every analysis begins by importing raw data into R and manipulating it using documented (commented) code. Further, the raw data should be imported using functions:

## Loading a CSV file from a web URL (storing it as "my_data")
my_data <- read.csv("https://some_webpage/some_data.csv")

## Loading a CSV file with a local file path
my_data <- read.csv("H:/path_to_my_data/my_data.csv")

Note that:

Either <- or = can be used to assign something to a named object, but <- is generally preferred for complex reasons we won’t get into in this course.
File paths must use / or \\. A single \ is used by R to start an instance of a special text character. For example, \n creates a new line in a string of text.
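The short sketch below illustrates how R treats the backslash inside a string. The path is the same placeholder used in the example above, written with doubled backslashes; it is not a real file.

```r
# "\n" is a single newline character, so cat() prints two lines here
cat("line one\nline two\n")

# A doubled backslash stands for one literal backslash in the string
win_path <- "H:\\path_to_my_data\\my_data.csv"  # placeholder path, not a real file
cat(win_path, "\n")

# nchar() counts the backslash once: the characters are a, \, and b
n_chars <- nchar("a\\b")
n_chars
```

The final line prints 3, which confirms that "\\" is stored as a single character.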

It’s also worth noting here that a dataset typically organizes its data so that each observation constitutes a single row and each variable is a column.

Question #1: Add code to your script that uses the read.csv() function to create an object named my_data that contains the “Happy Planet” data stored at: https://remiller1450.github.io/data/HappyPlanet.csv

After running your Question #1 code, an entry named “my_data” should appear in the Environment panel (top right).

$~$

Objects and Assignments

R stores data in containers called objects. Data is assigned into an object using <- or =. After assignment, data can be referenced using that object’s name. While there are a wide variety of different objects and types in R, we are going to focus here on three of the most common: vectors, data.frames, and lists.

Vectors

The simplest objects are vectors, which contain one or more elements. A common task in R involves creating a value and assigning it to a named object which we can use for subsequent operations:

x <- 5 # This assigns the value 5 to an object called 'x'
x^2    # We can now reference 'x'

## [1] 25

There are a number of ways to construct a vector, including using the : operator and, most commonly, the c() function:

x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)

## [1] 1 2 3

y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)

## [1] 1 2 3

z <- c("A","B","C") # Vectors can contain many types of values
print(z)

## [1] "A" "B" "C"

One of the most important characteristics of a vector is its class which indicates the types of elements contained in the vector as well as what types of functions are able to use it as input.

The three most important classes of vectors are:

numeric vectors - for example: x = c(1,2,3)
character vectors - for example: x = c("A","B","C")
logical vectors - for example: x = c(TRUE, FALSE, TRUE)

You should always know a vector’s class before using it, which you can determine using the class() function.

chars <- c("1","2","3") # Create a character vector
class(chars)

## [1] "character"

nums <- c(1,2,3) # Create a numeric vector
class(nums)

## [1] "numeric"

While different vectors can be of different classes, all elements of a single vector must be of the same type. If the elements supplied are of different classes, they will be coerced to the broadest type within the vector; for example, numbers can be represented as characters, but characters cannot in general be represented as numbers:

mixed_vec <- c(1,2,3, "a", "b", "c") # has numeric and character elements
class(mixed_vec)

## [1] "character"
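The same coercion applies to logical values. The sketch below uses made-up vectors (the names are only for illustration) to show which type wins in two other combinations:

```r
# Logical values become numbers when mixed with numbers (TRUE -> 1, FALSE -> 0)
num_and_lgl <- c(1, TRUE, FALSE)
class(num_and_lgl)   # "numeric"

# Anything mixed with a character string becomes character
chr_and_lgl <- c("a", TRUE)
class(chr_and_lgl)   # "character"
print(chr_and_lgl)   # "a" "TRUE"
```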

Most functions expect a certain type of input and will produce an error or a warning if the wrong class is given.

mean(chars) # This produces a warning and returns NA; mean() expects a numeric (or logical) vector

## Warning in mean.default(chars): argument is not numeric or logical: returning
## NA

## [1] NA

mean(nums) # This works as intended

## [1] 2
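When numbers have been stored as text, as in chars above, one possible fix is to convert the vector with the base R function as.numeric() before passing it to mean(). This is a minimal sketch; the name nums_from_chars is only for illustration.

```r
chars <- c("1", "2", "3")              # character vector, as above
nums_from_chars <- as.numeric(chars)   # convert text to numbers
class(nums_from_chars)                 # "numeric"
mean(nums_from_chars)                  # 2
```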

Many R functions are vectorized, meaning they can accept a scalar input, for example 1, and return a scalar output, f(1), or they can accept a vector input, such as c(1,2,3), and return a vector c(f(1),f(2),f(3)). For example, sqrt() is vectorized:

nums <- c(1,2,3,4)
sqrt(nums)

## [1] 1.000000 1.414214 1.732051 2.000000
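Arithmetic operators behave the same way. As a small illustration, each operation below is applied element by element:

```r
nums <- c(1, 2, 3, 4)
squares <- nums^2                   # 1 4 9 16
sums <- nums + c(10, 20, 30, 40)    # 11 22 33 44
squares
sums
```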

Data Frames

Data are usually stored in objects called data.frames, which are composed of several vectors of the same length:

df <- data.frame(A = x, B = y, C = z) # Creates a data.frame object 'df'
print(df)

##   A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C

Note that this creates a new data.frame with column names “A”, “B”, and “C”. We can learn about the structure of a data.frame using the str() function:

str(df)

## 'data.frame':    3 obs. of  3 variables:
##  $ A: int  1 2 3
##  $ B: num  1 2 3
##  $ C: chr  "A" "B" "C"

Here, we see that df has 3 observations (rows) of 3 variables (columns), and that the columns are an integer vector, a numeric vector, and a character vector, respectively.

Lists

Finally, we have lists, the most flexible class in R. Lists are like vectors except their elements can be of any type:

# A list made up of a data.frame, a numeric vector of length 4, and a character vector of length 6
my_list <- list(df, nums, mixed_vec)
print(my_list)

## [[1]]
##   A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1] "1" "2" "3" "a" "b" "c"

str(my_list)

## List of 3
##  $ :'data.frame':    3 obs. of  3 variables:
##   ..$ A: int [1:3] 1 2 3
##   ..$ B: num [1:3] 1 2 3
##   ..$ C: chr [1:3] "A" "B" "C"
##  $ : num [1:4] 1 2 3 4
##  $ : chr [1:6] "1" "2" "3" "a" ...

It’s worth noting that a data.frame is a special case of a list in which all of the components are of the same length.
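One way to see this relationship for yourself is sketched below, using a small made-up data frame (example_df is a hypothetical name):

```r
example_df <- data.frame(A = 1:3, B = c("x", "y", "z"))  # made-up data
is.list(example_df)    # TRUE: a data.frame is also a list
length(example_df)     # 2: one list element per column
names(example_df)      # "A" "B"
```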

Question #2: Create a data frame named my_df containing two vectors, $J$ and $K$, where $J$ is created using the seq function to be a sequence from 0 to 100 counting by 10 and $K$ is created using the rep function to replicate the character string “XYZ” the proper number of times. Hint: read the help documentation for each function (seq and rep) to determine the necessary arguments.

Question #3: Use the str() function to inspect the structure of my_data (Happy Planet data). In a comment, indicate how many observations are in this data set and how many variables are of class numeric.

$~$

Indices

Often when dealing with data we will want to work with a subset of an object, or extract relevant portions of it. This is done with indices. For example, suppose we have a vector x and would like to extract the element in its second position and assign it to a new object named b:

x <- 5:10
b <- x[2]
b # second element of x is 6

## [1] 6

Square brackets, [ and ], are used to access a certain position (or multiple positions) within an object. In this example we access the second position of the object x, though we could also extract the second and fourth positions of x:

x[c(2,4)]

## [1] 6 8

Some objects, such as data frames, have multiple dimensions, requiring indices in each dimension (separated by commas) to describe a single element’s position. Indices in the first position indicate a row, while those in the second position indicate a column.

df <- data.frame(x = 1:5, y = 6:10, z = letters[1:5]) 
df

##   x  y z
## 1 1  6 a
## 2 2  7 b
## 3 3  8 c
## 4 4  9 d
## 5 5 10 e

df[2, ] # Everything in row 2

##   x y z
## 2 2 7 b

df[, 3] # Everything in column 3

## [1] "a" "b" "c" "d" "e"

df[2,3] # The element in row 2, column 3

## [1] "b"

As was the case with vectors, we can use multiple indices to select multiple rows (or columns):

# Selects 2nd and 4th rows
df[c(2, 4), ]

##   x y z
## 2 2 7 b
## 4 4 9 d

Although we will not work with many lists directly in this course, it is worth being aware of a special case when subsetting a list. Whenever the single bracket [ is used to subset a list, the returned value is itself a list; when a double square bracket, [[, is used, the returned value is the object within that portion of the list. For example:

## Returns a list of length one
my_list[2]

## [[1]]
## [1] 1 2 3 4

class(my_list[2]) # this remains a list because of single bracket

## [1] "list"

## Returns the vector that is second element of list
my_list[[2]]

## [1] 1 2 3 4

class(my_list[[2]]) # a vector because of double bracket

## [1] "numeric"

As data frames are lists, this is true for them as well:

## Single bracket returns data.frame, in this case the second column
df[2]

##    y
## 1  6
## 2  7
## 3  8
## 4  9
## 5 10

class(df[2]) # data.frame

## [1] "data.frame"

## Double bracket returns vector
df[[2]]

## [1]  6  7  8  9 10

class(df[[2]]) # vector

## [1] "integer"

Note here that when no comma is used in the indices for a data frame, the index is used to identify a column.

Question #4: Use indices to print the Happiness score (column #3) of Hong Kong (row #57) in the object my_data (the Happy Planet data from Question #1).

$~$

Working with Data

Recall from earlier that a data frame is simply a list in which each of the variables is of the same length. Often, we will want to access a single one of these variables from a data set as a vector. R provides us with a wide variety of ways to do this. We will consider a few examples here using the Happy Planet dataset my_data, from which we will extract the variable Country.

The most common method for accessing a variable in a dataset is to use the $ operator:

my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")

# The $ accesses the variable named 'Country' within 'my_data'
countries <- my_data$Country

We can also subset using either of the methods discussed in the previous section, namely those with the [ or [[ operators:

# Use the name of the variable in place of an index position
# but don't forget to use a comma
countries2 <- my_data[, 'Country']

# Using the double `[[` operator
countries3 <- my_data[["Country"]]

Technically, we could also subset using the number of the column. In this case, since the Country variable is the first column, we could subset it as follows:

# Position indexing to access the variable 'Country' (since it's the first column)
countries4 <- my_data[, 1]

However, this is not recommended and is an example of something known as a magic number. There may be instances in which our data changes in unexpected ways, and it may be the case that by the time you try to access the first column of a dataset, the column could have changed to something else. By writing out the variable explicitly, we can be sure we are accessing what we intend to.
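To make the risk concrete, here is a minimal sketch using a made-up two-column data frame (toy and toy_reordered are hypothetical names, and the values are invented). After the columns are reordered, the position-based code silently returns a different variable, while the name-based code does not:

```r
# A small made-up data frame standing in for a real dataset
toy <- data.frame(Country = c("A", "B"), Happiness = c(5.5, 5.6))

# Suppose a later version of the data arrives with its columns reordered
toy_reordered <- toy[, c("Happiness", "Country")]

toy_reordered[, 1]       # now returns Happiness: 5.5 5.6
toy_reordered$Country    # still returns Country: "A" "B"
```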

Just as we may want to access a variable (column) of a dataset, often we will be interested in accessing specific observations (rows). We can do this by specifying the relevant index in the dataset. For example, to access the first observation from my_data, we could write the following:

Albania <- my_data[1, ] # This stores the entire first row
Albania

##   Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1 Albania      7       5.5           76.2       2.2 41.7 47.91      54
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15

Or to get the first five observations:

firstFive <- my_data[1:5, ] # This stores the first five rows
firstFive

##     Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1   Albania      7       5.5           76.2       2.2 41.7 47.91      54
## 2   Algeria      3       5.6           71.7       1.7 40.1 51.23      40
## 3    Angola      4       4.3           41.7       0.9 17.8 26.78     130
## 4 Argentina      1       7.1           74.8       2.5 53.4 58.95      15
## 5   Armenia      7       5.0           71.7       1.4 36.1 48.28      48
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15
## 2         7062 0.733      32.85
## 3         2335 0.446      16.10
## 4        14280 0.869      38.75
## 5         4945 0.775       3.02

Just as before, though, we want to avoid using hard-coded numbers if possible. We will explore how to do this in the next section.

There are a few functions that we might use when exploring a new data set. The first of these are head() and tail() which print out the first and last few rows of a data set (or elements of a vector) respectively. Here are a few other functions that may be useful:

dim(my_data) # prints the dimensions of 'my_data'

## [1] 143  11

nrow(my_data) # prints the number of rows of 'my_data'

## [1] 143

ncol(my_data) # prints the number of columns of 'my_data'

## [1] 11

colnames(my_data) # prints the names of the variables (columns) of 'my_data'

##  [1] "Country"        "Region"         "Happiness"      "LifeExpectancy"
##  [5] "Footprint"      "HLY"            "HPI"            "HPIRank"       
##  [9] "GDPperCapita"   "HDI"            "Population"
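The head() and tail() functions mentioned above also accept an argument n that controls how many elements or rows are returned. The sketch below uses a short vector and a made-up data frame (toy is a hypothetical name) rather than the Happy Planet data:

```r
x <- 5:10
head(x, n = 3)    # first three elements: 5 6 7
tail(x, n = 2)    # last two elements: 9 10

toy <- data.frame(id = 1:8, value = (1:8)^2)  # made-up data
head(toy, n = 2)  # first two rows
nrow(tail(toy))   # 6, because the default is n = 6
```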

Question #5 (Part 1): Write code that prints out the last three observations in the Happy Planet data set.

Question #5 (Part 2): Write code that finds the median value of life expectancy for the last 10 observations in the Happy Planet data (Hint: ?median)

Logical Conditions and Subsetting

As we mentioned in the last section, we often want to avoid subsetting our data by using numbers for row/column positions. Usually when we are subsetting data, we have a set of criteria in mind that describes the portion we wish to retain. While there are a number of ways we may go about doing this, one very common method involves making use of logical vectors which, as we saw previously, are vectors made up of only the values TRUE and FALSE.

A natural way of constructing logical vectors is with the use of logical operators, the main set of which is given below:

Operator	Description
`==`	equal to
`!=`	not equal to
`<`	less than (strict)
`<=`	less than or equal to
`>`	greater than (strict)
`>=`	greater than or equal to
`&`	and
`|`	or
`!`	not

To illustrate, let’s begin by considering a vector x which takes integer values from 5 to 10 and ask two questions: which values in the vector are less than 8 and which values are equal to 8, respectively.

x <- 5:10
x

## [1]  5  6  7  8  9 10

# A logical vector indicating for which indices this inequality is true
x < 8

## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

# A logical vector indicating for which values this equality is true
x == 8

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

Note that this is an example of a vectorized operation, meaning that for each element in the input x, we will receive a corresponding value in the output. We see this in the first expression x < 8, where the inequality is TRUE for the first three elements (5, 6, 7) and FALSE for the last three elements (8, 9, 10). Verify for yourself that the output makes sense for x == 8.

As with any result in R, we can assign the logical vector to a named object. We can then use that object inside square brackets to subset x, keeping only the values greater than 8. R returns the elements of x in positions where the logical vector is TRUE and drops those where it is FALSE.

# Here we assign the logical vector to a variable, "idx"
idx <- x > 8
idx

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

# We can use idx to subset x
x[idx]

## [1]  9 10

# We can also subset this without assigning to variable
x[x > 8]

## [1]  9 10

For situations in which we prefer to have indices instead of logical vectors, we can use the which() function, which returns the index of each position that is TRUE:

which(x > 8) # this is true for the 5th and 6th element of x

## [1] 5 6

Sometimes we will want to combine logical statements, which we can do with “and” (&) and “or” (|). The results of the operators match what we would typically expect using spoken language: when using &, a statement is TRUE only if both conditions are TRUE and is otherwise FALSE. Alternatively, when using |, a statement is TRUE if one or both conditions are TRUE. Here are a few examples:

# Both of these statements are TRUE
(3 > 2) & (4 > 1)

## [1] TRUE

(3 > 2) | (4 > 1)

## [1] TRUE

# One of these statements is FALSE
(3 > 2) & (4 < 1)

## [1] FALSE

(3 > 2) | (4 < 1)

## [1] TRUE

# Both of these statements are FALSE
(3 < 2) & (4 < 1)

## [1] FALSE

(3 < 2) | (4 < 1)

## [1] FALSE

We can apply compound conditions to a vector just as we did before, producing a logical vector that can be used to extract elements of x:

# Recall x is 5:10
x

## [1]  5  6  7  8  9 10

# See which values are TRUE and which are FALSE
x < 7 | x > 9

## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE
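Placing a compound condition inside the square brackets extracts the matching elements. The sketch below also uses the ! operator from the table above to negate a condition:

```r
x <- 5:10
outside <- x[x < 7 | x > 9]      # 5 6 10
inside <- x[!(x < 7 | x > 9)]    # 7 8 9
between <- x[x >= 7 & x <= 9]    # 7 8 9, an equivalent way to write the line above
outside
inside
between
```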

Subsetting with data

Once we are comfortable with using logical vectors to subset vectors, it is natural to extend this to data frames. For example, in the previous section we used a numeric index to subset the Happy Planet data in order to access the first observation for Albania. Now, we can be explicit in our request with the following:

my_data[my_data$Country == "Albania",   ]

##   Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1 Albania      7       5.5           76.2       2.2 41.7 47.91      54
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15

Question #6: Explain in a comment why the above code works. What is my_data$Country == "Albania", and how is it being used to subset my_data?

Because subsetting data frames based on logical conditions is such a common operation in R, a very useful function, subset(), is available to us. subset() primarily takes two arguments: the data frame (or other object) to be subset and a logical condition describing the rows to keep. Here, we create a subset of the Happy Planet data containing only those countries whose life expectancy is greater than 80:

sub1 <- subset(my_data, LifeExpectancy > 80)
sub1

##         Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 6     Australia      2       7.9           80.9       7.8 63.7 36.64     102
## 25       Canada      2       8.0           80.3       7.1 64.0 39.40      89
## 47       France      2       7.1           80.2       4.9 56.6 43.86      71
## 57    Hong Kong      6       7.2           81.9       5.7 58.6 41.60      84
## 59      Iceland      2       7.8           81.5       7.4 63.9 38.14      94
## 65       Israel      3       7.1           80.3       4.8 56.8 44.49      67
## 66        Italy      2       6.9           80.3       4.8 55.7 44.02      69
## 68        Japan      6       6.8           82.3       4.9 55.6 43.25      75
## 119       Spain      2       7.6           80.5       5.7 61.2 43.19      76
## 122      Sweden      2       7.9           80.5       5.1 63.2 47.99      53
## 123 Switzerland      2       7.7           81.3       5.0 62.6 48.05      52
##     GDPperCapita   HDI Population
## 6          31794 0.962      20.40
## 25         33375 0.961      32.31
## 47         30386 0.952      60.87
## 57         34833 0.937       6.81
## 59         36510 0.968       0.30
## 65         25864 0.932       6.92
## 66         28529 0.941      58.61
## 68         31267 0.953     127.77
## 119        27169 0.949      43.40
## 122        32525 0.956       9.02
## 123        35633 0.955       7.44

This works equally well for compound logical expressions:

sub2 <- subset(my_data, LifeExpectancy <= 70 & Happiness > 6)
sub2

##                 Country Region Happiness LifeExpectancy Footprint  HLY   HPI
## 14               Bhutan      5       6.1           64.7       1.0 39.7 58.50
## 15              Bolivia      1       6.5           64.7       2.1 42.1 49.35
## 52            Guatemala      1       7.4           69.7       1.5 51.8 68.37
## 54               Guyana      1       6.5           65.2       2.6 42.6 45.63
## 56             Honduras      1       7.0           69.4       1.8 48.7 60.99
## 70           Kazakhstan      7       6.1           65.9       3.4 40.4 38.54
## 75                 Laos      6       6.2           63.2       1.1 39.4 57.34
## 127            Thailand      6       6.3           69.6       2.1 43.5 50.90
## 129 Trinidad and Tobago      1       6.7           69.2       2.1 46.3 54.21
##     HPIRank GDPperCapita   HDI Population
## 14       17         3649 0.579       0.64
## 15       47         2819 0.695       9.18
## 52        4         4568 0.689      12.71
## 54       63         4508 0.750       0.74
## 56       10         3430 0.700       6.83
## 70       91         7857 0.794      15.15
## 75       19         2039 0.601       5.66
## 127      41         8677 0.781      63.00
## 129      30        14603 0.814       1.32

Notice that, unlike other instances of subsetting where we had to use $ to indicate that LifeExpectancy or Happiness were columns in my_data, the subset function was able to determine that on its own.
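To see how the two approaches line up, the sketch below applies the same condition with square brackets and with subset(). It uses a small made-up data frame (toy is a hypothetical name, with the same layout as the df used in the Indices section):

```r
toy <- data.frame(x = 1:5, y = 6:10, z = letters[1:5])  # made-up data

# Square brackets need toy$ in front of each column name
by_bracket <- toy[toy$y > 7 & toy$x < 5, ]

# subset() looks up y and x inside toy for us
by_subset <- subset(toy, y > 7 & x < 5)

by_subset
isTRUE(all.equal(by_bracket, by_subset))  # TRUE: both keep rows 3 and 4
```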

Question #7: Create a subset of my_data that contains all countries with a population greater than 100 million (the relevant column measures population in millions) that also have a happiness index of 6 or lower. Then determine the number of observations in this subset.