R
This is a short lab intended to introduce a few aspects of R and RStudio while practicing the format of future class meetings.
Directions (Please read before starting)
\(~\)
The “Preamble” section of labs is something we will work on together as an entire class.
The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:
Partners should switch roles throughout the “Lab” section. For the first few labs, the less experienced coder should spend more time as the driver.
\(~\)
After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels:
An R Script is like a text-file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.
The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).
Question #0: Create a blank R Script. You will use this R Script to record your answers to future questions in this document.
\(~\)
R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.
To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.
4 + 6 - (24/6)
## [1] 6
5 ^ 2 + 2 * 2
## [1] 29
The examples above demonstrate how R can be used as a calculator. However, most of the code we will write will rely upon functions, or pre-built units of code that translate one or more inputs into one or more outputs.
log(x = 4, base = 2)
## [1] 2
In the example above we input an “x” value of 4, and a “base” of 2. The labels given to these inputs, “x” and “base”, are the function’s arguments. The function returns the output “2”, which is \(\text{log}_2(4)\).
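As a small sketch of how arguments are matched, the calls below all use the same log() function. The object names (by_name, reordered, by_position, natural) are made up for illustration; the argument names “x” and “base” are the ones from the example above.

```r
# Arguments matched by name can be written in any order
by_name   <- log(x = 4, base = 2)
reordered <- log(base = 2, x = 4)

# Arguments without names are matched by position: x first, then base
by_position <- log(4, 2)

# Leaving out 'base' uses the function's default, the natural logarithm
natural <- log(4)

by_name
by_position
natural
```

Naming the arguments takes a little more typing, but it makes the code easier to read and protects against putting inputs in the wrong order.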
\(~\)
Over time you’ll end up memorizing the arguments of common R functions; however, while you’re learning I strongly encourage you to read the help documentation for any R function used in your code. You can access a function’s documentation by typing a ? in front of the function name and submitting it to the console.
?log
\(~\)
When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.
# This entire line is a comment and will do nothing if run
1:6 # The command "1:6" appears before this comment
## [1] 1 2 3 4 5 6
In your R Script, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your R Script, so long comments requiring multiple lines should each begin with their own “#”.
\(~\)
The remainder of the lab is to be completed by you and your lab partner. You should work at a pace that ensures both of you thoroughly understand the lab’s contents and examples.
\(~\)
An important part of data science is reproducibility, or the ability for two people to independently replicate a task using the same code.
To ensure reproducibility, every analysis begins by importing raw data into R and manipulating it using documented (commented) code. Further, the raw data should be imported using functions:
## Loading a CSV file from a web URL (storing it as "my_data")
my_data <- read.csv("https://some_webpage/some_data.csv")
## Loading a CSV file with a local file path
my_data <- read.csv("H:/path_to_my_data/my_data.csv")
Note that:
Either <- or = can be used to assign something to a named object, but <- is generally preferred for complex reasons we won’t get into in this course.
File paths should separate folders using / or \\. A single \ is used by R to start an instance of a special text character. For example, \n creates a new line in a string of text.
It’s also worth noting here that, in the most typical cases, a dataset will organize its data by having each observation constitute a single row, with each of the variables being a column.
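The notes above can be tried without any real data. The sketch below writes a tiny made-up data frame to a temporary CSV file and reads it back; the names toy, csv_path, and toy_again and the values inside them are hypothetical and exist only for this illustration.

```r
# A made-up data frame: each row is one observation, each column is one variable
toy <- data.frame(name = c("A", "B"), score = c(1.5, 2.5))

# tempfile() creates a file path that is valid on the current computer
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Reading the file back gives a data frame with the same rows and columns
toy_again <- read.csv(csv_path)
dim(toy_again)

# "\n" counts as one special character and starts a new line when shown with cat()
cat("first line\nsecond line\n")
n_chars <- nchar("a\nb")
n_chars
```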
Question #1: Add code to your script that uses the read.csv() function to create an object named my_data that contains the “Happy Planet” data stored at: https://remiller1450.github.io/data/HappyPlanet.csv
After running your Question #1 code, an entry named “my_data” should appear in the Environment panel (top right).
\(~\)
R stores data in containers called objects. Data is assigned into an object using <- or =. After assignment, data can be referenced using that object’s name. While there are a wide variety of different objects and types in R, we are going to focus here on three of the most common: vectors, data.frames, and lists.
The simplest objects are vectors, which contain one or more elements. A common task in R involves creating a value and assigning it to a named object that we can use in subsequent operations:
x <- 5 # This assigns the numeric value 5 to an object called 'x'
x^2 # We can now reference 'x'
## [1] 25
There are a number of ways to construct a vector, including using the : operator and, most commonly, the c() function:
x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)
## [1] 1 2 3
y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)
## [1] 1 2 3
z <- c("A","B","C") # Vectors can contain many types of values
print(z)
## [1] "A" "B" "C"
One of the most important characteristics of a vector is its class which indicates the types of elements contained in the vector as well as what types of functions are able to use it as input.
The three most important classes of vectors are:
numeric, for example x = c(1,2,3)
character, for example x = c("A","B","C")
logical, for example x = c(TRUE, FALSE, TRUE)
You should always know a vector’s class before using it, which you can determine using the class() function.
chars <- c("1","2","3") # Create a character vector
class(chars)
## [1] "character"
nums <- c(1,2,3) # Create a numeric vector
class(nums)
## [1] "numeric"
While vectors can be of different classes, all elements of a single vector must be of the same type. If the elements supplied are of different classes, they will be coerced to the broadest type within the vector; that is, numbers can be stored as characters, but characters cannot be stored as numbers:
mixed_vec <- c(1,2,3, "a", "b", "c") # has numeric and character elements
class(mixed_vec)
## [1] "character"
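A few more cases of coercion are sketched below. The object names are made up for illustration, and suppressWarnings() is used only to keep the last example quiet.

```r
# Logical values become numbers when mixed with numbers (TRUE -> 1, FALSE -> 0)
num_and_lgl <- c(1, TRUE, FALSE)
class(num_and_lgl)

# Numbers stored as text can be converted back explicitly with as.numeric()
digits_as_text <- c("1", "2", "3")
digits <- as.numeric(digits_as_text)
mean(digits)

# Text that does not look like a number becomes NA.
# as.numeric() would normally give a warning here; suppressWarnings() hides it.
not_a_number <- suppressWarnings(as.numeric("a"))
is.na(not_a_number)
```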
Most functions expect a certain type of input and will produce an error or a warning if the wrong class is given.
mean(chars) # This produces a warning and returns NA; mean() only works for numeric or logical vectors
## Warning in mean.default(chars): argument is not numeric or logical: returning
## NA
## [1] NA
mean(nums) # This works as intended
## [1] 2
Many R functions are vectorized, meaning they can accept a scalar input, for example 1, and return a scalar output, f(1), or they can accept a vector input, such as c(1,2,3), and return a vector c(f(1),f(2),f(3)). For example, sqrt() is vectorized:
nums <- c(1,2,3,4)
sqrt(nums)
## [1] 1.000000 1.414214 1.732051 2.000000
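Arithmetic operators are vectorized in the same way. The sketch below uses a made-up vector called vals; the other object names are also just for illustration.

```r
vals <- c(1, 4, 9, 16)

# A single number is applied to every element
doubled <- vals * 2

# Two vectors of the same length are combined element by element
shifted <- vals + c(10, 20, 30, 40)

# Vectorized functions can be nested inside functions that summarize a vector
roots_sum <- sum(sqrt(vals))

doubled
shifted
roots_sum
```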
Data are usually stored in objects called data.frames, which are composed of several vectors of the same length:
df <- data.frame(A = x, B = y, C = z) # Creates a data.frame object 'df'
print(df)
## A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
Note that this creates a new data.frame with column names “A”, “B”, and “C”. We can learn about the structure of a data.frame using the str() function:
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ A: int 1 2 3
## $ B: num 1 2 3
## $ C: chr "A" "B" "C"
Here, we see that df has three observations (rows) with three variables (columns) each, with the columns consisting of an integer vector, a numeric vector, and a character vector, respectively.
Finally, we have lists, the most flexible class in R. Lists are like vectors except their elements can be of any type:
# A list made up of a data.frame, a numeric vector of length 4, and a character vector of length 6
my_list <- list(df, nums, mixed_vec)
print(my_list)
## [[1]]
## A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] "1" "2" "3" "a" "b" "c"
str(my_list)
## List of 3
## $ :'data.frame': 3 obs. of 3 variables:
## ..$ A: int [1:3] 1 2 3
## ..$ B: num [1:3] 1 2 3
## ..$ C: chr [1:3] "A" "B" "C"
## $ : num [1:4] 1 2 3 4
## $ : chr [1:6] "1" "2" "3" "a" ...
It’s worth noting that a data.frame is a special case of a list in which all of the components are of the same length.
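One way to see this relationship is sketched below with a made-up data frame called toy_df. The try() call is there only so that the deliberate error does not stop the script.

```r
toy_df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# A data frame passes the test for being a list
is.list(toy_df)

# Its length as a list is its number of columns
length(toy_df)
names(toy_df)

# Columns whose lengths do not fit together are rejected with an error
bad <- try(data.frame(id = 1:3, label = c("a", "b")), silent = TRUE)
inherits(bad, "try-error")
```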
Question #2: Create a data frame named my_df containing two vectors, \(J\) and \(K\), where \(J\) is created using the seq function to be a sequence from 0 to 100 counting by 10 and \(K\) is created using the rep function to replicate the character string “XYZ” the proper number of times. Hint: read the help documentation for each function (seq and rep) to determine the necessary arguments.
Question #3: Use the str() function to inspect the structure of my_data (Happy Planet data). In a comment, indicate how many observations are in this data set and how many variables are of class numeric.
\(~\)
Often when dealing with data we will want to work with a subset of an object, or extract relevant portions of it. This is done with indices. For example, suppose we have a vector x and would like to extract the element in its second position and assign it to a new object named b:
x <- 5:10
b <- x[2]
b # second element of x is 6
## [1] 6
Square brackets, [ and ], are used to access a certain position (or multiple positions) within an object. In this example we access the second position of the object x, though we could also extract the second and fourth positions of x:
x[c(2,4)]
## [1] 6 8
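A few other index patterns are sketched below using a separate vector, v, so that x is left unchanged. Negative indices are not covered elsewhere in this lab and are included only as an extra illustration.

```r
v <- 5:10

# A range of positions: the second through fourth elements
middle <- v[2:4]

# A negative index drops that position instead of keeping it
without_first <- v[-1]

# length() gives the last position, so this extracts the final element
last_value <- v[length(v)]

middle
without_first
last_value
```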
Some objects, such as data frames, have multiple dimensions, requiring indices in each dimension (separated by commas) to describe a single element’s position. Indices in the first position indicate a row, while those in the second position indicate a column.
df <- data.frame(x = 1:5, y = 6:10, z = letters[1:5])
df
## x y z
## 1 1 6 a
## 2 2 7 b
## 3 3 8 c
## 4 4 9 d
## 5 5 10 e
df[2, ] # Everything in row 2
## x y z
## 2 2 7 b
df[, 3] # Everything in column 3
## [1] "a" "b" "c" "d" "e"
df[2,3] # The element in row 2, column 3
## [1] "b"
As was the case with vectors, we can use multiple indices to select multiple rows (or columns):
# Selects 2nd and 4th rows
df[c(2, 4), ]
## x y z
## 2 2 7 b
## 4 4 9 d
Although we will not work with many lists directly in this course, it is worth being aware of a special case when subsetting a list. Whenever the single bracket [ is used to subset a list, the returned value is itself a list; when a double square bracket, [[, is used, the returned value is the object within that portion of the list. For example:
## Returns a list of length one
my_list[2]
## [[1]]
## [1] 1 2 3 4
class(my_list[2]) # this remains a list because of single bracket
## [1] "list"
## Returns the vector that is second element of list
my_list[[2]]
## [1] 1 2 3 4
class(my_list[[2]]) # a vector because of double bracket
## [1] "numeric"
As data frames are lists, this is true for them as well:
## Single bracket returns data.frame, in this case the second column
df[2]
## y
## 1 6
## 2 7
## 3 8
## 4 9
## 5 10
class(df[2]) # data.frame
## [1] "data.frame"
## Double bracket returns vector
df[[2]]
## [1] 6 7 8 9 10
class(df[[2]]) # vector
## [1] "integer"
Note here that when no comma is used in the indices for a data frame, the index is used to identify a column.
Question #4: Use indices to print the Happiness score (column #3) of Hong Kong (row #57) in the object my_data (the Happy Planet data from Question #1).
\(~\)
Recall from earlier that a data frame is simply a list in which all of the variables are of the same length. Often, we will want to access a single one of these variables from a data set as a vector. R provides us with a wide variety of ways to do this. We will consider a few examples here using the Happy Planet dataset my_data, from which we will extract the variable Country.
The most common method for accessing a variable in a dataset is to use the $ operator:
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
# The $ accesses the variable named 'Country' within 'my_data'
countries <- my_data$Country
We can also subset using either of the methods discussed in the previous section, namely those with the [ or [[ operators:
# Use the name of the variable in place of an index position
# but don't forget to use a comma
countries2 <- my_data[, 'Country']
# Using the double `[[` operator
countries3 <- my_data[["Country"]]
Technically, we could also subset using the number of the column. In this case, since the Country variable is the first column, we could subset it as such:
# Position indexing to access the variable 'Country' (since it's the first column)
countries4 <- my_data[, 1]
However, this is not recommended and is an example of something known as a magic number. There may be instances in which our data changes in unexpected ways, and it may be the case that by the time you try to access the first column of a dataset, the column could have changed to something else. By writing out the variable explicitly, we can be sure we are accessing what we intend to.
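The risk can be seen in a small sketch with a made-up data frame called scores. Here the columns are reordered on purpose; in practice a change like this might come from an updated data file.

```r
scores <- data.frame(name = c("A", "B"), points = c(10, 20))

# Position 1 currently holds the names
by_position_before <- scores[, 1]

# Suppose the columns later arrive in a different order
reordered <- scores[, c("points", "name")]

# The same position now silently returns a different variable
by_position_after <- reordered[, 1]

# Asking for the column by name still returns what we intended
by_name_after <- reordered[, "name"]

by_position_before
by_position_after
by_name_after
```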
Just as we may want to access a variable (column) of a dataset, often we will be interested in accessing specific observations (rows). We can do this by specifying the relevant index in the dataset. For example, to access the first observation from my_data, we could write the following:
Albania <- my_data[1, ] # This stores the entire first row
Albania
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 1 Albania 7 5.5 76.2 2.2 41.7 47.91 54
## GDPperCapita HDI Population
## 1 5316 0.801 3.15
Or to get the first five observations:
firstFive <- my_data[1:5, ] # This stores the first five rows
firstFive
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 1 Albania 7 5.5 76.2 2.2 41.7 47.91 54
## 2 Algeria 3 5.6 71.7 1.7 40.1 51.23 40
## 3 Angola 4 4.3 41.7 0.9 17.8 26.78 130
## 4 Argentina 1 7.1 74.8 2.5 53.4 58.95 15
## 5 Armenia 7 5.0 71.7 1.4 36.1 48.28 48
## GDPperCapita HDI Population
## 1 5316 0.801 3.15
## 2 7062 0.733 32.85
## 3 2335 0.446 16.10
## 4 14280 0.869 38.75
## 5 4945 0.775 3.02
Just as before, though, we want to avoid using hard-coded numbers if possible. We will explore how to do this in the next section.
There are a few functions that we might use when exploring a new data set. The first of these are head() and tail(), which print out the first and last few rows of a data set (or elements of a vector), respectively. Here are a few other functions that may be useful:
dim(my_data) # prints the dimensions of 'my_data'
## [1] 143 11
nrow(my_data) # prints the number of rows of 'my_data'
## [1] 143
ncol(my_data) # prints the number of columns of 'my_data'
## [1] 11
colnames(my_data) # prints the names of the variables (columns) of 'my_data'
## [1] "Country" "Region" "Happiness" "LifeExpectancy"
## [5] "Footprint" "HLY" "HPI" "HPIRank"
## [9] "GDPperCapita" "HDI" "Population"
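Both head() and tail() accept an argument n that controls how many rows or elements are returned. The sketch below uses a made-up data frame, toy_rows, rather than the Happy Planet data.

```r
toy_rows <- data.frame(id = 1:8, value = (1:8)^2)

# The first two rows
first_two <- head(toy_rows, n = 2)

# The last three rows
last_three <- tail(toy_rows, n = 3)

# Both functions also work on vectors
last_two_values <- tail(1:8, n = 2)

first_two
last_three
last_two_values
```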
Question #5 (Part 1): Write code that prints out the last three observations in the Happy Planet data set.
Question #5 (Part 2): Write code that finds the median value of life expectancy for the last 10 observations in the Happy Planet data (Hint: ?median)
As we mentioned in the last section, we often want to avoid subsetting our data by using numbers for row/column positions. Usually when we are subsetting data, we have a set of criteria in mind that describes the portion we wish to retain. While there are a number of ways we may go about doing this, one very common method involves making use of logical vectors which, as we saw previously, are vectors made up of only the values TRUE and FALSE.
A natural way of constructing logical vectors is with the use of logical operators, the main set of which is given below:
Operator | Description |
---|---|
== | equal to |
!= | not equal to |
< | less than (strict) |
<= | less than or equal to |
> | greater than (strict) |
>= | greater than or equal to |
& | and |
\| | or |
! | not |
To illustrate, let’s begin by considering a vector x which takes integer values from 5 to 10 and ask two questions: which values in the vector are less than 8 and which values are equal to 8, respectively.
x <- 5:10
x
## [1] 5 6 7 8 9 10
# A logical vector indicating for which indices this inequality is true
x < 8
## [1] TRUE TRUE TRUE FALSE FALSE FALSE
# A logical vector indicating for which values this equality is true
x == 8
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
Note that this is an example of a vectorized operation, meaning that for each element in the input x, we will receive a corresponding value in the output. We see this in the first expression x < 8, where the inequality is TRUE for the first three elements (5, 6, 7) and FALSE for the last three elements (8, 9, 10). Verify for yourself that the output makes sense for x == 8.
As with all operations in R, we can assign the logical vector output to a named variable. We can use this trick to subset x to keep only values greater than 8. When a logical vector is placed inside the square brackets, R returns the elements of x in the positions where the logical vector is TRUE and drops those where it is FALSE.
# Here we assign the logical vector to a variable, "idx"
idx <- x > 8
idx
## [1] FALSE FALSE FALSE FALSE TRUE TRUE
# We can use idx to subset x
x[idx]
## [1] 9 10
# We can also subset this without assigning to variable
x[x > 8]
## [1] 9 10
For situations in which we prefer to have indices instead of logical vectors, we can use the which() function, which returns the index of each position that is TRUE:
which(x > 8) # this is true for the 5th and 6th element of x
## [1] 5 6
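The indices returned by which() can be used for subsetting in the same way as the logical vector itself. The sketch below uses a made-up vector, w; counting TRUE values with sum() is shown only as a related convenience.

```r
w <- c(3, 12, 7, 15)

# Positions of the elements greater than 10
big_idx <- which(w > 10)

# Subsetting by these positions gives the same result as the logical vector
from_index   <- w[big_idx]
from_logical <- w[w > 10]

# TRUE counts as 1 and FALSE as 0, so sum() counts how many elements qualify
how_many <- sum(w > 10)

big_idx
from_index
how_many
```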
Sometimes we will find ourselves in situations in which we want to combine logical statements, which we can do with “and” (&) and “or” (|). The results of the operators match what we would typically expect using spoken language: when using &, something is TRUE only if both conditions are TRUE and is otherwise FALSE. Alternatively, when using |, a statement is TRUE if one or both statements are TRUE. Here are a few examples:
# Both of these statements are TRUE
(3 > 2) & (4 > 1)
## [1] TRUE
(3 > 2) | (4 > 1)
## [1] TRUE
# One of these statements is FALSE
(3 > 2) & (4 < 1)
## [1] FALSE
(3 > 2) | (4 < 1)
## [1] TRUE
# Both of these statements are FALSE
(3 < 2) & (4 < 1)
## [1] FALSE
(3 < 2) | (4 < 1)
## [1] FALSE
We can combine these conditions just as we did before to build a logical vector for extracting elements of x:
# Recall x is 5:10
x
## [1] 5 6 7 8 9 10
# See which values are TRUE and which are FALSE
x < 7 | x > 9
## [1] TRUE TRUE FALSE FALSE FALSE TRUE
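Placing a compound condition inside the square brackets extracts the matching elements. The sketch below uses a copy of the same values stored in a separate vector, s; the other object names are made up for illustration.

```r
s <- 5:10

# Keep values that are less than 7 OR greater than 9
outer_vals <- s[s < 7 | s > 9]

# Keep values that are at least 7 AND at most 9
middle_vals <- s[s >= 7 & s <= 9]

# "!" reverses a condition, so this keeps everything the first line dropped
not_outer <- s[!(s < 7 | s > 9)]

outer_vals
middle_vals
not_outer
```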
Once we are comfortable with using logical vectors to subset vectors, it is natural to extend this to data frames. For example, in the previous section we used a numeric index to subset the Happy Planet data in order to access the first observation for Albania. Now, we can be explicit in our request with the following:
my_data[my_data$Country == "Albania", ]
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 1 Albania 7 5.5 76.2 2.2 41.7 47.91 54
## GDPperCapita HDI Population
## 1 5316 0.801 3.15
Question #6: Explain in a comment why the above code works. What is my_data$Country == "Albania", and how is it being used to subset my_data?
Because subsetting data frames based on logical conditions is such a common operation in R, we have available to us a very useful function, subset(). subset() works by taking primarily two arguments: a data frame or other object to be subset and a logical condition describing which rows to keep. Here, we create a subset of the Happy Planet data containing only those countries whose life expectancy is greater than 80:
sub1 <- subset(my_data, LifeExpectancy > 80)
sub1
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 6 Australia 2 7.9 80.9 7.8 63.7 36.64 102
## 25 Canada 2 8.0 80.3 7.1 64.0 39.40 89
## 47 France 2 7.1 80.2 4.9 56.6 43.86 71
## 57 Hong Kong 6 7.2 81.9 5.7 58.6 41.60 84
## 59 Iceland 2 7.8 81.5 7.4 63.9 38.14 94
## 65 Israel 3 7.1 80.3 4.8 56.8 44.49 67
## 66 Italy 2 6.9 80.3 4.8 55.7 44.02 69
## 68 Japan 6 6.8 82.3 4.9 55.6 43.25 75
## 119 Spain 2 7.6 80.5 5.7 61.2 43.19 76
## 122 Sweden 2 7.9 80.5 5.1 63.2 47.99 53
## 123 Switzerland 2 7.7 81.3 5.0 62.6 48.05 52
## GDPperCapita HDI Population
## 6 31794 0.962 20.40
## 25 33375 0.961 32.31
## 47 30386 0.952 60.87
## 57 34833 0.937 6.81
## 59 36510 0.968 0.30
## 65 25864 0.932 6.92
## 66 28529 0.941 58.61
## 68 31267 0.953 127.77
## 119 27169 0.949 43.40
## 122 32525 0.956 9.02
## 123 35633 0.955 7.44
This works equally well for compound logical expressions:
sub2 <- subset(my_data, LifeExpectancy <= 70 & Happiness > 6)
sub2
## Country Region Happiness LifeExpectancy Footprint HLY HPI
## 14 Bhutan 5 6.1 64.7 1.0 39.7 58.50
## 15 Bolivia 1 6.5 64.7 2.1 42.1 49.35
## 52 Guatemala 1 7.4 69.7 1.5 51.8 68.37
## 54 Guyana 1 6.5 65.2 2.6 42.6 45.63
## 56 Honduras 1 7.0 69.4 1.8 48.7 60.99
## 70 Kazakhstan 7 6.1 65.9 3.4 40.4 38.54
## 75 Laos 6 6.2 63.2 1.1 39.4 57.34
## 127 Thailand 6 6.3 69.6 2.1 43.5 50.90
## 129 Trinidad and Tobago 1 6.7 69.2 2.1 46.3 54.21
## HPIRank GDPperCapita HDI Population
## 14 17 3649 0.579 0.64
## 15 47 2819 0.695 9.18
## 52 4 4568 0.689 12.71
## 54 63 4508 0.750 0.74
## 56 10 3430 0.700 6.83
## 70 91 7857 0.794 15.15
## 75 19 2039 0.601 5.66
## 127 41 8677 0.781 63.00
## 129 30 14603 0.814 1.32
Notice that unlike other instances of subsetting, where we had to use $ to indicate that LifeExpectancy or Happiness were columns in my_data, the subset() function was able to determine that on its own.
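To connect subset() with the bracket approach from earlier, the sketch below applies the same condition both ways to a made-up data frame, toy_countries (its values are invented for illustration). The select argument of subset(), which keeps only the named columns, is shown as an optional extra.

```r
toy_countries <- data.frame(
  country = c("A", "B", "C", "D"),
  life    = c(82, 65, 81, 70),
  happy   = c(7.5, 6.2, 5.9, 4.8)
)

# With subset(), column names can be written without the $ operator
with_subset <- subset(toy_countries, life > 80 & happy > 6)

# With brackets, each column must be referenced through the data frame
with_bracket <- toy_countries[toy_countries$life > 80 & toy_countries$happy > 6, ]

# Both approaches keep the same rows
with_subset$country
with_bracket$country

# 'select' limits which columns are kept
long_lived <- subset(toy_countries, life > 80, select = c(country, life))
long_lived
```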
Question #7: Create a subset of my_data that contains all countries with a population greater than 100 million (the relevant column measures population in millions) that also have a happiness index of 6 or lower. Then determine the number of observations in this subset.