This lab is going to cover functions, loops and conditionals, and files, all of which serve important roles in the R language, as well as in data science more generally.
Functions, in their most essential definition, can be thought of as mini computer programs: pre-packaged snippets of code that assist us in doing common or repetitive tasks. Thus far in the course, we have used functions extensively, for tasks ranging from the complex, such as generating beautifully rendered ggplots, to the simple, such as computing the arithmetic mean. Most of the functions we have used take arguments, or input values, though this is not always the case. While there are many reasons to use functions, perhaps the most important for us is to standardize repetitive tasks. If we find ourselves copying and pasting code over and over again to do something similar, this is a hint that we need to write a function. Functions also help when there is an error in our code: with the copy-and-paste method, any error we find has to be corrected in each place it appears, whereas with a function, we only need to fix the error once, inside the function.
Functions in R are similar to functions in other programming languages, with the more important differences being beyond the scope of this class. In this portion of the lab, we will familiarize ourselves with how to create functions and use them in our own work.
Functions in R primarily consist of three components, which appear in order in a function definition: an assignment with <- to a name, the keyword function() with any arguments listed inside the parentheses, and the body of the function enclosed in curly brackets {}. Here, for example, is a function that computes the sum of squares of two inputs, x and y:
sum_of_squares <- function(x, y) {
  x^2 + y^2
}
sum_of_squares(x = 2, y = 3)
## [1] 13
We can also write functions that take default arguments. For example, in the sum_of_squares() function, if we do not provide both x and y, we will get an error:
sum_of_squares(x = 2)
## Error in sum_of_squares(x = 2): argument "y" is missing, with no default
We can rewrite our function so that we have a default argument with y = 3. Note that the default only applies when an argument for y is not given; if we do supply an argument to y, the default will be ignored.
sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}
# Using the default argument
sum_of_squares_default(x = 2)
## [1] 13
# Providing our own
sum_of_squares_default(x = 2, y = 10)
## [1] 104
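As a small sketch of how arguments are matched, the calls below reuse sum_of_squares_default() from above (redefined here so the block runs on its own). Arguments can be matched by position or by name, and named arguments may appear in any order. The variable names by_position, by_name, with_default, and on_vector are only illustrative.

```r
# Redefined from the text above so this block is self-contained
sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}

# Matched by position: x = 2, y = 10
by_position <- sum_of_squares_default(2, 10)

# Matched by name, in any order
by_name <- sum_of_squares_default(y = 10, x = 2)

# Only x is supplied, so y falls back to its default of 3
with_default <- sum_of_squares_default(2)

# The arithmetic is vectorized, so x may also be a vector
on_vector <- sum_of_squares_default(x = c(1, 2, 3))
```

Naming the arguments is usually the safer habit, since the call still works if the order of the arguments is ever changed.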
Question 1: Recall that a previous homework introduced both the police and college datasets.
police <- read.csv("https://remiller1450.github.io/data/Police2019.csv")
college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Your goal is to write a function called top_table that takes a character vector and returns the names of the values with the top five occurrences. Recall, this is precisely what was done in Homework 1. Then, verify that it works by printing out the top five states in both the police and college datasets. Your results should look like this:
top_table(police$state)
## v
## CA TX FL AZ CO
## 825 496 369 259 204
top_table(college$State)
## v
## NY PA CA TX MA
## 127 116 105 83 67
Question 2: Modify top_table so that it takes an additional argument n that allows you to specify that you want to view the top n values in each vector. Here, for example, we print the top 10:
top_table(police$state, n = 10)
## v
## CA TX FL AZ CO GA OK NC OH WA
## 825 496 369 259 204 189 170 163 159 156
top_table(college$State, n = 10)
## v
## NY PA CA TX MA OH IL FL NC GA
## 127 116 105 83 67 66 59 54 54 50
Here, we introduce the concept of loops and conditionals, both of which are common in other programming languages.
The concept of a loop is straightforward enough: given a vector of values, we will loop through each one, performing some operation. Similar to functions, loops have a particular syntax to initiate a loop and then enclose the resulting expression between curly brackets.
Loops begin with a syntax that takes the form for (i in 1:10), which we read as “for the variable i in the vector 1 to 10…”. In this case, the loop will step through the vector 1:10, in each iteration assigning the variable i first the number 1, then 2, continuing until it has exhausted the vector. A simple loop may look like this:
library(stringr)
for (i in 1:10) {
  print(str_c("The variable 'i' is now equal to: ", i, collapse = ""))
}
## [1] "The variable 'i' is now equal to: 1"
## [1] "The variable 'i' is now equal to: 2"
## [1] "The variable 'i' is now equal to: 3"
## [1] "The variable 'i' is now equal to: 4"
## [1] "The variable 'i' is now equal to: 5"
## [1] "The variable 'i' is now equal to: 6"
## [1] "The variable 'i' is now equal to: 7"
## [1] "The variable 'i' is now equal to: 8"
## [1] "The variable 'i' is now equal to: 9"
## [1] "The variable 'i' is now equal to: 10"
Note: we often have to use the print() function inside of a loop if we wish to print the output to the console.
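The note above can be checked with a short sketch. It uses capture.output() from the utils package, which is not otherwise part of this lab, purely as a way to record what would have been printed to the console. The names silent and printed are illustrative.

```r
# A bare expression inside a loop is evaluated but not printed
silent <- capture.output(
  for (i in 1:3) {
    i^2
  }
)

# Wrapping the expression in print() sends it to the console
printed <- capture.output(
  for (i in 1:3) {
    print(i^2)
  }
)

length(silent)   # no lines of output were produced
printed          # one line of output per iteration
```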
Of course, we do not have to use the variable i for this; any other variable name will do. Perhaps strangely, we do not even need to loop over a sequence of consecutive integers. We can loop through the values of any vector by using it in place of 1:10:
vec <- c(1,3,5,7,9)
for (v in vec) {
  print(v)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
Finally, we can combine the logic of both of these operations to loop through a sequence of positions while still accessing the values stored in a vector. First, we will use the seq_along() function, which creates a sequential vector the same length as its argument:
vec <- c(1,3,5,7,9)
seq_along(vec)
## [1] 1 2 3 4 5
and combine it with the positional indices that we learned about in Lab 1:
vec <- c(1,3,5,7,9)
# Loop along the positions of vec
for (i in seq_along(vec)) {
  print(str_c("The ", i, " position of the vector 'vec' is: ", vec[i], collapse = ""))
}
## [1] "The 1 position of the vector 'vec' is: 1"
## [1] "The 2 position of the vector 'vec' is: 3"
## [1] "The 3 position of the vector 'vec' is: 5"
## [1] "The 4 position of the vector 'vec' is: 7"
## [1] "The 5 position of the vector 'vec' is: 9"
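One reason to prefer seq_along() over writing 1:length(vec) by hand shows up when the vector is empty. The sketch below uses an empty numeric vector (the names empty, bad_index, good_index, and n_iterations are illustrative) to show the difference.

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts down and yields the two values 1 and 0
bad_index <- 1:length(empty)

# seq_along(empty) has length zero, so a loop over it never runs
good_index <- seq_along(empty)

n_iterations <- 0
for (i in seq_along(empty)) {
  n_iterations <- n_iterations + 1
}

bad_index
length(good_index)
n_iterations
```

A loop written with 1:length(empty) would run twice and index positions that do not exist, while the seq_along() version correctly does nothing.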
This is useful, for example, if we wish to loop along the columns of a data frame:
df <- data.frame(x = 1:5,
                 y = 6:10,
                 z = 11:15)
df
## x y z
## 1 1 6 11
## 2 2 7 12
## 3 3 8 13
## 4 4 9 14
## 5 5 10 15
# mean of each column
for (i in 1:ncol(df)) {
  print(mean(df[, i]))
}
## [1] 3
## [1] 8
## [1] 13
Finally, if we wish to save the output of a loop, we can do so by creating a vector that is the same length as the vector we are iterating over and assigning the results to each position. For a numeric vector, we can create this with the function numeric():
df <- data.frame(x = 1:5,
                 y = 6:10,
                 z = 11:15)
# Creates a numeric vector of zeros, one element per column
results <- numeric(length = ncol(df))
for (i in 1:ncol(df)) {
  # ith value of results is mean of ith column of df
  results[i] <- mean(df[, i])
}
results
## [1] 3 8 13
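A small variation on the same idea, offered as a sketch rather than the only way to do it: after filling the results vector, we can attach the column names so that each mean is labeled. The name col_means is illustrative; df is the same data frame as above.

```r
df <- data.frame(x = 1:5,
                 y = 6:10,
                 z = 11:15)

# Pre-allocate one slot per column
col_means <- numeric(length = ncol(df))

for (i in seq_along(df)) {
  # df[[i]] extracts the ith column as a plain vector
  col_means[i] <- mean(df[[i]])
}

# Attach the column names so each mean is labeled
names(col_means) <- names(df)
col_means
```

Here seq_along(df) steps over the columns because a data frame's length is its number of columns.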
Conditionals, like loops, are basic control structures in R and, like functions, involve a special syntax followed by expressions enclosed in curly brackets. Conditionals take a logical value (TRUE or FALSE) and, based on this value, evaluate one expression or the other. The special syntax used in conditionals is if (), else if (), and else. Both if and else if take an argument that may be true or false and will evaluate their expression accordingly; else, on the other hand, will evaluate only if none of the previous conditions were met.
We can best see how these work with a simple example.
d <- 5
if (d > 7) {
  print("yay")
} else if (d < 4) {
  print("boo")
} else {
  print("hiss")
}
## [1] "hiss"
The only syntax needed to perform a conditional is the if statement. If the condition fails to be true and there is no else, the block is simply skipped.
## This is ignored, there is no else
a <- 4
if (a > 5) {
  a <- a^2
}
a
## [1] 4
# This one is not ignored
b <- 5
if (b > 3) {
  b <- b^2 + 5
}
b
## [1] 30
Lastly, one can create as many else if conditions as their heart desires:
a <- 4
if (a == 1) {
  print("a is 1")
} else if (a == 2) {
  print("a is 2")
} else if (a == 3) {
  print("a is 3")
} else if (a == 4) {
  print("a is 4")
} else {
  print("I don't know what 'a' is")
}
## [1] "a is 4"
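Loops and conditionals are often used together. The sketch below applies the same yay/boo/hiss logic from the earlier example to each element of a vector and saves the result, using character() to pre-allocate the output. The names values and labels are illustrative.

```r
values <- c(2, 9, 5)

# Pre-allocate a character vector to hold one label per value
labels <- character(length = length(values))

for (i in seq_along(values)) {
  # Each condition is a single TRUE or FALSE, tested one element at a time
  if (values[i] > 7) {
    labels[i] <- "yay"
  } else if (values[i] < 4) {
    labels[i] <- "boo"
  } else {
    labels[i] <- "hiss"
  }
}

labels
```

Indexing with values[i] matters here: the condition inside if () is expected to be a single logical value, not a whole vector of them.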
Question 3: Write a function called long_square that takes a single numeric vector as its argument n. If the length of the vector is greater than the square root of the sum of all the numbers in the vector, the function should print "long!". Otherwise, it should print "not long!". Verify that it works for the arguments x1, x2, and x3.
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(5, 8, 10, 12)
x3 <- c(2, 5, 9, 10, 1, 1, 1)
long_square(n = x1)
## [1] "long!"
long_square(n = x2)
## [1] "not long!"
long_square(n = x3)
## [1] "long!"
Question 4: Here, we are going to modify the top_table function we wrote in Question 1 once more. In addition to having a second argument n indicating the number of results, we now want to include a third argument top which takes either TRUE or FALSE. When TRUE, the function should return the top n values; when FALSE, it should return the bottom n.
top_table(police$state, n = 5, top = TRUE)
## v
## CA TX FL AZ CO
## 825 496 369 259 204
top_table(college$State, n = 10, top = FALSE)
## v
## GU VI WY AK DE HI NV ID NM ND
## 1 1 1 2 5 6 6 7 7 8
Although much of the work we have done so far has involved reading data into our R session and manipulating it within the R environment, we often need to work with the computer's file system more directly. While R provides very powerful tools for manipulating the underlying file system, we limit our attention today to working with files stored locally on our machine.
To begin, we can identify the directory that we are currently working in (the directory in which your Rmd file is saved) with the function getwd():
getwd()
## [1] "/home/collin/courses/sta230/labs"
We can see a list of sub-directories with list.dirs() and move to any one of them with the setwd() function:
## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/")
## Recursively lists all the directories
list.dirs()
## [1] "." "./hwk" "./sim"
## Move into the directory called sim
setwd("sim")
## Now we are in fundir/sim
getwd()
## [1] "/home/collin/courses/sta230/labs/fundir/sim"
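When changing directories in a script, it can help to remember where we started so that we can return afterwards. The sketch below relies on the fact that setwd() invisibly returns the previous working directory. The directory fundir_demo is a hypothetical scratch folder created inside R's temporary directory just for this example.

```r
# Hypothetical scratch directory created just for this example
scratch <- file.path(tempdir(), "fundir_demo")
dir.create(file.path(scratch, "sim"), recursive = TRUE, showWarnings = FALSE)

# setwd() invisibly returns the directory we were in before the move
old_wd <- setwd(scratch)

# Paths are reported relative to the new working directory
sub_dirs <- list.dirs()

# Move back to where we started
setwd(old_wd)

sub_dirs
```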
Once we are in a directory, we can list all of the files in that directory with list.files(). Conveniently, list.files() has a pattern argument allowing us to specify a regular expression to indicate which files we want to list.
## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")
# List all of the files in fundir/sim
list.files()
## [1] "sim_1.csv" "sim_10.csv" "sim_2.csv" "sim_3.csv" "sim_4.csv"
## [6] "sim_5.csv" "sim_6.csv" "sim_7.csv" "sim_8.csv" "sim_9.csv"
## [11] "sim.R"
# List all of the files in fundir/sim with .csv extension
list.files(pattern = "csv")
## [1] "sim_1.csv" "sim_10.csv" "sim_2.csv" "sim_3.csv" "sim_4.csv"
## [6] "sim_5.csv" "sim_6.csv" "sim_7.csv" "sim_8.csv" "sim_9.csv"
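Because pattern is a regular expression, "csv" matches those three letters anywhere in a file name, not only as an extension. The sketch below illustrates the difference in a hypothetical scratch directory (pattern_demo, created inside R's temporary directory) containing a file named csv_notes.txt; the file names are made up for this example.

```r
# Hypothetical scratch directory with three empty files
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("sim_1.csv", "sim_2.csv", "csv_notes.txt"))))

# "csv" matches anywhere in the file name, so the .txt file is included
loose <- list.files(demo_dir, pattern = "csv")

# "\\.csv$" matches a literal dot followed by csv at the end of the name
strict <- list.files(demo_dir, pattern = "\\.csv$")

loose
strict
```

In the fundir/sim example the two patterns return the same files, so the simpler one works there; the stricter pattern is only needed when other file names happen to contain "csv".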
list.files() returns a character vector, the contents of which can be used as arguments to other functions. For example, we have used read.csv() several times in this course, and now we can call it by passing variables as arguments:
## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")
my_sims <- list.files(pattern = "csv")
# Read the first simulation data in
sim1 <- read.csv(my_sims[1])
# Print the first simulation
sim1
## V1
## 1 -1.36856
## 2 0.46344
## 3 1.87452
## 4 -0.25650
## 5 -1.41324
## 6 0.83684
## 7 -0.15509
## 8 0.15707
## 9 0.19777
## 10 -0.43873
We can combine this with other things we have learned in this lab in order to create a vector with the mean value from each of our simulations:
## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")
# Create a numeric vector the same length as `my_sims` vector
mean_vec <- numeric(length = length(my_sims))
for (i in seq_along(mean_vec)) {
  # First read in the ith simulation
  dat <- read.csv(my_sims[i])
  # Compute the mean and save it to the ith spot here
  mean_vec[i] <- mean(dat$V1)
}
mean_vec
## [1] -0.010247 0.509078 -0.508938 -0.145795 -0.498174 -0.345951 0.036010
## [8] -0.066167 0.328893 0.097149
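The same pattern can be written without calling setwd() at all. The sketch below is self-contained: it first writes three small made-up CSV files to a hypothetical scratch directory (sim_demo, inside R's temporary directory) standing in for fundir/sim, then uses the full.names argument of list.files() to get paths that read.csv() can open from anywhere.

```r
# Hypothetical scratch directory standing in for fundir/sim
sim_dir <- file.path(tempdir(), "sim_demo")
dir.create(sim_dir, showWarnings = FALSE)

# Write three small files with a single column V1, as in the text
write.csv(data.frame(V1 = c(1, 2, 3)), file.path(sim_dir, "sim_1.csv"), row.names = FALSE)
write.csv(data.frame(V1 = c(4, 5, 6)), file.path(sim_dir, "sim_2.csv"), row.names = FALSE)
write.csv(data.frame(V1 = c(7, 8, 9)), file.path(sim_dir, "sim_3.csv"), row.names = FALSE)

# full.names = TRUE returns paths that include the directory,
# so the files can be read without changing the working directory
sim_paths <- list.files(sim_dir, pattern = "\\.csv$", full.names = TRUE)

sim_means <- numeric(length = length(sim_paths))
for (i in seq_along(sim_paths)) {
  dat <- read.csv(sim_paths[i])
  sim_means[i] <- mean(dat$V1)
}

sim_means
```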
Question 5: For this lab, download these files and unzip them in the same directory as your lab Rmd file. Contained within this zipped folder are the results of an experiment that we conducted on 30 people in three different experimental groups. The data files themselves only contain the results from the experiment; information on the subjects themselves is stored as metadata in the file name. For example, the file PatriciaFexpA.csv contains the results for the subject named Patricia, sex Female, in experimental group A. Similarly, RobertMexpC.csv contains data for Robert, sex Male, in experimental group C. Each file contains 5 observations; however, we are only interested in retaining the mean value of these observations for each subject.
Use this information along with the data in the zipped folder to recreate the plot below.
Hint: Just like numerics, character vectors can be created with character(length = n)
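As a minimal sketch of the hint, the block below pre-allocates a character vector and fills it inside a loop, using the two file names mentioned in the question. It only strips the file extension; the names files and ids are illustrative, and how to pull the remaining metadata out of each name is left to you.

```r
# The two file names mentioned in the question
files <- c("PatriciaFexpA.csv", "RobertMexpC.csv")

# character(length = n) creates n empty strings, ready to be filled in
ids <- character(length = length(files))

for (i in seq_along(files)) {
  # Drop the ".csv" extension from the ith file name
  ids[i] <- sub("\\.csv$", "", files[i])
}

ids
```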