Lab

This lab is going to cover functions, loops and conditionals, and files, all of which serve important roles in the R language, as well as in data science more generally.

Functions

Functions can be thought of as mini computer programs: pre-packaged snippets of code that help us carry out common or repetitive tasks. We have used functions extensively so far in the course, for tasks ranging from rendering ggplots to computing the arithmetic mean. Most of the functions we have used take one or more arguments (inputs), though this is not always the case. While there are many reasons to use functions, perhaps the most important for us is to standardize repetitive tasks. If we find ourselves copying and pasting code over and over to do something similar, that is a hint that we should write a function. Functions also make errors easier to fix: with copied-and-pasted code, an error has to be corrected in every place the code appears, whereas with a function we only need to fix it once, inside the function.

Functions in R are similar to functions in other programming languages; the more important differences are beyond the scope of this class. In this portion of the lab, we will familiarize ourselves with how to create functions and use them in our own work.

Functions in R primarily consist of three components:

  1. The name of the function
  2. The arguments of the function
  3. The body of the function

A function in R is created by assigning to a name with <-, using the keyword function() with the arguments listed inside the parentheses, followed by the body of the function enclosed in curly brackets {}. Here, for example, is a function that computes the sum of squares of two inputs, x and y:

sum_of_squares <- function(x, y) {
  x^2 + y^2
}

sum_of_squares(x = 2, y = 3)
## [1] 13
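
To connect this back to the copy-and-paste motivation above, here is a minimal sketch. The vectors heights and weights and the function name center are made up for illustration and do not come from the course data.

```r
# Hypothetical data, invented for this illustration
heights <- c(60, 65, 70)
weights <- c(120, 150, 180)

# Copy-and-paste version: the same logic is typed twice
heights_centered <- heights - mean(heights)
weights_centered <- weights - mean(weights)

# Function version: the logic lives in one place
center <- function(x) {
  x - mean(x)
}

center(heights)
## [1] -5  0  5
center(weights)
## [1] -30   0  30
```

If we later decide to divide by the standard deviation as well, only the body of center() has to change.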

We can also write functions that take default arguments. For example, in the sum_of_squares() function, if we do not provide both an x and y, we will get an error:

sum_of_squares(x = 2)
## Error in sum_of_squares(x = 2): argument "y" is missing, with no default

We can rewrite our function so that y has a default value of 3. Note that the default only applies when an argument for y is not given; if we do supply an argument for y, the default will be ignored.

sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}

# Using the default argument
sum_of_squares_default(x = 2)
## [1] 13
# Providing our own
sum_of_squares_default(x = 2, y = 10)
## [1] 104
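
The calls above name every argument. As a small aside, and assuming the same sum_of_squares_default() definition, R also matches unnamed arguments by position, and named arguments can be given in any order:

```r
sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}

# Unnamed arguments are matched by position: x = 2, y = 10
sum_of_squares_default(2, 10)
## [1] 104

# Named arguments can appear in any order
sum_of_squares_default(y = 10, x = 2)
## [1] 104
```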

Question 1 Recall from previous homework that we had introduced both the police and college datasets.

police <- read.csv("https://remiller1450.github.io/data/Police2019.csv")
college <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Your goal is to write a function called top_table that takes a character vector and returns the names of the values with the top five occurrences. Recall, this is precisely what was done in Homework 1. Then, verify that it works by printing out the top five states in both the police and college datasets. Your results should look like this:

top_table(police$state)
## v
##  CA  TX  FL  AZ  CO 
## 825 496 369 259 204
top_table(college$State)
## v
##  NY  PA  CA  TX  MA 
## 127 116 105  83  67

Question 2 Modify top_table so that it takes an additional argument n that allows you to specify that you want to view the top n values in each vector. Here, for example, we print the top 10:

top_table(police$state, n = 10)
## v
##  CA  TX  FL  AZ  CO  GA  OK  NC  OH  WA 
## 825 496 369 259 204 189 170 163 159 156
top_table(college$State, n = 10)
## v
##  NY  PA  CA  TX  MA  OH  IL  FL  NC  GA 
## 127 116 105  83  67  66  59  54  54  50

Loops and Conditionals

Here, we introduce the concept of loops and conditionals, both of which are common in other programming languages.

Loops

The concept of a loop is straightforward enough: given a vector of values, we will loop through each one, performing some operation. Similar to functions, loops have a particular syntax to initiate a loop and then enclose the resulting expression between curly brackets.

A loop begins with syntax of the form for (i in 1:10), which we read as “for the variable i in the vector 1 to 10…”. In this case, the loop will step through the vector 1:10, in each iteration assigning the variable i first the number 1, then 2, and so on until it has exhausted the vector. A simple loop may look like this:

library(stringr)
for (i in 1:10) {
  print(str_c("The variable 'i' is now equal to: ", i, collapse = ""))
}
## [1] "The variable 'i' is now equal to: 1"
## [1] "The variable 'i' is now equal to: 2"
## [1] "The variable 'i' is now equal to: 3"
## [1] "The variable 'i' is now equal to: 4"
## [1] "The variable 'i' is now equal to: 5"
## [1] "The variable 'i' is now equal to: 6"
## [1] "The variable 'i' is now equal to: 7"
## [1] "The variable 'i' is now equal to: 8"
## [1] "The variable 'i' is now equal to: 9"
## [1] "The variable 'i' is now equal to: 10"

Note: we often have to use the print() function inside of a loop if we wish to print the output to the console.
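
A quick way to see why print() is needed, sketched with base R only: at the top level of the console a bare value is printed automatically, but inside a loop it is not.

```r
# Nothing is printed: the value of i is computed and discarded
for (i in 1:3) {
  i
}

# Each value is printed because we ask for it explicitly
for (i in 1:3) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
```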

Of course, we do not have to name the loop variable i; any other variable name will do. Perhaps more surprisingly, we do not need to loop over a sequence of consecutive integers either. We can loop through the values of any vector by using it in place of 1:10:

vec <- c(1,3,5,7,9)
for (v in vec) {
  print(v)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
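
The same pattern works for vectors that are not numeric. Here is a small sketch with a made-up character vector (the name fruits is only for illustration):

```r
fruits <- c("apple", "banana", "cherry")
for (f in fruits) {
  # nchar() counts the characters in each string
  print(nchar(f))
}
## [1] 5
## [1] 6
## [1] 6
```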

Finally, we can combine these two ideas, looping over a sequence of positions while still accessing the values stored in a vector. First, we will use the seq_along() function, which creates a vector of positions (1, 2, 3, ...) the same length as its argument:

vec <- c(1,3,5,7,9)
seq_along(vec)
## [1] 1 2 3 4 5

and combine it with the positional indices that we learned about in Lab 1:

vec <- c(1,3,5,7,9)

# Loop along the positions of vec
for (i in seq_along(vec)) {
  print(str_c("The ", i, " position of the vector 'vec' is: ", vec[i], collapse = ""))
}
## [1] "The 1 position of the vector 'vec' is: 1"
## [1] "The 2 position of the vector 'vec' is: 3"
## [1] "The 3 position of the vector 'vec' is: 5"
## [1] "The 4 position of the vector 'vec' is: 7"
## [1] "The 5 position of the vector 'vec' is: 9"
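
One reason to prefer seq_along(vec) over 1:length(vec) is sketched below with a hypothetical empty vector. When the vector has length zero, 1:length(vec) counts down from 1 to 0, so a loop over it would run twice, while seq_along(vec) is empty and the loop body never runs.

```r
empty <- numeric(0)

1:length(empty)
## [1] 1 0
seq_along(empty)
## integer(0)

# The body of this loop is never executed
for (i in seq_along(empty)) {
  print("this line is never reached")
}
```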

This is useful, for example, if we wish to loop along the columns of a data frame

df <- data.frame(x = 1:5, 
                 y = 6:10, 
                 z = 11:15)
df
##   x  y  z
## 1 1  6 11
## 2 2  7 12
## 3 3  8 13
## 4 4  9 14
## 5 5 10 15
# mean of each column
for (i in 1:ncol(df)) {
  print(mean(df[, i]))
}
## [1] 3
## [1] 8
## [1] 13
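
We can also loop over the column names instead of the column positions, which makes it easy to label the output. This is a sketch using the same df; names(), paste0() (a base R analogue of str_c()) and the double-bracket df[[nm]] are all base R, and the variable name nm is arbitrary.

```r
df <- data.frame(x = 1:5,
                 y = 6:10,
                 z = 11:15)

for (nm in names(df)) {
  # df[[nm]] extracts the column called nm as a vector
  print(paste0("The mean of column ", nm, " is: ", mean(df[[nm]])))
}
## [1] "The mean of column x is: 3"
## [1] "The mean of column y is: 8"
## [1] "The mean of column z is: 13"
```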

If we wish to save the output of a loop, we can do so by creating a vector that is the same length as the vector we are iterating over and assigning each result to the corresponding position. For a numeric vector, we can create this with the function numeric():

df <- data.frame(x = 1:5, 
                 y = 6:10, 
                 z = 11:15)

# Creates a numeric vector of zeros, one for each column
results <- numeric(length = ncol(df))

for (i in 1:ncol(df)) {
  # ith value of results is mean of ith column of df
  results[i] <- mean(df[, i])
}

results
## [1]  3  8 13
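
Two optional refinements, offered as a sketch rather than as part of the lab: we can label the saved results with names(), and we can check the loop against the base R function colMeans(), which computes the same column means in a single call.

```r
df <- data.frame(x = 1:5,
                 y = 6:10,
                 z = 11:15)

results <- numeric(length = ncol(df))
for (i in seq_along(results)) {
  results[i] <- mean(df[, i])
}

# Attach the column names so each mean is labeled
names(results) <- names(df)
results
##  x  y  z 
##  3  8 13

# colMeans() gives the same answer without a loop
all.equal(results, colMeans(df))
## [1] TRUE
```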

Conditionals

Conditionals, like loops and functions, are basic control structures in R that involve a special syntax followed by expressions enclosed in curly brackets. A conditional takes a logical value (TRUE or FALSE) and, based on this value, evaluates one expression or the other. The keywords used in conditionals are if (), else if (), and else. Both if and else if take a condition that may be true or false and evaluate their expression accordingly; else, on the other hand, takes no condition and is evaluated only if none of the previous conditions were met.

We can best see how these work with a simple example.

d <- 5

if (d > 7) {
  print("yay")
} else if (d < 4) {
  print("boo")
} else {
  print("hiss")
}
## [1] "hiss"

The only syntax needed for a conditional is the if statement. If the condition is not true and there is no else, the enclosed expression is simply skipped.

## This is ignored, there is no else
a <- 4
if (a > 5) {
  a <- a^2
}
a
## [1] 4
# This one is not ignored
b <- 5
if (b > 3) {
  b <- b^2 + 5
}
b
## [1] 30

Lastly, one can chain together as many else if conditions as their heart desires:

a <- 4

if (a == 1) {
  print("a is 1")
} else if (a == 2) {
  print("a is 2")
} else if (a == 3) {
  print("a is 3")
} else if (a == 4) {
  print("a is 4")
} else {
  print("I don't know what 'a' is")
}
## [1] "a is 4"
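
Conditionals become more useful once they are placed inside functions and loops. The sketch below does both; the function name describe_number is hypothetical and chosen only for this example.

```r
# Hypothetical helper: labels a single number by its sign
describe_number <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

for (v in c(-2, 0, 7)) {
  print(describe_number(v))
}
## [1] "negative"
## [1] "zero"
## [1] "positive"
```

The function returns the value of whichever branch is evaluated, so no explicit print() is needed inside the function itself.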

Question 3 Write a function called long_square that takes a single numeric vector n as its argument. If the length of the vector is greater than the square root of the sum of all the numbers in the vector, the function should print "long!". Otherwise, it should print "not long!". Verify that it works for the arguments x1, x2, and x3.

x1 <- c(1, 2, 3, 4, 5)
x2 <- c(5, 8, 10, 12)
x3 <- c(2, 5, 9, 10, 1, 1, 1)

long_square(n = x1)
## [1] "long!"
long_square(n = x2)
## [1] "not long!"
long_square(n = x3)
## [1] "long!"

Question 4 Here, we are going to modify the top_table function we wrote in Question 1 once more. In addition to having a second argument n indicating the number of results, we now want to include a third argument top which takes either TRUE or FALSE. When TRUE, the function should return the top n rows; when FALSE, it should return the bottom n.

top_table(police$state, n = 5, top = TRUE)
## v
##  CA  TX  FL  AZ  CO 
## 825 496 369 259 204
top_table(college$State, n = 10, top = FALSE)
## v
## GU VI WY AK DE HI NV ID NM ND 
##  1  1  1  2  5  6  6  7  7  8

Files

Although much of the work we have done so far has involved reading data into our R session and manipulating it within the R environment, we often need to work with the files on our computer more directly. While R provides very powerful tools for manipulating the underlying file system, we limit our attention today to working with files that are stored locally on our machine.

To begin, we can identify the directory that we are currently working in (the directory in which your Rmd file is saved) with the function getwd():

getwd()
## [1] "/home/collin/courses/sta230/labs"

We can see a list of sub-directories with list.dirs() and move to any one of them with the setwd() function

## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/")

## Recursively lists all the directories
list.dirs()
## [1] "."     "./hwk" "./sim"
## Move into the directory called sim
setwd("sim")

## Now we are in fundir/sim
getwd()
## [1] "/home/collin/courses/sta230/labs/fundir/sim"
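
As an alternative to moving around with setwd(), many base R file functions accept a path directly. The sketch below builds a throwaway directory tree with tempfile() (the sub-directory names hwk and sim only mirror the example above; nothing is read from your machine) and lists it without changing the working directory.

```r
# Build a small, temporary directory tree for illustration
root <- tempfile("fundir")
dir.create(file.path(root, "hwk"), recursive = TRUE)
dir.create(file.path(root, "sim"), recursive = TRUE)

# file.path() joins the pieces of a path with "/"
file.path("fundir", "sim")
## [1] "fundir/sim"

# List the sub-directories directly inside root, without setwd()
list.dirs(root, full.names = FALSE, recursive = FALSE)
## [1] "hwk" "sim"
```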

Once we are in a directory, we can list all of the files in a directory with list.files(). Conveniently, list.files() has a pattern argument allowing us to specify a regular expression to indicate which files we want to list.

## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")

# List all of the files in fundir/sim
list.files()
##  [1] "sim_1.csv"  "sim_10.csv" "sim_2.csv"  "sim_3.csv"  "sim_4.csv" 
##  [6] "sim_5.csv"  "sim_6.csv"  "sim_7.csv"  "sim_8.csv"  "sim_9.csv" 
## [11] "sim.R"
# List all of the files in fundir/sim with .csv extension
list.files(pattern = "csv")
##  [1] "sim_1.csv"  "sim_10.csv" "sim_2.csv"  "sim_3.csv"  "sim_4.csv" 
##  [6] "sim_5.csv"  "sim_6.csv"  "sim_7.csv"  "sim_8.csv"  "sim_9.csv"
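
Because pattern is a regular expression, "csv" matches any file name that contains those three letters anywhere. A stricter pattern such as "\\.csv$" matches only names that end in .csv. The sketch below demonstrates the difference in a temporary directory; the file names, including notes_on_csv.txt, are invented for the example.

```r
# Create a temporary directory with a few empty files
demo_dir <- tempfile("sim_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir,
                                c("sim_1.csv", "sim_2.csv", "notes_on_csv.txt"))))

# Matches every name containing "csv", including the .txt file
length(list.files(demo_dir, pattern = "csv"))
## [1] 3

# Matches only names ending in ".csv"
list.files(demo_dir, pattern = "\\.csv$")
## [1] "sim_1.csv" "sim_2.csv"
```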

list.files() returns a character vector, the contents of which can be used as arguments to other functions. For example, we have used read.csv() several times in this course, and now we can call it by passing variables as arguments:

## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")

my_sims <- list.files(pattern = "csv")

# Read the first simulation data in 
sim1 <- read.csv(my_sims[1])

# Print the first simulation data
sim1
##          V1
## 1  -1.36856
## 2   0.46344
## 3   1.87452
## 4  -0.25650
## 5  -1.41324
## 6   0.83684
## 7  -0.15509
## 8   0.15707
## 9   0.19777
## 10 -0.43873

We can combine this with other things we have learned in this lab in order to create a vector with the mean value from each of our simulations:

## Ignore this, R Markdown resets directory after each code chunk
setwd("fundir/sim")

# Create a numeric vector the same length as `my_sims` vector
mean_vec <- numeric(length = length(my_sims))

for (i in seq_along(mean_vec)) {
  
  # First read in the ith simulation
  dat <- read.csv(my_sims[i])
  
  # Compute the mean and save it to the ith spot here
  mean_vec[i] <- mean(dat$V1)
}

mean_vec
##  [1] -0.010247  0.509078 -0.508938 -0.145795 -0.498174 -0.345951  0.036010
##  [8] -0.066167  0.328893  0.097149
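
The repeated setwd() calls can be avoided by asking list.files() for full paths with full.names = TRUE. The sketch below is self-contained: it writes three small made-up CSV files to a temporary directory (these are not the course simulation files) and then applies the same read-and-average loop.

```r
# Write three tiny CSV files with known values to a temporary directory
demo_dir <- tempfile("mean_demo")
dir.create(demo_dir)
for (k in 1:3) {
  write.csv(data.frame(V1 = c(k, k + 2)),
            file.path(demo_dir, paste0("sim_", k, ".csv")),
            row.names = FALSE)
}

# full.names = TRUE returns paths that read.csv() can open from anywhere
sim_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

mean_vec <- numeric(length = length(sim_paths))
for (i in seq_along(sim_paths)) {
  dat <- read.csv(sim_paths[i])
  mean_vec[i] <- mean(dat$V1)
}

mean_vec
## [1] 2 3 4
```

Each made-up file holds the values k and k + 2, so its mean is k + 1, which makes the result easy to check by hand.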


Question 5 For this lab, download these files and unzip them in the same directory as your lab Rmd file. Contained within this zipped folder are the results of an experiment that we conducted on 30 people in three different experimental groups. The data files themselves only contain the results from the experiment; information on the subjects themselves is stored as metadata in the file name. For example, the file PatriciaFexpA.csv contains the results for the subject named Patricia, sex Female, in experimental group A. Similarly, RobertMexpC.csv contains data for Robert, sex Male, in experimental group C. Each file contains 5 observations – however, we are only interested in retaining the mean value of these observations for each subject.

Use this information along with the data in the zipped folder to recreate the plot below.

Hint: Just like numerics, character vectors can be created with character(length = n)