Introduction

Functions, in their simplest definitions, can be thought of as pre-packaged snippets of code that assist in performing common or repetitive tasks. While many functions are provided by R or R packages (including nearly everything we have done), there are often times where we wish to construct our own. The goal of this lab is to introduce the components of functions in R as well as illustrate how to create them ourselves.

Functions

Functions in R are similar to functions in other programming languages, with the more important differences being beyond the scope of this class. In this portion of the lab, we will familiarize ourselves with how to create functions and use them in our own work.

Functions in R primarily consist of three components:

  1. The name of the function
  2. The arguments of the function
  3. The body of the function

A function in R is created by assigning the result of function() to a name with <-. Any arguments the function will take (e.g., x and y) go inside the parentheses, followed by the body of the function enclosed in curly brackets {}. Here, for example, is a function that computes the sum of squares of two inputs, x and y:

sum_of_squares <- function(x, y) {
  x^2 + y^2
}

sum_of_squares(x = 2, y = 3)
## [1] 13
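The three components can also be inspected directly. As a minimal sketch, using the base R helpers formals() and body() (which are not otherwise part of this lab), we can pull the arguments and the body back out of sum_of_squares:

```r
## Redefine the function so this example stands on its own
sum_of_squares <- function(x, y) {
  x^2 + y^2
}

## The arguments of the function
names(formals(sum_of_squares))

## The body of the function
body(sum_of_squares)

## The name is simply whatever we assigned the function to
sum_of_squares(2, 3)
```

The name is the only component that lives outside the function itself; assigning the same function to a different name would give us a second way to call it.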

We can also write functions that take default arguments. For example, in the sum_of_squares() function, if we do not provide both an x and y, we will get an error:

sum_of_squares(x = 2)
## Error in sum_of_squares(x = 2): argument "y" is missing, with no default

We can rewrite our function so that we have a default argument with y = 3. Note that the default only applies when an argument for y is not given; if we do supply an argument to y, the default will be ignored.

sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}

# Using the default argument
sum_of_squares_default(x = 2)
## [1] 13
# Providing our own
sum_of_squares_default(x = 2, y = 10)
## [1] 104
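It may also help to see how R decides which value goes with which argument. The sketch below (reusing sum_of_squares_default from above) shows that arguments can be matched by position or by name, and that the default is only used when nothing is matched to y:

```r
sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}

## Matched by position: x = 2, y = 10
sum_of_squares_default(2, 10)

## Matched by name, so the order does not matter
sum_of_squares_default(y = 10, x = 2)

## Only the first argument is given, so y falls back to its default of 3
sum_of_squares_default(2)
```

Naming arguments is usually the safer habit once a function has more than one or two of them.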

A helpful tip: we can see the code used inside of a function by simply typing the function name into the console without adding parentheses.

## See the code for sum_of_squares
sum_of_squares
## function (x, y) 
## {
##     x^2 + y^2
## }
## <bytecode: 0x5e8ecfb66f70>

Question 1: For this question, we will use two separate datasets. You may also consider the R functions ?table() and ?sort()

police <- read.csv("https://remiller1450.github.io/data/Police2019.csv")
college <- read.csv("https://collinn.github.io/data/college2019.csv")

Your goal is to write a function called top_table that takes a character vector and returns the names of the values with the top five occurrences. Then, verify that it works by printing out the top five states in both the police and college datasets. Your results should look like this:

top_table(police$state)
## v
##  CA  TX  FL  AZ  CO 
## 825 496 369 259 204
top_table(college$State)
## v
## PA NY CA TX OH 
## 85 67 63 60 48

Question 2: Modify top_table so that it takes an additional argument n that allows you to specify that you want to view the top N values in each vector. Here, for example, we print the top 10:

top_table(police$state, n = 10)
## v
##  CA  TX  FL  AZ  CO  GA  OK  NC  OH  WA 
## 825 496 369 259 204 189 170 163 159 156
top_table(college$State, n = 10)
## v
## PA NY CA TX OH IL NC MA MI IN 
## 85 67 63 60 48 45 40 36 34 33

Question 3: Write a function called long_square that takes a single argument n, a numeric vector. If the length of the vector is greater than the square root of the sum of all the numbers in the vector, the function should print "long!". Otherwise, it should print "not long!". Verify that it works for the arguments x1, x2, and x3.

x1 <- c(1, 2, 3, 4, 5)
x2 <- c(5, 8, 10, 12)
x3 <- c(2, 5, 9, 10, 1, 1, 1)

long_square(n = x1)
## [1] "long!"
long_square(n = x2)
## [1] "not long!"
long_square(n = x3)
## [1] "long!"

Question 4: Here, we are going to modify the top_table function we wrote in Question 1 once more. In addition to having a second argument n indicating the number of results, we now want to include a third argument top which takes either TRUE or FALSE. When TRUE, the function should return the top n rows; when FALSE, it should return the bottom n.

top_table(police$state, n = 5, top = TRUE)
## v
##  CA  TX  FL  AZ  CO 
## 825 496 369 259 204
top_table(college$State, n = 10, top = FALSE)
## v
## AK WY NV DE NM AZ DC ID NH RI 
##  1  1  2  3  4  5  5  5  5  5

Lists

So far in R, we have primarily seen two different data types: vectors, an ordered collection of atomic items all of the same type (e.g., a numeric vector) and data.frames, tabular objects defined by a set number of rows and columns. In a more general definition, data.frames are simply collections of vectors, each required to be the same length.

Structure of lists

We can see this more clearly with the str() function (structure). Here, we investigate the iris dataset which we see is a collection of 4 numeric vectors and one factor (character vector with a set number of levels)

## I don't want to keep seeing a trillion bajillion rows
iris5 <- iris[1:5, ]

## Here it is as a data.frame
head(iris5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## Here, we see the underlying structure
str(iris5)
## 'data.frame':    5 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1

As we’ve seen elsewhere, we can extract the vector from the data.frame using $ notation

## Once extracted, this is a vector
class(iris5$Sepal.Length)
## [1] "numeric"
iris5$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0

We now introduce a third data type, the list object. Like a vector, a list is an ordered array of elements. Unlike a vector, however, each element of a list can be anything: it can be a number, a vector, a data.frame, or even another list.

Typically, we are not making lists directly, but we can do so here to illustrate. I’ll begin by creating an empty vector and list, each of length 3, to see how they compare

## Here is my vector
v <- vector(mode = "numeric", length = 3)
v
## [1] 0 0 0
## And my list
l <- vector(mode = "list", length = 3)
l
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

The vector, as we might expect, returns a single array of 0s in sequence, whereas our list has printed out 3 empty “containers”, each indexed with [[]]. By default, these containers are empty, which is represented by the NULL. We can insert an item into the list by specifying a position and an element to insert

## Put 1 in the first container of the list
l[[1]] <- 1
l
## [[1]]
## [1] 1
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

As mentioned, any sort of object can go into a list, including vectors and data.frames

## Let's add a vector
l[[2]] <- c(2, 4, 6, 8)

## And a data.frame 
l[[3]] <- iris5

## We now have a list of a length 1 vector, a length 4 vector, and a data.frame
l
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2 4 6 8
## 
## [[3]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
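Filling containers one at a time is mostly for illustration. As a small sketch, the same list can be built in a single call with list(), and identical() can be used to check that the two approaches give the same object (l2 is just a name chosen for this example):

```r
iris5 <- iris[1:5, ]

## Build the list one container at a time, as above
l <- vector(mode = "list", length = 3)
l[[1]] <- 1
l[[2]] <- c(2, 4, 6, 8)
l[[3]] <- iris5

## Build the same list in a single call
l2 <- list(1, c(2, 4, 6, 8), iris5)

## Check that both approaches give the same object
identical(l, l2)
```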

This idea of a list being a string of “containers” is a useful one. Specifically, we can imagine a list being a big choo-choo train; yes there are a series of cars attached to each other, but this is distinct from what is inside the car. When subsetting our list, we need to differentiate which we want. We do so by using [] (access the containers) or [[]] (access what is inside the containers)

## This will grab the second container, which is still a list
l[2]
## [[1]]
## [1] 2 4 6 8
## This will grab what is inside the second container
l[[2]]
## [1] 2 4 6 8
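The difference between the two kinds of brackets matters most when we try to use the result. Here is a short sketch with a made-up list called train (the names train, car, and cargo are only for this example):

```r
train <- list(1, c(2, 4, 6, 8), "caboose")

## Single brackets return a shorter list (the car itself)
car <- train[2]
is.list(car)
length(car)

## Double brackets return the contents (what is inside the car)
cargo <- train[[2]]
is.numeric(cargo)
length(cargo)

## Arithmetic works on the contents
sum(cargo)

## sum(car) would raise an error, since car is still a list
```

If a function complains about being handed a list when you expected a vector, a missing second bracket is a likely culprit.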

A list can also have objects associated with names. Doing so allows us to extract elements by name rather than position. The same rules apply with [] and [[]]

names(l) <- c("a", "b", "c")

## This will be a vector
l[["b"]]
## [1] 2 4 6 8
## This will still be a list
l['b']
## $b
## [1] 2 4 6 8

You can access all of the names of a list (or a vector or data frame) with the names() function:

## Names of list
names(l)
## [1] "a" "b" "c"
## Names of data.frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

There is a distinction (beyond the scope of this course) between how an object in R is stored internally and how it is used by a program. For example, we use data.frames all the time to store tabular data, access variables via column names, and provide numerical summaries of our data. Internally, however, a data.frame is simply a list

## How it is used in R
class(iris5)
## [1] "data.frame"
## How it is stored internally
typeof(iris5)
## [1] "list"
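A couple of quick checks make this relationship more concrete. The sketch below uses a small made-up list called parts to show that a data.frame behaves like a list of columns, and that a named list of equal-length vectors can be converted into a data.frame:

```r
iris5 <- iris[1:5, ]

## A data.frame passes the test for being a list
is.list(iris5)

## Its length as a list is its number of columns
length(iris5) == ncol(iris5)

## A named list of equal-length vectors converts directly to a data.frame
parts <- list(id = 1:3, label = c("a", "b", "c"))
df <- as.data.frame(parts)
class(df)
dim(df)
```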

This means we can use list subset notation on a data.frame. Recall, using [[]] will extract the object in the container, using [] will extract the container itself

## This will still be a data.frame (with a single column)
iris5[2]
##   Sepal.Width
## 1         3.5
## 2         3.0
## 3         3.2
## 4         3.1
## 5         3.6
class(iris5[2])
## [1] "data.frame"
## This will extract the vector
iris5[[2]]
## [1] 3.5 3.0 3.2 3.1 3.6
class(iris5[[2]])
## [1] "numeric"

Likewise, we can extract named elements of a list using $ notation

l$b
## [1] 2 4 6 8

Question 5: Write a function that takes two arguments: a list, and a name. If the name is in the list, it should return that element of the list. If not, it should return NULL. (You can explicitly return a NULL object from a function with return(NULL)). Confirm that your function works like the one below:

l <- list("apple" = 1:5, 
          "bumblebee" = letters[1:5], 
          "capybara" = iris[1:5, ])

name_grabber(l, "capybara")
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
name_grabber(l, "doggo")
## NULL

Lists are exceptionally useful tools in R given their flexibility in storing different data types. In the next few sections, we will see some examples of how we already use lists in statistical analysis in R without realizing it.

Uses of lists

Recall from STA 209 the process of making a linear model. Here, we build a regression model using the iris dataset and the lm() function (linear model)

## Fit model
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

## See the fitted coefficients
fit
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## 
## Coefficients:
##       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
##             2.251              0.804              1.459              1.947

Also recall that we can run a summary() function on this object to get t-statistics and p-values, along with collections of model diagnostics

summary(fit)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3071 -0.2571 -0.0533  0.1954  1.4125 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          2.251      0.370    6.09      0.0000000095681 ***
## Sepal.Width          0.804      0.106    7.56      0.0000000000042 ***
## Speciesversicolor    1.459      0.112   13.01 < 0.0000000000000002 ***
## Speciesvirginica     1.947      0.100   19.47 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.72 
## F-statistic:  129 on 3 and 146 DF,  p-value: <0.0000000000000002

How is all of this information stored? After running the code above, you should see the fit object in your environment; next to it, you should also see a description: “List of 13.”

Using the str() function, we can take a closer look at how this model is structured in R

str(fit)
## List of 13
##  $ coefficients : Named num [1:4] 2.251 0.804 1.459 1.947
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica"
##  $ residuals    : Named num [1:150] 0.0361 0.2379 -0.1228 -0.1424 -0.1442 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:150] -71.566 -1.188 0.279 8.525 -0.114 ...
##   ..- attr(*, "names")= chr [1:150] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica" ...
##  $ rank         : int 4
##  $ fitted.values: Named num [1:150] 5.06 4.66 4.82 4.74 5.14 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ assign       : int [1:4] 0 1 2 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:150, 1:4] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:4] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica"
##   .. ..- attr(*, "assign")= int [1:4] 0 1 2 2
##   .. ..- attr(*, "contrasts")=List of 1
##   .. .. ..$ Species: chr "contr.treatment"
##   ..$ qraux: num [1:4] 1.08 1.02 1.05 1.11
##   ..$ pivot: int [1:4] 1 2 3 4
##   ..$ tol  : num 0.0000001
##   ..$ rank : int 4
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 146
##  $ contrasts    :List of 1
##   ..$ Species: chr "contr.treatment"
##  $ xlevels      :List of 1
##   ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
##  $ call         : language lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
##  $ terms        :Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width + Species
##   .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Species)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
##   .. .. .. ..$ : chr [1:2] "Sepal.Width" "Species"
##   .. ..- attr(*, "term.labels")= chr [1:2] "Sepal.Width" "Species"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Species)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
##   .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
##  $ model        :'data.frame':   150 obs. of  3 variables:
##   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width + Species
##   .. .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Species)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
##   .. .. .. .. ..$ : chr [1:2] "Sepal.Width" "Species"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "Sepal.Width" "Species"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Species)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
##  - attr(*, "class")= chr "lm"

This is immensely useful, as it provides us with a way to extract elements from our linear model for use in our analysis. For example, we may wish to create a plot of the residuals to check that our residuals are normally distributed

hist(fit$residuals, breaks = 8)

Or perhaps we want to extract coefficients for use in a vector

v <- fit$coefficients
v
##       (Intercept)       Sepal.Width Speciesversicolor  Speciesvirginica 
##           2.25139           0.80356           1.45874           1.94682
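Because the fitted model is a list, the tools from the previous section apply to it directly. As a brief sketch, we can list the names of its containers, pull one out with $, and confirm that the accessor function coef() (a base R helper not used elsewhere in this lab) returns the same vector as fit$coefficients:

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

## The names of the containers in the fitted model
names(fit)

## Residual degrees of freedom, pulled out by name
fit$df.residual

## coef() is an accessor that returns the same coefficient vector
identical(coef(fit), fit$coefficients)
```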

In general, the more complicated objects returned by R functions (fitted models, test results, summaries) are very often lists under the hood, even when they print like something else.

Question 6:

  1. Using the college dataset, create a 2 way table with the variables Type and Region
  2. Use the chisq.test() function to test the hypothesis that these variables are independent
  3. Using list constructs, extract the expected values under independence and save them to an object. Recreate the \(\chi^2\) test statistic using the tables of observed and expected values (note: in the same way we can add two vectors of the same length, we can do the same with matrices. This idea extends to many arithmetical operations)

Formula:

\[ \chi^2 = \sum_{i=1} \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \]

college <- read.csv("https://collinn.github.io/data/college2019.csv")

Question 7: Just like the output of lm() creates a list object, so too does a summary object

ss <- summary(fit)
typeof(ss)
## [1] "list"
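As with the fitted model, names() is a reasonable first step for seeing what a summary object contains. The sketch below pulls out two of the values that appear in the printed summary above:

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
ss <- summary(fit)

## What containers does the summary have?
names(ss)

## Multiple R-squared and residual standard error, as printed by summary(fit)
ss$r.squared
ss$sigma
```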

Given a linear model and a Type I error rate alpha, write a function that returns how many predictors are significant at a given alpha level. It should work like this:

## Some fitted model
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

## We see 4 covariates (including intercept) have p < alpha
summary(fit)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3071 -0.2571 -0.0533  0.1954  1.4125 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          2.251      0.370    6.09      0.0000000095681 ***
## Sepal.Width          0.804      0.106    7.56      0.0000000000042 ***
## Speciesversicolor    1.459      0.112   13.01 < 0.0000000000000002 ***
## Speciesvirginica     1.947      0.100   19.47 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.72 
## F-statistic:  129 on 3 and 146 DF,  p-value: <0.0000000000000002
countsig(fit, alpha = 0.05)
## [1] 4

Verify your function works by checking it on the following models. That is, your solution should include the code for your function and verification that it returns the same values with the arguments given

## Has one significant at alpha = 0.05
fit1 <- lm(circumference ~ age, data = Orange)

## Has one significant at alpha = 0.05 and three (counting the intercept) at alpha = 0.1
fit2 <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

## Only two sig at 0.05
fit3 <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)


## Your function should take these arguments and return the same values
countsig(fit1, alpha = 0.05)
## [1] 1
countsig(fit2, alpha = 0.1)
## [1] 3
countsig(fit3, alpha = 0.2)
## [1] 2

Functionals

A functional is a special type of function in R in that, rather than simply taking data arguments, it can also take other functions. The utility of this comes when we remember the purpose of writing functions: generalize our ideas and avoid repetitive copy-and-pasting.

R comes with a number of functionals built-in, with two classes in particular worth considering here. The first is the *apply family of functions, with the second being a common set across functional programming languages

*apply Family

The *apply family of functions in R is a collection of functions that, broadly, take two arguments: a list or vector to iterate over and a function. This makes them similar to loops in a number of ways (in fact, any apply functional can be written as a loop). The first of these we will introduce is lapply() or “list” apply. This works by taking a vector and a function, applying the function to each element of the vector, then returning a list of values

knitr::include_graphics("lapply.png")

You can see the argument syntax with ?lapply. The first argument will be a list or vector, the second will be a function, typically with one argument, that takes each element of the vector. Note that if I am passing in a function with a name, I simply give the name as the argument (no parentheses or other arguments)

lapply(1:5, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.4142
## 
## [[3]]
## [1] 1.7321
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.2361
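To make the connection to loops concrete, here is a minimal sketch that computes the same square roots with a for loop and with lapply(), then checks that the two results match (out_loop and out_apply are names chosen for this example):

```r
## Loop version: create an empty list, then fill each container
out_loop <- vector(mode = "list", length = 5)
for (i in 1:5) {
  out_loop[[i]] <- sqrt(i)
}

## lapply version: no setup or indexing needed
out_apply <- lapply(1:5, sqrt)

## Check that the two results match
identical(out_loop, out_apply)
```

The lapply() version handles creating the output list and keeping track of the index for us, which is most of what makes it shorter.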

If we are using a function that takes more than one argument, we can pass the additional arguments in following the function (note in the documentation, this ability is represented with the ... argument, but we need not worry about that for now)

lapply(1:5, paste, "X")
## [[1]]
## [1] "1 X"
## 
## [[2]]
## [1] "2 X"
## 
## [[3]]
## [1] "3 X"
## 
## [[4]]
## [1] "4 X"
## 
## [[5]]
## [1] "5 X"
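The extra arguments can also be given by name. In the sketch below, each element of the vector becomes the first argument of round(), while digits = 1 is passed along unchanged to every call; the by_hand list is only there to show what lapply() is doing for us:

```r
## Each element becomes the first argument of round(); digits is passed along
rounded <- lapply(c(1.234, 5.678), round, digits = 1)

## The same thing written out one call at a time
by_hand <- list(round(1.234, digits = 1), round(5.678, digits = 1))
identical(rounded, by_hand)

## Positional and named extra arguments can be combined
lapply(1:3, paste, "X", sep = "-")
```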

More commonly, though, we are writing anonymous functions for use with lapply. An anonymous function is simply a function without a name and works identically to how we wrote functions above

lapply(1:5, function(x) x^2)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25

The key idea with lapply is that, by returning a list of objects, we are able to perform useful functions that can have complicated return values. For example, lapply on a list of file names will give us a list of data.frames

## Grab the first 3
ff <- list.files("fundir/sim/", full.names = TRUE)[1:3]

df_list <- lapply(ff, read.csv)
df_list
## [[1]]
##          V1
## 1  -1.36856
## 2   0.46344
## 3   1.87452
## 4  -0.25650
## 5  -1.41324
## 6   0.83684
## 7  -0.15509
## 8   0.15707
## 9   0.19777
## 10 -0.43873
## 
## [[2]]
##           V1
## 1   0.064383
## 2   0.645119
## 3   1.172006
## 4  -1.047275
## 5  -0.263830
## 6  -1.964384
## 7   2.731564
## 8   0.056206
## 9  -0.041927
## 10  3.738920
## 
## [[3]]
##           V1
## 1  -0.536709
## 2  -1.460844
## 3   1.016002
## 4   0.683450
## 5  -0.126251
## 6  -2.793817
## 7   0.227314
## 8  -1.306102
## 9  -0.865014
## 10  0.072594

It is an exceptionally common task to call lapply on a list of data.frames. For example, we may want to verify that each of our individual data.frames has exactly 10 observations

lapply(df_list, function(x) nrow(x) == 10)
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE

As a quick note, we can always “unlist” an object with the use of unlist()

v <- lapply(1:5, function(x) x^2)
unlist(v)
## [1]  1  4  9 16 25
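Combining unlist() with lapply() is handy for checks like the row count example above. Since the files in that example are not available here, this sketch simulates a list of data.frames instead (an assumption made only so the code runs on its own) and collapses the checks into a single answer with all():

```r
## Simulated stand-ins for data.frames read from disk
df_list <- lapply(1:3, function(i) data.frame(V1 = rnorm(10)))

## One TRUE/FALSE per data.frame, collapsed from a list to a vector
checks <- unlist(lapply(df_list, function(x) nrow(x) == 10))
checks

## A single answer: do all of them have 10 rows?
all(checks)
```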

Question 8: Rewrite your loop from Question 4 of the previous lab so that it uses lapply() instead. (Note: You might wait until reviewing Reduce() below for a trick on completing this more efficiently)

Question 9: Look at the documentation for lapply(). How is sapply() different? The code below is not run, but you can use it to investigate. Can you perform the same task with vapply()? See examples in the documentation if you get stuck

## Create list of data
df_list <- lapply(1:5, function(x) {
  data.frame(x = rnorm(10), 
             y = rnorm(10))
})

lapply(df_list, function(df) {
  cor(df$x, df$y)
})

sapply(df_list, function(df) {
  cor(df$x, df$y)
})

Question 10: The apply() function is slightly different from the rest of the *apply family in that, rather than taking a vector or list of objects, it operates on a matrix (a matrix is similar to a data.frame with the exception that all of the entries must be of the same type, e.g., numeric or character). Use apply() to verify the row and column sums given below.

set.seed(123)
m <- matrix(rnorm(25), nrow = 5)

## Verify these values with apply
rowSums(m)
## [1]  3.09776  0.87043 -2.29820 -0.53320 -1.97005
colSums(m)
## [1]  0.96785 -0.22159  1.53951  0.54671 -3.66573

Common Functionals in the Wild

The next set of functionals worth knowing are Reduce, Filter, Map, and Negate. You can see all of their documentation at once by viewing ?Reduce. The most commonly used of these is Reduce(), which iteratively applies a binary function (a function with two arguments) to elements of a list or vector, with the option to include each of the intermediate steps. This, for example, is useful in computing a cumulative sum:

Reduce(f = sum, x = 1:5, accumulate = TRUE)
## [1]  1  3  6 10 15

Without accumulate, it simply returns the last element, in this case, the sum of the vector

Reduce(sum, 1:5)
## [1] 15
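It can help to see what Reduce() is doing step by step. The sketch below uses the binary operator `+` in place of sum, writes out the equivalent nested calls by hand, and compares the accumulated version against the base R function cumsum():

```r
## Reduce with the binary function `+` ...
Reduce(`+`, 1:5)

## ... is the same as nesting the calls from left to right
((((1 + 2) + 3) + 4) + 5)

## Any two-argument function works, including one we write ourselves
Reduce(function(a, b) paste0(a, b), c("R", "e", "d", "u", "c", "e"))

## With accumulate = TRUE we get the intermediate results, matching cumsum()
identical(Reduce(`+`, 1:5, accumulate = TRUE), cumsum(1:5))
```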

An extremely common use of Reduce() is to join a list of data.frames into a larger one

library(dplyr)

df_list <- lapply(LETTERS[1:3], function(val) {
  data.frame(group = val, 
             x = rnorm(5), 
             y = rnorm(5))
})

df_list
## [[1]]
##   group        x        y
## 1     A -1.68669  0.42646
## 2     A  0.83779 -0.29507
## 3     A  0.15337  0.89513
## 4     A -1.13814  0.87813
## 5     A  1.25381  0.82158
## 
## [[2]]
##   group         x        y
## 1     B  0.688640 -0.69471
## 2     B  0.553918 -0.20792
## 3     B -0.061912 -1.26540
## 4     B -0.305963  2.16896
## 5     B -0.380471  1.20796
## 
## [[3]]
##   group         x         y
## 1     C -1.123109  0.253319
## 2     C -0.402885 -0.028547
## 3     C -0.466655 -0.042870
## 4     C  0.779965  1.368602
## 5     C -0.083369 -0.225771
Reduce(full_join, df_list)
##    group         x         y
## 1      A -1.686693  0.426464
## 2      A  0.837787 -0.295071
## 3      A  0.153373  0.895126
## 4      A -1.138137  0.878133
## 5      A  1.253815  0.821581
## 6      B  0.688640 -0.694707
## 7      B  0.553918 -0.207917
## 8      B -0.061912 -1.265396
## 9      B -0.305963  2.168956
## 10     B -0.380471  1.207962
## 11     C -1.123109  0.253319
## 12     C -0.402885 -0.028547
## 13     C -0.466655 -0.042870
## 14     C  0.779965  1.368602
## 15     C -0.083369 -0.225771
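In this example the three data.frames share all of their columns and none of their rows match, so the full join ends up simply stacking the rows. Assuming stacking is the goal, a rough base R alternative is to use rbind(), either through Reduce() or by handing the whole list to rbind() at once with do.call(). The seed and the names stacked and stacked2 are choices made only for this sketch:

```r
set.seed(1)
df_list <- lapply(LETTERS[1:3], function(val) {
  data.frame(group = val,
             x = rnorm(5),
             y = rnorm(5))
})

## Stack the data.frames two at a time with Reduce()
stacked <- Reduce(rbind, df_list)
dim(stacked)

## do.call() hands the whole list to rbind() in a single call
stacked2 <- do.call(rbind, df_list)
dim(stacked2)

## Each group still contributes its 5 rows
table(stacked$group)
```

When the data.frames have different columns, or rows that should be matched to one another, a join like the one above is the more appropriate tool.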

Finally we briefly introduce Map() which works like lapply but is able to take any number of arguments (see the documentation for how the arguments are used). This is useful when we want to apply over several lists in sequence. Like lapply(), the return object is a list

a <- 1:5
b <- 6:10
c <- 11:15

## Here a -> x, b -> y, and c -> z
Map(function(x, y, z) {
  (y - z)^x
}, a, b, c)
## [[1]]
## [1] -5
## 
## [[2]]
## [1] 25
## 
## [[3]]
## [1] -125
## 
## [[4]]
## [1] 625
## 
## [[5]]
## [1] -3125
## Though for simple vectors, we can often still do this directly without functionals
(b-c)^a
## [1]    -5    25  -125   625 -3125
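Map() is more useful when the pieces being paired up are not something we can combine with plain arithmetic. The sketch below pairs two made-up vectors, students and scores, element by element; it also shows mapply(), a related base R function that does the same pairing but tries to simplify the result to a vector:

```r
## Hypothetical data for illustration
students <- c("Ana", "Ben", "Cy")
scores <- c(90, 85, 77)

## First call uses "Ana" and 90, second uses "Ben" and 85, and so on
res <- Map(function(name, score) {
  paste(name, "scored", score)
}, students, scores)
res

## mapply() pairs elements the same way but returns a vector here
mapply(function(x, y) x + y, 1:3, 4:6)
```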