Functions, in their simplest definitions, can be thought of as pre-packaged snippets of code that assist in performing common or repetitive tasks. While many functions are provided by R or R packages (including nearly everything we have done), there are often times where we wish to construct our own. The goal of this lab is to introduce the components of functions in R as well as illustrate how to create them ourselves.
Functions in R are similar to functions in other programming languages, with the more important differences being beyond the scope of this class. In this portion of the lab, we will familiarize ourselves with how to create functions and use them in our own work.
Functions in R primarily consist of three components: a name, a set of arguments, and a body.
A function definition in R begins with an assignment with <- to a
name, followed by the keyword function() and any arguments
the function will take (e.g., x and y),
and then the body of the function enclosed in curly brackets
{}. Here, for example, is a function that computes the sum
of squares of two inputs, x and y:
sum_of_squares <- function(x, y) {
  x^2 + y^2
}
sum_of_squares(x = 2, y = 3)
## [1] 13
We can also write functions that take default arguments. For
example, in the sum_of_squares() function, if we do not
provide both an x and y, we will get an
error:
sum_of_squares(x = 2)
## Error in sum_of_squares(x = 2): argument "y" is missing, with no default
We can rewrite our function so that we have a default argument with
y = 3. Note that the default only applies when an
argument for y is not given; if we do supply an argument to
y, the default will be ignored.
sum_of_squares_default <- function(x, y = 3) {
  x^2 + y^2
}
# Using the default argument
sum_of_squares_default(x = 2)
## [1] 13
# Providing our own
sum_of_squares_default(x = 2, y = 10)
## [1] 104
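A default does not have to be an input value; it can also control how the function behaves. Here is a minimal sketch using a hypothetical function, power_sum() (not part of the lab), where the default sets the exponent:

```r
## Hypothetical example: x and y each raised to a power p (default 2)
power_sum <- function(x, y, p = 2) {
  x^p + y^p
}

## Uses the default p = 2, so this matches sum_of_squares(2, 3)
power_sum(x = 2, y = 3)
## [1] 13

## Overrides the default with p = 3
power_sum(x = 2, y = 3, p = 3)
## [1] 35
```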
A helpful tip: we can see the code used inside of a function by simply typing the function name into the console without adding parentheses.
## See the code for sum_of_squares
sum_of_squares
## function (x, y)
## {
## x^2 + y^2
## }
## <bytecode: 0x5e8ecfb66f70>
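If we only want one piece of a function rather than the whole printout, base R provides formals() and body(). The sketch below redefines sum_of_squares() so that it runs on its own:

```r
sum_of_squares <- function(x, y) {
  x^2 + y^2
}

## The names of the arguments
names(formals(sum_of_squares))
## [1] "x" "y"

## The body, i.e. the code between the curly brackets
body(sum_of_squares)
```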
Question 1: For this question, we will use two
separate datasets. You may find the R functions table() and sort()
helpful here (see ?table and ?sort)
police <- read.csv("https://remiller1450.github.io/data/Police2019.csv")
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Your goal is to write a function called top_table that
takes a character vector and returns the names of the values with the
top five occurrences. Then, verify that it works by printing out the top
five states in both the police and college datasets. Your results should
look like this:
top_table(police$state)
## v
## CA TX FL AZ CO
## 825 496 369 259 204
top_table(college$State)
## v
## PA NY CA TX OH
## 85 67 63 60 48
Question 2: Modify top_table so that it
takes an additional argument n that allows you to specify
that you want to view the top n values in each vector. Here, for
example, we print the top 10:
top_table(police$state, n = 10)
## v
## CA TX FL AZ CO GA OK NC OH WA
## 825 496 369 259 204 189 170 163 159 156
top_table(college$State, n = 10)
## v
## PA NY CA TX OH IL NC MA MI IN
## 85 67 63 60 48 45 40 36 34 33
Question 3: Write a function called
long_square that takes a single numeric vector as its argument
n. If the length of the vector is greater than the square
root of the sum of all the numbers in the vector, the function should
print "long!". Otherwise, it should print
"not long!". Verify that it works for the arguments
x1, x2, and x3.
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(5, 8, 10, 12)
x3 <- c(2, 5, 9, 10, 1, 1, 1)
long_square(n = x1)
## [1] "long!"
long_square(n = x2)
## [1] "not long!"
long_square(n = x3)
## [1] "long!"
Question 4: Here, we are going to modify the
top_table function we wrote in Question 1 once more. In
addition to having a second argument n indicating the
number of results, we now want to include a third argument
top which takes either TRUE or
FALSE. When TRUE, the function should return
the top n rows; when FALSE, it should return
the bottom n.
top_table(police$state, n = 5, top = TRUE)
## v
## CA TX FL AZ CO
## 825 496 369 259 204
top_table(college$State, n = 10, top = FALSE)
## v
## AK WY NV DE NM AZ DC ID NH RI
## 1 1 2 3 4 5 5 5 5 5
So far in R, we have primarily seen two different data types: vectors, an ordered collection of atomic items all of the same type (e.g., a numeric vector) and data.frames, tabular objects defined by a set number of rows and columns. In a more general definition, data.frames are simply collections of vectors, each required to be the same length.
We can see this more clearly with the str() function
(structure). Here, we investigate the iris dataset which we
see is a collection of 4 numeric vectors and one factor (character
vector with a set number of levels)
## I don't want to keep seeing a trillion bajillion rows
iris5 <- iris[1:5, ]
## Here it is as a data.frame
head(iris5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## Here, we see the underlying structure
str(iris5)
## 'data.frame': 5 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1
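To make the "collection of vectors of the same length" idea concrete, here is a small sketch that builds a data.frame from two vectors. The names id, label, and df are made up for illustration:

```r
## Two vectors of the same length
id <- 1:3
label <- c("a", "b", "c")

## Each vector becomes one column
df <- data.frame(id = id, label = label)
str(df)

## Vectors of different lengths (here 3 and 2) cannot be combined;
## tryCatch() lets us record the error instead of stopping
result <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) "error"
)
result
## [1] "error"
```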
As we’ve seen elsewhere, we can extract the vector from the
data.frame using $ notation
## Once extracted, this is a vector
class(iris5$Sepal.Length)
## [1] "numeric"
iris5$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0
We now introduce a third data type, the list object.
Like a vector, a list is an ordered array of elements. Unlike a vector,
however, each element of a list can be anything: it can be a number, a
vector, a data.frame, or even another list.
Typically, we are not making lists directly, but we can do so here to illustrate. I’ll begin by creating an empty vector and list, each of length 3, to see how they compare
## Here is my vector
v <- vector(mode = "numeric", length = 3)
v
## [1] 0 0 0
## And my list
l <- vector(mode = "list", length = 3)
l
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
The vector, as we might expect, returns a single array of 0s in
sequence, whereas our list has printed out 3 empty “containers”, each
indexed with [[]]. By default, these containers are empty
which is represented by the NULL. We can insert an item
into the list by specifying a position and an element to insert
## Put 1 in the first container of the list
l[[1]] <- 1
l
## [[1]]
## [1] 1
##
## [[2]]
## NULL
##
## [[3]]
## NULL
As mentioned, any sort of object can go into a list, including vectors and data.frames
## Let's add a vector
l[[2]] <- c(2, 4, 6, 8)
## And a data.frame
l[[3]] <- iris5
## We now have a list of a length 1 vector, a length 4 vector, and a data.frame
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2 4 6 8
##
## [[3]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
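Filling an empty list one container at a time is handy for illustration, but the same list can also be built in one step with list(). A short sketch (l2 is a made-up name so that it does not overwrite l):

```r
## Build the list in one step (iris is built into R)
l2 <- list(1, c(2, 4, 6, 8), iris[1:5, ])

## Three containers, as before
length(l2)
## [1] 3

## The class of what is inside the first and third containers
class(l2[[1]])
## [1] "numeric"
class(l2[[3]])
## [1] "data.frame"
```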
This idea of a list being a string of “containers” is a useful one.
Specifically, we can imagine a list being a big choo-choo train; yes
there are a series of cars attached to each other, but this is distinct
from what is inside the car. When subsetting our list, we need
to differentiate which we want. We do so by using []
(access the containers) or [[]] (access what is inside the
containers)
## This will grab the second container, which is still a list
l[2]
## [[1]]
## [1] 2 4 6 8
## This will grab what is inside the second container
l[[2]]
## [1] 2 4 6 8
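The difference matters as soon as we try to compute with the result. The sketch below uses a small stand-in list, ll, to show that [] always gives back a list while [[]] gives back the contents:

```r
ll <- list(1, c(2, 4, 6, 8))

## Single brackets return a list, double brackets return the contents
is.list(ll[2])
## [1] TRUE
is.list(ll[[2]])
## [1] FALSE

## Numeric functions work on the contents
sum(ll[[2]])
## [1] 20

## Single brackets can grab several containers at once
length(ll[1:2])
## [1] 2
```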
A list can also have objects associated with names. Doing so allows
us to extract elements by name rather than position. The same rules
apply with [] and [[]]
names(l) <- c("a", "b", "c")
## This will be a vector
l[["b"]]
## [1] 2 4 6 8
## This will still be a list
l['b']
## $b
## [1] 2 4 6 8
You can access all of the names of a list (or a vector or data frame)
with the names() function:
## Names of list
names(l)
## [1] "a" "b" "c"
## Names of data.frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
There is a distinction (beyond the scope of this course) between how an object in R is stored internally and how it is used by a program. For example, we use data.frames all the time to store tabular data, access variables via column names, and provide numerical summaries of our data. Internally, however, a data.frame is simply a list
## How it is used in R
class(iris5)
## [1] "data.frame"
## How it is stored internally
typeof(iris5)
## [1] "list"
This means we can use list subset notation on a data.frame. Recall that
using [[]] will extract the object in the container, while using
[] will extract the container itself
## This will still be a data.frame (with a single column)
iris5[2]
## Sepal.Width
## 1 3.5
## 2 3.0
## 3 3.2
## 4 3.1
## 5 3.6
class(iris5[2])
## [1] "data.frame"
## This will extract the vector
iris5[[2]]
## [1] 3.5 3.0 3.2 3.1 3.6
class(iris5[[2]])
## [1] "numeric"
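Two more consequences of a data.frame being a list are sketched here: its length is the number of columns, and extracting a column by name with [[]] gives the same vector as $ notation.

```r
iris5 <- iris[1:5, ]

## The "containers" of a data.frame are its columns
length(iris5)
## [1] 5
ncol(iris5)
## [1] 5

## Extracting by name with [[ ]] gives the same vector as $
identical(iris5[["Sepal.Width"]], iris5$Sepal.Width)
## [1] TRUE
```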
Likewise, we can extract named elements of a list using
$ notation
l$b
## [1] 2 4 6 8
Question 5: Write a function called name_grabber that takes two
arguments: a list, and a name. If the name is in the list, it should
return that element of the list. If not, it should return NULL. (You can
explicitly return a NULL object from a function with
return(NULL)). Confirm that your function works like the one
below:
l <- list("apple" = 1:5,
"bumblebee" = letters[1:5],
"capybara" = iris[1:5, ])
name_grabber(l, "capybara")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
name_grabber(l, "doggo")
## NULL
Lists are exceptionally useful tools in R given their flexibility in storing different data types. In the next few sections, we will see some examples of how we already use lists in statistical analysis in R without realizing it.
Recall from STA 209 the process of making a linear model. Here, we
build a regression model using the iris dataset and the
lm() function (linear model)
## Fit model
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
## See the fitted coefficients
fit
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
## 2.251 0.804 1.459 1.947
Also recall that we can run a summary() function on this
object to get t-statistics and p-values, along with collections of model
diagnostics
summary(fit)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3071 -0.2571 -0.0533 0.1954 1.4125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.251 0.370 6.09 0.0000000095681 ***
## Sepal.Width 0.804 0.106 7.56 0.0000000000042 ***
## Speciesversicolor 1.459 0.112 13.01 < 0.0000000000000002 ***
## Speciesvirginica 1.947 0.100 19.47 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared: 0.726, Adjusted R-squared: 0.72
## F-statistic: 129 on 3 and 146 DF, p-value: <0.0000000000000002
How is all of this information stored? After running the code above,
you should see the fit object in your environment; next to
it, you should also see a description: “List of 13.”
Using the str() function, we can take a closer look at
how this model is structured in R
str(fit)
## List of 13
## $ coefficients : Named num [1:4] 2.251 0.804 1.459 1.947
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica"
## $ residuals : Named num [1:150] 0.0361 0.2379 -0.1228 -0.1424 -0.1442 ...
## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
## $ effects : Named num [1:150] -71.566 -1.188 0.279 8.525 -0.114 ...
## ..- attr(*, "names")= chr [1:150] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica" ...
## $ rank : int 4
## $ fitted.values: Named num [1:150] 5.06 4.66 4.82 4.74 5.14 ...
## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
## $ assign : int [1:4] 0 1 2 2
## $ qr :List of 5
## ..$ qr : num [1:150, 1:4] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:4] "(Intercept)" "Sepal.Width" "Speciesversicolor" "Speciesvirginica"
## .. ..- attr(*, "assign")= int [1:4] 0 1 2 2
## .. ..- attr(*, "contrasts")=List of 1
## .. .. ..$ Species: chr "contr.treatment"
## ..$ qraux: num [1:4] 1.08 1.02 1.05 1.11
## ..$ pivot: int [1:4] 1 2 3 4
## ..$ tol : num 0.0000001
## ..$ rank : int 4
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 146
## $ contrasts :List of 1
## ..$ Species: chr "contr.treatment"
## $ xlevels :List of 1
## ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
## $ call : language lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## $ terms :Classes 'terms', 'formula' language Sepal.Length ~ Sepal.Width + Species
## .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Species)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
## .. .. .. ..$ : chr [1:2] "Sepal.Width" "Species"
## .. ..- attr(*, "term.labels")= chr [1:2] "Sepal.Width" "Species"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Species)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
## .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
## $ model :'data.frame': 150 obs. of 3 variables:
## ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language Sepal.Length ~ Sepal.Width + Species
## .. .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Species)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
## .. .. .. .. ..$ : chr [1:2] "Sepal.Width" "Species"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "Sepal.Width" "Species"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Species)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "factor"
## .. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
## - attr(*, "class")= chr "lm"
This is immensely useful, as it provides us with a way to extract elements from our linear model for use in our analysis. For example, we may wish to create a plot of the residuals to check that our residuals are normally distributed
hist(fit$residuals, breaks = 8)
Or perhaps we want to extract coefficients for use in a vector
v <- fit$coefficients
v
## (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
## 2.25139 0.80356 1.45874 1.94682
In general, most of the more complicated objects returned by R functions (fitted models, test results, summaries) that aren’t a vector or a data.frame are going to be lists.
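When working with a list like this, names() is usually quicker to read than the full str() output. The sketch below refits the same model so that it runs on its own, and also shows coef(), a base R helper that returns the same values as fit$coefficients:

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

## The names of the 13 elements shown by str(fit)
names(fit)

## [[ ]] works just like $ for extracting an element
identical(fit[["df.residual"]], fit$df.residual)
## [1] TRUE

## coef() returns the same values as fit$coefficients
all.equal(coef(fit), fit$coefficients)
## [1] TRUE
```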
Question 6:
Using the college dataset, create a 2 way table with
the variables Type and Region.
Use the chisq.test() function to test the hypothesis
that these variables are independent.
Formula for the test statistic:
\[ \chi^2 = \sum_{i=1} \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} \]
college <- read.csv("https://collinn.github.io/data/college2019.csv")
Question 7: Just as lm() returns a list
object, so too does calling summary() on a fitted
model
ss <- summary(fit)
typeof(ss)
## [1] "list"
Given a linear model and a Type I error rate alpha, write a function called countsig that returns how many predictors (including the intercept) are significant at the given alpha level. It should work like this:
## Some fitted model
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
## We see 4 covariates (including intercept) have p < alpha
summary(fit)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3071 -0.2571 -0.0533 0.1954 1.4125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.251 0.370 6.09 0.0000000095681 ***
## Sepal.Width 0.804 0.106 7.56 0.0000000000042 ***
## Speciesversicolor 1.459 0.112 13.01 < 0.0000000000000002 ***
## Speciesvirginica 1.947 0.100 19.47 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared: 0.726, Adjusted R-squared: 0.72
## F-statistic: 129 on 3 and 146 DF, p-value: <0.0000000000000002
countsig(fit, alpha = 0.05)
## [1] 4
Verify your function works by checking it on the following models. That is, your solution should include the code for your function and verification that it returns the same values with the arguments given
## Has one significant at alpha = 0.05
fit1 <- lm(circumference ~ age, data = Orange)
## Has one significant at alpha = 0.05 and three significant at alpha = 0.1 (including intercept)
fit2 <- lm(Murder ~ Assault + UrbanPop, data = USArrests)
## Only two sig at 0.05
fit3 <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
## Your function should take these arguments and return the same values
countsig(fit1, alpha = 0.05)
## [1] 1
countsig(fit2, alpha = 0.1)
## [1] 3
countsig(fit3, alpha = 0.2)
## [1] 2
A functional is a special type of function in R in that, rather than simply taking data arguments, it can also take other functions. The utility of this comes when we remember the purpose of writing functions: generalize our ideas and avoid repetitive copy-and-pasting.
R comes with a number of functionals built-in, with two classes in particular worth considering here. The first is the *apply family of functions; the second is a set of functionals common across functional programming languages.
The *apply family of functions in R is a collection of functions
that, broadly, take two arguments: a list or vector to iterate over and
a function. This makes them similar to loops in a number of ways (in
fact, any apply functional can be written as a loop). The first of these
we will introduce is lapply() or “list” apply. This works
by taking a vector and a function, applying the function to each element
of the vector, then returning a list of values
[Figure: lapply.png]
You can see the argument syntax with ?lapply. The first
argument will be a list or vector, the second will be a function,
typically with one argument, that takes each element of the vector. Note
that if I am passing in a function with a name, I simply give the name
as the argument (no parentheses or other arguments)
lapply(1:5, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.4142
##
## [[3]]
## [1] 1.7321
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.2361
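As noted above, any apply functional can be written as a loop. Here is a minimal sketch of the loop that matches lapply(1:5, sqrt); the name out is made up for illustration:

```r
## Loop version: create an empty list, then fill each container
out <- vector(mode = "list", length = 5)
for (i in 1:5) {
  out[[i]] <- sqrt(i)
}

## Same result as the lapply() call
identical(out, lapply(1:5, sqrt))
## [1] TRUE
```

The lapply() version says the same thing in one line and does not require us to set up the empty list ourselves.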
If we are using a function that takes more than one argument,
we can pass the additional arguments in following the function (note in the
documentation, this ability is represented with the ...
argument, but we need not worry about that for now)
lapply(1:5, paste, "X")
## [[1]]
## [1] "1 X"
##
## [[2]]
## [1] "2 X"
##
## [[3]]
## [1] "3 X"
##
## [[4]]
## [1] "4 X"
##
## [[5]]
## [1] "5 X"
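The extra arguments can also be passed by name, which is often easier to read. A short sketch using round() and its digits argument, plus the sep argument of paste():

```r
## digits = 1 is passed along to round() for every element
lapply(c(1.234, 5.678), round, digits = 1)
## [[1]]
## [1] 1.2
##
## [[2]]
## [1] 5.7

## Several extra arguments can be given at once
lapply(1:2, paste, "X", sep = "-")
## [[1]]
## [1] "1-X"
##
## [[2]]
## [1] "2-X"
```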
More commonly, though, we are writing anonymous functions for use with lapply. An anonymous function is simply a function without a name and works identically to how we wrote functions above
lapply(1:5, function(x) x^2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
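Anonymous functions can also span several lines by using curly brackets, just like named functions. In addition, R versions 4.1 and later accept \(x) as shorthand for function(x); the sketch below assumes such a version and will not parse on older ones.

```r
## A multi-line anonymous function
res <- lapply(1:3, function(x) {
  y <- x^2
  y + 1
})
unlist(res)
## [1]  2  5 10

## Shorthand (assumes R >= 4.1): \(x) means function(x)
identical(lapply(1:5, \(x) x^2), lapply(1:5, function(x) x^2))
## [1] TRUE
```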
The key idea with lapply is that, by returning a list of
objects, we are able to perform useful functions that can have
complicated return values. For example, lapply on a list of file names
will give us a list of data.frames
## Grab the first 3
ff <- list.files("fundir/sim/", full.names = TRUE)[1:3]
df_list <- lapply(ff, read.csv)
df_list
## [[1]]
## V1
## 1 -1.36856
## 2 0.46344
## 3 1.87452
## 4 -0.25650
## 5 -1.41324
## 6 0.83684
## 7 -0.15509
## 8 0.15707
## 9 0.19777
## 10 -0.43873
##
## [[2]]
## V1
## 1 0.064383
## 2 0.645119
## 3 1.172006
## 4 -1.047275
## 5 -0.263830
## 6 -1.964384
## 7 2.731564
## 8 0.056206
## 9 -0.041927
## 10 3.738920
##
## [[3]]
## V1
## 1 -0.536709
## 2 -1.460844
## 3 1.016002
## 4 0.683450
## 5 -0.126251
## 6 -2.793817
## 7 0.227314
## 8 -1.306102
## 9 -0.865014
## 10 0.072594
It is an exceptionally common task to call lapply on a list of data.frames. For example, we may want to verify that each of our individual data.frames has exactly 10 observations
lapply(df_list, function(x) nrow(x) == 10)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
As a quick note, we can always “unlist” an object with the use of
unlist()
v <- lapply(1:5, function(x) x^2)
unlist(v)
## [1] 1 4 9 16 25
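Combining lapply() with unlist() gives a compact way to summarize a check across a whole list. In the sketch below, df_list is a simulated stand-in for the data.frames read from files above:

```r
## Hypothetical stand-in for the list of data.frames read from files
df_list <- lapply(1:3, function(i) data.frame(V1 = rnorm(10)))

## One TRUE/FALSE per data.frame, collapsed to a logical vector
checks <- unlist(lapply(df_list, function(x) nrow(x) == 10))
checks
## [1] TRUE TRUE TRUE

## A single answer: do all of them have 10 rows?
all(checks)
## [1] TRUE
```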
Question 8: Rewrite your loop from Question 4 of the
previous lab so that it uses lapply() instead. (Note: You
might wait until reviewing the material on Reduce() below for a trick
on completing this more efficiently)
Question 9: Look at the documentation for
lapply(). How is sapply() different? The code
below is not run, but you can use it to investigate. Can you perform the
same task with vapply()? See examples in the documentation
if you get stuck
## Create list of data
df_list <- lapply(1:5, function(x) {
  data.frame(x = rnorm(10),
             y = rnorm(10))
})
## With lapply()
lapply(df_list, function(df) {
  cor(df$x, df$y)
})
## With sapply()
sapply(df_list, function(df) {
  cor(df$x, df$y)
})
Question 10: The apply() function is
slightly different from the rest of the *apply family in that rather
than taking a vector or list of objects, it operates on a matrix (a
matrix is similar to a data.frame with the exception that all of the
entries must be of the same type, e.g., numeric or character). Use
apply() to verify the row and column sums given below.
set.seed(123)
m <- matrix(rnorm(25), nrow = 5)
## Verify these values with apply
rowSums(m)
## [1] 3.09776 0.87043 -2.29820 -0.53320 -1.97005
colSums(m)
## [1] 0.96785 -0.22159 1.53951 0.54671 -3.66573
The next set of functionals worth knowing are Reduce, Filter, Map,
and Negate. You can see all of their documentation at once by viewing
?Reduce. The most commonly used of these is
Reduce(), which iteratively applies a binary function (a
function with two arguments) to elements of a list or vector, with the
option to include each of the intermediate steps. This, for example, is
useful in computing a cumulative sum:
Reduce(f = sum, x = 1:5, accumulate = TRUE)
## [1] 1 3 6 10 15
Without accumulate, it returns only the final result, in this case the sum of the vector
Reduce(sum, 1:5)
## [1] 15
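To see the order in which Reduce() combines elements, it can help to use a function whose intermediate results are easy to read. The sketch below pastes letters together, then shows that, for sums specifically, base R's cumsum() gives the same values as the accumulate example:

```r
## Each step pastes the running result onto the next element
Reduce(function(a, b) paste0(a, b), c("x", "y", "z"), accumulate = TRUE)
## [1] "x" "xy" "xyz"

## For sums specifically, cumsum() gives the same values
cumsum(1:5)
## [1] 1 3 6 10 15
```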
An extremely common use of Reduce() is to join a list of
data.frames into a larger one
library(dplyr)
df_list <- lapply(LETTERS[1:3], function(val) {
  data.frame(group = val,
             x = rnorm(5),
             y = rnorm(5))
})
df_list
## [[1]]
## group x y
## 1 A -1.68669 0.42646
## 2 A 0.83779 -0.29507
## 3 A 0.15337 0.89513
## 4 A -1.13814 0.87813
## 5 A 1.25381 0.82158
##
## [[2]]
## group x y
## 1 B 0.688640 -0.69471
## 2 B 0.553918 -0.20792
## 3 B -0.061912 -1.26540
## 4 B -0.305963 2.16896
## 5 B -0.380471 1.20796
##
## [[3]]
## group x y
## 1 C -1.123109 0.253319
## 2 C -0.402885 -0.028547
## 3 C -0.466655 -0.042870
## 4 C 0.779965 1.368602
## 5 C -0.083369 -0.225771
Reduce(full_join, df_list)
## group x y
## 1 A -1.686693 0.426464
## 2 A 0.837787 -0.295071
## 3 A 0.153373 0.895126
## 4 A -1.138137 0.878133
## 5 A 1.253815 0.821581
## 6 B 0.688640 -0.694707
## 7 B 0.553918 -0.207917
## 8 B -0.061912 -1.265396
## 9 B -0.305963 2.168956
## 10 B -0.380471 1.207962
## 11 C -1.123109 0.253319
## 12 C -0.402885 -0.028547
## 13 C -0.466655 -0.042870
## 14 C 0.779965 1.368602
## 15 C -0.083369 -0.225771
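When every data.frame in the list has exactly the same columns, as here, a base R alternative is to stack them with rbind() instead of joining them. This is a different technique from the full_join() approach above (it does no matching on shared columns), sketched under that assumption:

```r
## Simulated list of data.frames sharing the same columns
df_list <- lapply(LETTERS[1:3], function(val) {
  data.frame(group = val, x = rnorm(5), y = rnorm(5))
})

## Stack the data.frames one pair at a time
stacked <- Reduce(rbind, df_list)
dim(stacked)
## [1] 15 3

## do.call() instead passes the whole list to rbind() in a single call
stacked2 <- do.call(rbind, df_list)
dim(stacked2)
## [1] 15 3
```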
Finally, we briefly introduce Map(), which works like
lapply() but is able to iterate over any number of vectors or lists at once (see the
documentation for how the arguments are used). This is useful when we
want to apply a function element by element across several lists in parallel. Note
that, unlike lapply(), the function is the first argument. Like
lapply(), the return object is a list
a <- 1:5
b <- 6:10
c <- 11:15
## Here a -> x, b -> y, and c -> z
Map(function(x, y, z) {
  (y - z)^x
}, a, b, c)
## [[1]]
## [1] -5
##
## [[2]]
## [1] 25
##
## [[3]]
## [1] -125
##
## [[4]]
## [1] 625
##
## [[5]]
## [1] -3125
## Though for simple vectors, we can often do this directly without functionals
(b-c)^a
## [1] -5 25 -125 625 -3125
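Filter() and Negate() were named above but not demonstrated. Here is a brief sketch of both, using a made-up helper function is_even():

```r
is_even <- function(x) x %% 2 == 0

## Keep the elements for which the function returns TRUE
Filter(is_even, 1:10)
## [1] 2 4 6 8 10

## Negate() flips a TRUE/FALSE function, so this keeps the odd numbers
Filter(Negate(is_even), 1:10)
## [1] 1 3 5 7 9
```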