This lab is intended to introduce additional topics that are essential for your understanding of R and will be needed as a precursor for performing more advanced data science tasks.
Directions (Read before starting)
As we have seen, the R language comes standard with a number of functions that are useful in data analysis tasks. To facilitate more complex operations, many people have developed their own sets of functions, typically distributed as a cohesive suite known as a package. The most popular packages are stored on the CRAN repository and can be installed directly to your computer. For example, here we install the package ggplot2:
install.packages("ggplot2")
Once a package has been installed, it needs to be loaded into your R session using the library() function before it can be used. You’ll need to re-load the package each time you create a new session, but you’ll only need to install it once. Once a package has been loaded, you will be able to freely use the functions in the package.
library(ggplot2)
# Plot the `women` dataset in R
qplot(women$height, women$weight)
In our first lab we introduced the R Script file type, a simple text file that is intended to include only executable R code and comments.
Here, we introduce R Markdown, a hybrid type of file that permits the interweaving of R code alongside text utilizing the “Markdown” authoring framework. An R Markdown file thus permits you to both write formatted text and execute R code in the same document. To begin using R Markdown, first verify that you have the package correctly installed:
install.packages("rmarkdown")
Once this has been installed, we can create a new R Markdown file by selecting: File -> New File -> R Markdown. There are a few components of the R Markdown document that are worth examining more closely.
At the top of the document is the header, which is set apart by lines of --- characters.
The second thing we see is a code chunk. Code chunks contain R code that is executed when your report is built. For now, you should keep the default setup chunk as it appears and place your actual code inside of other code chunks. You can run the R code in a chunk by highlighting the relevant code and using Ctrl-Enter.
Next, we have a number of formatting options that will render the text in Markdown; for example, text surrounded by backticks will be rendered in a code font, such as summary().
Finally, R Markdown allows us to type ordinary text outside of code chunks, creating a streamlined way to integrate written text into the same document as your code and output.
There are a couple of very important problems that R Markdown is intended to solve. First, it serves to create documents that seamlessly blend R code, output (including plots), and text into a polished report. It also serves to ensure the reproducibility of a report. When interactively doing data analysis, it is easy to change important aspects of the data or iteratively make changes to an object, but unless each step is explicitly written down and reported, it may very well become impossible to reproduce. By contrast, when an R Markdown report is generated, it clears everything from the environment and starts from scratch – only code that is included in the R Markdown file is used in the generated report. This means that starting with the raw data, anybody who generates a report with R Markdown on their machine can be sure that they are going to end up with the same results.
To generate this document, we must compile our R Markdown file using the Knit button (a blue yarn ball) located at the top of the screen. Alternatively, one can knit with Ctrl+Shift+K.
Question #8: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” block. Change the title to “Lab #1” and the author to your and your partner’s name. Next, create section header labels for each of the questions in this lab and the previous lab (there are 14 total). Do this using three # characters followed by “Question X”. Move your answers from your R Script for the first part of Lab 1 to this document.
Question #8 (continued): R Markdown will use LaTeX typesetting for any text wrapped in $ characters. For example, $\beta$ will appear as the Greek letter beta after you knit your document. To practice this, include $H_0: \mu = 0$ in a sentence (the sentence can say anything, but it should not be inside an R code chunk or a section header).
For the second half of our introduction to R labs, we are going to focus on two topics: summarizing our data and addressing miscellaneous issues when handling data.
Data summaries are precisely what the name suggests: summaries of data. Good data summaries are critical for condensing potentially enormous volumes of information into concise metrics that describe relevant properties of our data. Different types of data are better suited to different summaries, and we begin by considering a few of these in the next sections.
Numeric summaries are useful for describing data that take on continuous values, for example, height, weight, BMI, age, etc. Numeric summaries serve to tell us important details about a variable, including the maximum and minimum values (and also the range), as well as the mean and standard deviation of values. Each of these is conveniently associated with a different function in R:
## Extract life expectancy as a vector
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
life <- my_data$LifeExpectancy
mean(life) # mean
## [1] 67.83846
sd(life) # standard deviation
## [1] 11.04193
min(life) # minimum
## [1] 40.5
median(life) # median
## [1] 71.5
max(life) # maximum
## [1] 82.3
range(life) # min and maximum
## [1] 40.5 82.3
We also have functions available to tell us quantiles of numeric data. The quantiles of a numeric variable tell us about the distribution of its values. For example:
# The 0.35 quantile (35th percentile)
quantile(life, probs = 0.35)
## 35%
## 66.18
tells us that 35% of the life expectancy data has values less than 66.18 years.
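The probs argument of quantile() also accepts a vector of probabilities, so several quantiles can be requested in one call. As a brief sketch, here we request the three quartiles of the life expectancy data (these match the 1st Qu., Median, and 3rd Qu. values reported by summary() below):

```r
# Request the 25th, 50th, and 75th percentiles at once
quantile(life, probs = c(0.25, 0.5, 0.75))
```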
Finally, we consider the summary() function, which offers a quick overview combining several of the most common numerical summaries.
summary(life)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.50 61.90 71.50 67.84 76.05 82.30
Question #9 (Part 1) What class of R object does the function range() return? Show how we can use this to find the distance between the minimum and maximum values.
Question #9 (Part 2) The interquartile range is an important metric for indicating how dispersed our data is by indicating the range of the middle 50% of our data. Use any of the functions described above to give the IQR of population size from the Happy Planet dataset.
\(~\)
For categorical variables such as sex, ethnicity, disease or treatment status, frequency tables are a more concise way of summarizing the information. Here, for example, consider the Happy Planet dataset, which categorizes each of the countries in its list as belonging to one of seven different regions. The table() function allows us to quickly see how these regions compare:
table(my_data$Region)
##
## 1 2 3 4 5 6 7
## 24 24 16 33 7 12 27
One especially useful trick is recognizing that logical vectors are excellent candidates for categorical variables and can be used to summarize numeric variables in a very similar way. For example, here we consider how many countries have a life expectancy greater than 75 years:
table(my_data$LifeExpectancy > 75)
##
## FALSE TRUE
## 100 43
Using the table() function, we can construct two-way frequency tables showing the relationship between two categorical variables:
table(my_data$Region, my_data$LifeExpectancy > 75)
##
## FALSE TRUE
## 1 17 7
## 2 0 24
## 3 13 3
## 4 33 0
## 5 7 0
## 6 8 4
## 7 22 5
Here, for example, we see that in Region 1, only 7 countries have a life expectancy greater than 75 years, while the remaining 17 countries do not. Similar observations can be made about each of the other regions.
Finally, we consider prop.table(), which takes an object of class table as an argument and returns proportions based on the value of its margin argument.
## Assign table to variable tab
tab <- table(my_data$Region, my_data$LifeExpectancy > 75)
class(tab) # object of class table
## [1] "table"
## This gives proportion of row data
prop.table(tab, margin = 1)
##
## FALSE TRUE
## 1 0.7083333 0.2916667
## 2 0.0000000 1.0000000
## 3 0.8125000 0.1875000
## 4 1.0000000 0.0000000
## 5 1.0000000 0.0000000
## 6 0.6666667 0.3333333
## 7 0.8148148 0.1851852
## This gives proportion of column data
prop.table(tab, margin = 2)
##
## FALSE TRUE
## 1 0.17000000 0.16279070
## 2 0.00000000 0.55813953
## 3 0.13000000 0.06976744
## 4 0.33000000 0.00000000
## 5 0.07000000 0.00000000
## 6 0.08000000 0.09302326
## 7 0.22000000 0.11627907
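For completeness, note that when the margin argument is left at its default, prop.table() divides every cell by the grand total rather than by a row or column total, so the entire table sums to 1. A minimal sketch using the same tab object:

```r
## With no margin argument, each cell is divided by the grand total,
## so all of the proportions in the table sum to 1
prop.table(tab)
```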
Question #10 Data science involves not only knowing how to manipulate data, but also how to ask questions and make inferences based on the results. Consider the two results given by prop.table() for margin = 1 and margin = 2, and specifically consider the second row of each of these outputs, associated with Region 2. Why are these values different? What statements can be made about Region 2 based on each of these outputs?
Plots are ubiquitous in data science for their ability to quickly illustrate important aspects of a wide variety of data types. Though for this class we will spend considerable time learning to construct publication-quality plots with the ggplot2 package, here we briefly illustrate a few common plots that are included in base R.
First, for numerical data we have box plots, which demonstrate the spread of data, similar to what is reported with the summary() function.
boxplot(my_data$LifeExpectancy)
For categorical data, the bar plot is a method for visually summarizing table data:
tab <- table(my_data$Region)
barplot(tab)
Finally, a scatter plot is an excellent tool for quickly illustrating the relationship between two numeric variables. Here, we show the relationship between the average height and weight for American women aged 30-39 (see ?women to learn more about this data set):
plot(x = women$height, y = women$weight)
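Base R plots can also be labeled using the xlab, ylab, and main arguments of plot(). As a sketch (the axis labels below assume the units reported in ?women, namely inches and pounds):

```r
# Add axis labels and a title to the scatter plot
plot(x = women$height, y = women$weight,
     xlab = "Height (inches)", ylab = "Weight (lbs)",
     main = "Average heights and weights of American women aged 30-39")
```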
Question #11 (Part 1): Look at the documentation for the barplot function (?barplot) and determine which argument makes the bar plot horizontal. Plot a horizontal bar plot for the regions in the Happy Planet data.
Question #11 (Part 2): Look at the documentation for the faithful dataset in R (?faithful), which includes two variables related to the Old Faithful geyser: the waiting time between eruptions and the duration of the eruption. Create a plot investigating the relationship between these two variables, with the duration of eruptions on the x-axis and the time spent waiting on the y-axis. What are some conclusions that you can draw about this relationship?
We conclude this lab with a section addressing a few more topics that may arise when working with data. These include coercion, missing data, and factors.
In the first lab, we introduced three important types of vectors: numeric, character, and logical.
x = c(1, 2, 3)
x = c("a", "b", "c")
x = c(TRUE, FALSE, TRUE)
We also noted that a property of vectors is that each element of the vector must be the same type, and when these types are mixed, they will be coerced into the broadest category.
mix_vec <- c(1, 2, 3, "a", "b", "c")
class(mix_vec)
## [1] "character"
Here, we see that each of the elements is coerced to a character, the broadest class of vectors. Similarly, we note that internally, a logical vector with TRUE/FALSE is stored as 1/0 (this is especially useful to know). The consequence of this is that logical vectors will be coerced to numeric when the two exist together:
mix_vec <- c(TRUE, FALSE, 1, 2)
mix_vec
## [1] 1 0 1 2
class(mix_vec)
## [1] "numeric"
Somewhat unfortunately, this convention is not consistent when coercing with characters:
mix_vec <- c(TRUE, FALSE, "a")
mix_vec
## [1] "TRUE" "FALSE" "a"
Just as vectors can be automatically coerced based on their element types, we can manually coerce to whatever type we wish with the functions as.numeric(), as.logical(), and as.character():
vec <- c("1", "2", "3")
as.numeric(vec)
## [1] 1 2 3
If an element of a vector cannot be coerced, it is stored as a special element, NA:
mix_vec <- c("1", "2", "3", "a")
as.numeric(mix_vec)
## Warning: NAs introduced by coercion
## [1] 1 2 3 NA
We will look at these more in the next section.
Question #12 Consider the following code:
v <- c(TRUE, TRUE, FALSE)
v_num <- as.numeric(v)
v_char <- as.character(v_num)
v_logical <- as.logical(v_char)
v_logical
## [1] NA NA NA
Why did v_logical return only NA values? Look at the documentation for ?as.logical to see how it handles character vectors. How is this different than how it handles numeric vectors? Modify the above code by adding one line (and changing a variable name, if necessary) so that v_logical returns c(TRUE, TRUE, FALSE).
\(~\)
Real data sometimes contains missing values which, as we saw, are stored in R as a special element, NA. Sometimes this is introduced by coercion or from other operations in R, but most commonly it comes directly from the raw data that you are working with.
We can determine which elements in a vector have NA values with the function is.na():
x <- c("1", "2", "3", "a")
x <- as.numeric(x)
## Warning: NAs introduced by coercion
is.na(x)
## [1] FALSE FALSE FALSE TRUE
Just as useful, we can combine this with the “not” logical operator ! that was introduced in Lab 1. By this logic, something that is “not” TRUE is FALSE and something that is “not” FALSE is TRUE. We can use this to subset our data with a logical vector just as we did in Lab 1.
# Returns TRUE if they are NA
is.na(x)
## [1] FALSE FALSE FALSE TRUE
# The "!" operator gives the inverse of a logical vector
!is.na(x)
## [1] TRUE TRUE TRUE FALSE
# Returns only non-NA values
x[!is.na(x)]
## [1] 1 2 3
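Because TRUE is stored internally as 1 and FALSE as 0 (as noted earlier), the logical vector returned by is.na() can also be used to count and locate missing values. Using the same x as above:

```r
# Count the missing values: summing a logical vector counts its TRUEs
sum(is.na(x))
## [1] 1

# Locate the missing values: which() returns the positions of the TRUEs
which(is.na(x))
## [1] 4
```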
Missing values can be problems for functions unless they are explicitly addressed. For example, both the sum() and mean() functions will return NA if any of the values in the vector are NA:
sum(x) # returns NA
## [1] NA
mean(x) # returns NA
## [1] NA
sum(x, na.rm = TRUE) # correct
## [1] 6
mean(x, na.rm = TRUE) # correct
## [1] 2
This seems to create more work than is necessary, but the reasoning behind it is quite simple from the perspective of a data scientist: if any of your values are NA, it is important to be aware of them. If functions such as mean() and sum() simply removed them without notice, it would be entirely possible for them to slip by undetected. By setting the default as it is, the programmer is immediately notified of any potential issues, with the option to then ignore them if desired.
Question 13: Find the median value of the “GDPperCapita” variable in the Happy Planet dataset, removing missing values if necessary. Report the country names corresponding to any missing values that you removed.
This section borrows heavily from the relevant section in R for Data Science.
We conclude this lab with a brief discussion on a fourth type of vector known as a factor. Factors are primarily intended to be categorical variables, particularly when there are a known, small number of distinct values that a variable can take. This is in contrast to character vectors which may contain any number of distinct values. To illustrate this with an example, factors may be used to represent the months in a year (for which there are exactly 12) whereas character vectors may be better suited to recording subjects’ names (of which there may be any number).
Suppose that we are interested in a variable that records months
x1 <- c("Dec", "Apr", "Jan", "Mar")
Using a standard character vector, we run into a few possible issues: nothing protects us from typos (note "Jam" instead of "Jan" below), and sorting is done alphabetically rather than in calendar order:
x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
When creating factor variables, we begin by understanding that there are two distinct parts: the values the variable actually takes, and the set of valid levels it may take. Since there are only 12 possible values for our months variable, these will be our levels:
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
And we can use these to create our factor vector
y1 <- factor(x1, levels = month_levels)
y1 # prints out value of factor and the possible levels
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1) # sorted how we wish
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
If we do have a typo or include a value that is not a valid level, we will get an NA in return:
y2 <- factor(x2, levels = month_levels)
y2
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Finally, because factors are sorted by level rather than alphabetically, we have more control over how the vector is presented in plots:
# See ?month.abb to see a number of built-in constants that make fake data easier
x <- month.abb[c(1,3,2,5,12,11,10,10,3,4,1,3,2,7)]
barplot(table(x)) # sorted alphabetically
# Sorted by correct month
y <- factor(x, levels = month_levels)
barplot(table(y))
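A related consequence worth noting: when table() is applied to a factor, it reports a count for every level, including levels that never appear in the data (a plain character vector would only produce counts for the values actually present). A quick sketch using the same y as above:

```r
# Every level appears in the table, with a count of 0 for
# months that are absent from the data
table(y)
```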
Question 14: The code below loads the “colleges” data set, containing information pertaining to all primarily undergraduate institutions with at least 400 full-time students in the 2019-2020 academic year:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")