In Class

Example code for use in class

library(dplyr)
library(nycflights13)

x <- data.frame(key = 1:3, 
                val_x = paste0("x", 1:3))
y <- data.frame(key = c(1,2,4), 
                val_y = paste0("y", 1:3))
z <- data.frame(KEY = c(2,4,6), 
                val_z = paste0("z", 1:3))

Lab

The code below provides 3 minimal datasets to be using for this lab, including:

library(dplyr)
library(ggplot2)

students <- data.frame(
  student_id = c("S001", "S002", "S003", "S004", "S005", "S007", "S008"),
  name = c("Alice", "Bob", "Charlie", "Diana", "Evan", "Fiona", "George"),
  major = c(
    "Statistics",
    "Mathematics",
    "Computer Science",
    "Statistics",
    "Physics",
    "Biology",
    "Mathematics"
  )
)

# enrollments.csv
enrollments <- data.frame(
  student_id = c("S001", "S001","S002", "S002", "S003", 
                 "S006", "S007", "S008", "S008"),
  course_id = c("C101", "C102", "C101", "C103", "C103", 
                "C101", "C104","C102", "C105"),
  semester = rep("Fall", 9)
)

# courses.csv
courses <- data.frame(
  course_id = c("C101", "C102", "C103", "C104", "C106"),
  course_name = c(
    "Intro to Statistics",
    "Linear Algebra",
    "Econometrics",
    "Intro to Political Science",
    "Statistical Computing"
  ),
  instructor = paste("Prof.", c("Wells", "Hazel", "Lee", "Jozwiak", "Friedrichsen"))
)

For each of the problems, identify the type of join to use (as relevant) and provide a response based on the joins.

Mutate Joins

Question 1: Which students are enrolled in at least one class this semester, and what courses are they taking?

Question 2: Identify all of the enrollments for which we have both student and course records

Question 3: How many students are enrolled in Intro Stat?

Question 4: Create a roster with all students and they classes in which they’re enrolled, if any. Which students are not enrolled in any classes for fall semester?

Question 5: Create a table identifying all of the courses and the names of students enrolled in each course

Question 6: Which students did not enroll in any course this semester?

Question 7: Which courses listed in the catalog do not have any enrollments? Use both a filtering joing and a merging join. How do you identify courses with each type? Which is more appropriate for this type of question?

Question 8: Why can I not combine the students dataset with the courses dataset in a single join? What do I need to do to create a list of students each faculty has?

Question 9: Construct a dataset showing all enrollment and course information and identify any explicitly missing values

Question 10: Why would using a full join be inappropriate for counting the number of current enrollments with complete records? Why would a full join be useful for auditing mistakes?

Question 11: Describe a scenario in which using the incorrect join could lead to incorrect conclusions.

Question 12: Why would it be a mistake to use an inner join to create a complete summary of student enrollments? Why my a student not be on this list?

Question 13: In the table below, what does NA in the course_id variable actually mean?

left_join(students, enrollments)
##    student_id    name            major course_id semester
## 1        S001   Alice       Statistics      C101     Fall
## 2        S001   Alice       Statistics      C102     Fall
## 3        S002     Bob      Mathematics      C101     Fall
## 4        S002     Bob      Mathematics      C103     Fall
## 5        S003 Charlie Computer Science      C103     Fall
## 6        S004   Diana       Statistics      <NA>     <NA>
## 7        S005    Evan          Physics      <NA>     <NA>
## 8        S007   Fiona          Biology      C104     Fall
## 9        S008  George      Mathematics      C102     Fall
## 10       S008  George      Mathematics      C105     Fall

Question 14: What would happen if I combine the datasets generated from the code below. Why does this happen?

anti_join(enrollments, courses)
semi_join(enrollments, courses)