Example code for use in class
library(dplyr)
library(nycflights13)
x <- data.frame(key = 1:3,
val_x = paste0("x", 1:3))
y <- data.frame(key = c(1,2,4),
val_y = paste0("y", 1:3))
z <- data.frame(KEY = c(2,4,6),
val_z = paste0("z", 1:3))
The code below provides 3 minimal datasets to be using for this lab, including:
students a dataset of students and ID informationenrollments a dataset of enrolled coursescourses a dataset with detailed course informationlibrary(dplyr)
library(ggplot2)
students <- data.frame(
student_id = c("S001", "S002", "S003", "S004", "S005", "S007", "S008"),
name = c("Alice", "Bob", "Charlie", "Diana", "Evan", "Fiona", "George"),
major = c(
"Statistics",
"Mathematics",
"Computer Science",
"Statistics",
"Physics",
"Biology",
"Mathematics"
)
)
# enrollments.csv
enrollments <- data.frame(
student_id = c("S001", "S001","S002", "S002", "S003",
"S006", "S007", "S008", "S008"),
course_id = c("C101", "C102", "C101", "C103", "C103",
"C101", "C104","C102", "C105"),
semester = rep("Fall", 9)
)
# courses.csv
courses <- data.frame(
course_id = c("C101", "C102", "C103", "C104", "C106"),
course_name = c(
"Intro to Statistics",
"Linear Algebra",
"Econometrics",
"Intro to Political Science",
"Statistical Computing"
),
instructor = paste("Prof.", c("Wells", "Hazel", "Lee", "Jozwiak", "Friedrichsen"))
)
For each of the problems, identify the type of join to use (as relevant) and provide a response based on the joins.
Question 1: Which students are enrolled in at least one class this semester, and what courses are they taking?
Question 2: Identify all of the enrollments for which we have both student and course records
Question 3: How many students are enrolled in Intro Stat?
Question 4: Create a roster with all students and they classes in which they’re enrolled, if any. Which students are not enrolled in any classes for fall semester?
Question 5: Create a table identifying all of the courses and the names of students enrolled in each course
Question 6: Which students did not enroll in any course this semester?
Question 7: Which courses listed in the catalog do not have any enrollments? Use both a filtering joing and a merging join. How do you identify courses with each type? Which is more appropriate for this type of question?
Question 8: Why can I not combine the
students dataset with the courses dataset in a
single join? What do I need to do to create a list of students each
faculty has?
Question 9: Construct a dataset showing all enrollment and course information and identify any explicitly missing values
Question 10: Why would using a full join be inappropriate for counting the number of current enrollments with complete records? Why would a full join be useful for auditing mistakes?
Question 11: Describe a scenario in which using the incorrect join could lead to incorrect conclusions.
Question 12: Why would it be a mistake to use an inner join to create a complete summary of student enrollments? Why my a student not be on this list?
Question 13: In the table below, what does
NA in the course_id variable actually
mean?
left_join(students, enrollments)
## student_id name major course_id semester
## 1 S001 Alice Statistics C101 Fall
## 2 S001 Alice Statistics C102 Fall
## 3 S002 Bob Mathematics C101 Fall
## 4 S002 Bob Mathematics C103 Fall
## 5 S003 Charlie Computer Science C103 Fall
## 6 S004 Diana Statistics <NA> <NA>
## 7 S005 Evan Physics <NA> <NA>
## 8 S007 Fiona Biology C104 Fall
## 9 S008 George Mathematics C102 Fall
## 10 S008 George Mathematics C105 Fall
Question 14: What would happen if I combine the datasets generated from the code below. Why does this happen?
anti_join(enrollments, courses)
semi_join(enrollments, courses)