Introduction

This lab serves as an introduction to SQL (Structured Query Language) and how SQL is often implemented in R

Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases. A system that manages a relational database is known (surprisingly) as an relational database management system (RDBMS). I bring this up only because you will see the term RDBMS thrown around a lot in SQL-land.

The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.

The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.

In effect, this will work nearly identically to our joins lab, where instead of having the data stored locally, it will be kept (and processed) on an RDBMS server, and instead of dplyr, we will be using syntax from SQL.

Copy and run the line below in your console (do not put in your Rmd file) to install the packages necessary for this lab:

install.packages(c("DBI", "odbc", "RMySQL", "dbplyr"))

Lab

library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)

We start by establishing a connection to an active RDBMS server hosted here on Grinnell’s campus (note: you must be on campus wi-fi to do this). We can name this connection whatever we want (in this case, we call it con), and all of our SQL operations will be performed in the context of this connection.

con <- dbConnect(
  MySQL(),
  dbname   = "STA230",
  host     = "aiken.cs.grinnell.edu",
  port     = 3306,
  user     = "student",
  password = "password!"
)

Once we have finished working with our connection, it’s good practice to disconnect using

## Don't run this until you're finished
dbDisconnect(con)

There are only two functions we’ll need today: dbListTables(), which will list all of the tables in our relational database and dbGetQuery() which will execute a query on the SQL server. We’ll start with the former, showing us which tables we have to work with, passing in the connection we have already established

dbListTables(con)
##  [1] "Iowa_homes_1"       "Iowa_homes_2"       "Police_shootings_1"
##  [4] "Police_shootings_2" "course"             "course_offering"   
##  [7] "courses"            "customers"          "enrollments"       
## [10] "office_hour"        "orders"             "professor_info"    
## [13] "students"           "syllabus_info"

This is a poor example of what a relational database may look like in real life, as they contain a number of tables that have nothing to do with each other. However, they’ll serve for the time being. For our illustrations, we will continue using two of these tables, customers and orders.

Cheat Sheet

Here is a helpful reference sheet for common queries and commands in SQL. A short list (without examples) is also given below

  • SQL Queries:
    • SELECT: retrieve data
    • *: Wildcard, everything
      • Example: SELECT * FROM employees;
      • This returns all columns and all values from the employees table
    • DISTINCT: unique values from a column
      • Example: SELECT DISTINCT department FROM employees;
      • This returns the distinct departments in the employees table
    • WHERE: filter on conditions
    • LIMIT: limit return size
    • CASE: conditional logic
  • Filters:
    • WHERE: filter rows on conditions
    • LIKE: Match a pattern in a column
    • IN: match any value in a list
    • Between: Values in a specified range
    • IS NULL: null values
    • ORDER BY: Sort
    • AND/OR/NOT can be used in Where
  • Aggregation:
    • GROUP BY: group rows with same values
    • COUNT: count the values in a column
    • SUM: sum the values of a column
    • AVG: average the values
    • MIN: find minimum value
    • MAX: find max value
    • HAVING: filter on conditions
  • Sorting:
    • ORDER BY: sorts the output by the column(s) of interest.
    • default is ascending, use DESC to descend.

Basic Queries

Let’s begin by getting some basic information from one of our tables. We do this with the SQL command DESCRIBE. All we need to do this is the name of the connection and the table we wish to describe:

dbGetQuery(con, "DESCRIBE orders")
##       Field       Type Null Key Default Extra
## 1 row_names       text  YES        <NA>      
## 2     order bigint(20)  YES        <NA>      
## 3        id bigint(20)  YES        <NA>      
## 4      date       text  YES        <NA>

This returns a collection of column names (Field) and the type of value (Type). The (Null) column indicates if a column is allowed to contain null values (yes by default), (Key) if a key has been established in the RBDMS (it hasn’t), (Default) describes how new data is handled when adding rows (we don’t need this now), and (Extra) describes extra attributes a column may have (also not needed). In any event, we see we have four columns, two of type text and two of type integer.

We can run a query and return a data.frame using the syntax SELECT [something] FROM [table name]. The [something] can be a comma separated list of variables names or we can use * to select them all

## Select all columns from orders table
dbGetQuery(con, "SELECT * FROM orders")
##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         3     3 42 Apr-15
## 4         4     4 50 Apr-17
## 5         5    11  4 Apr-18
## 6         6    12  8 Feb-11
## 7         7    13 42 Jul-15
## 8         8    14 50 Jun-17
## Select date and id from orders table
dbGetQuery(con, "SELECT date, id FROM orders")
##     date id
## 1 Jan-01  4
## 2 Feb-01  8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18  4
## 6 Feb-11  8
## 7 Jul-15 42
## 8 Jun-17 50

We can run into problems, though, if we have column names that are associated with SQL commands. For example, ORDER is a command in SQL as well as a name in this table. If I try to select it, I’ll get an error

## Select order from orders table
dbGetQuery(con, "SELECT order FROM orders")
## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'order FROM orders' at line 1

Again, because SQL is case insensitive, the fact we have order in lowercase doesn’t matter. We can get around this by specifying from which table we want to grab the order variable from the table orders using a . with the syntax TABLE.VAR_NAME

## Select order from orders table
dbGetQuery(con, "SELECT orders.order FROM orders")
##   order
## 1     1
## 2     2
## 3     3
## 4     4
## 5    11
## 6    12
## 7    13
## 8    14

I can combine queries with different filtering functions, though the order matters. For example, to get distinct IDs from the order list, I would use

dbGetQuery(con, "SELECT DISTINCT id FROM orders")
##   id
## 1  4
## 2  8
## 3 42
## 4 50

whereas to put a proper filter on this, I would use WHERE

dbGetQuery(con, "SELECT * FROM orders WHERE id<10")
##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         5    11  4 Apr-18
## 4         6    12  8 Feb-11

We can select from a list of values using IN syntax, where the list is a parenthetical comma separated list

dbGetQuery(con, "SELECT * FROM orders WHERE id IN (4, 8, 42)")
##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         3     3 42 Apr-15
## 4         5    11  4 Apr-18
## 5         6    12  8 Feb-11
## 6         7    13 42 Jul-15

Problem 1: First, display the variables that are available in the customers table. Return the id and name columns in all instances where the row names are less than 4

Joins and Aliasing

One of the most common operations when working with SQL and tables is to perform a join. Joins are called using JOIN and ON to tell us what type of join we wish and how we want to match them

dbGetQuery(con, "SELECT * from orders
           LEFT JOIN customers ON orders.id = customers.id")
##   row_names order id   date row_names id    name
## 1         1     1  4 Jan-01         1  4   Tukey
## 2         5    11  4 Apr-18         1  4   Tukey
## 3         2     2  8 Feb-01         2  8 Wickham
## 4         6    12  8 Feb-11         2  8 Wickham
## 5         3     3 42 Apr-15         6 42     Cox
## 6         7    13 42 Jul-15         6 42     Cox
## 7         4     4 50 Apr-17      <NA> NA    <NA>
## 8         8    14 50 Jun-17      <NA> NA    <NA>

Ok, what’s happening here:

  • orders.id = customers.id tells us which variables from which table with the table.varname syntax
  • LEFT JOIN tells us left join
  • Because we used SELECT * we are retaining all columns from both tables. Probably not what we want

If all I want is the customer name from the customers table, I need to specify that I want everything from orders and only the customer name form customers. That would look something like this

dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
           LEFT JOIN customers ON orders.id = customers.id")
##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         7    13 42 Jul-15     Cox
## 7         4     4 50 Apr-17    <NA>
## 8         8    14 50 Jun-17    <NA>

It is no problem, then, to combine this with things we covered previously. For example, let’s take this join and only keep those with row_names less than 7

dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
           LEFT JOIN customers ON orders.id = customers.id
           WHERE orders.row_names<7")
##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         4     4 50 Apr-17    <NA>

Magic.

This brings us to the topic of aliasing. Aliasing involves assigning another name to either a variable in a table or a table itself. Aliasing can be done with AS

## Change column names to all caps
dbGetQuery(con, "SELECT date AS DATE, id AS ID FROM orders AS o")
##     DATE ID
## 1 Jan-01  4
## 2 Feb-01  8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18  4
## 6 Feb-11  8
## 7 Jul-15 42
## 8 Jun-17 50

In this case, we see the column names have changed, but there has been no apparent use of the alias for orders. The utility of this becomes clearer when we consider its use in joins

dbGetQuery(con, "SELECT * FROM orders AS o
           LEFT JOIN customers AS c ON o.id = c.id")
##   row_names order id   date row_names id    name
## 1         1     1  4 Jan-01         1  4   Tukey
## 2         5    11  4 Apr-18         1  4   Tukey
## 3         2     2  8 Feb-01         2  8 Wickham
## 4         6    12  8 Feb-11         2  8 Wickham
## 5         3     3 42 Apr-15         6 42     Cox
## 6         7    13 42 Jul-15         6 42     Cox
## 7         4     4 50 Apr-17      <NA> NA    <NA>
## 8         8    14 50 Jun-17      <NA> NA    <NA>

Here, by aliasing tables as o and c respectively, we can call o.id = c.id rather than order.id = customer.id. Whether this was useful in this case is besides the point; aliasing is common in the wild and it’s important to see its syntax. When we use it in conjunction with joins or filters, the alias carries throughout. Here is the same query we ran before

dbGetQuery(con, "SELECT o.*, c.name FROM orders AS o
           LEFT JOIN customers AS c ON o.id = c.id
           WHERE o.row_names<7")
##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         4     4 50 Apr-17    <NA>

Problem 2: Write a SQL query that performs the following:

  • Selects only order, id, and date from orders table, but change column names to be all caps
  • Left join on customers by id
  • Filter to only include ID values of 4, 8, and 50
  • (Hint: If something is ambiguous, it helps to specify which table it comes from)

Your output should look like this

##   ID   DATE ORDERS
## 1  4 Jan-01      1
## 2  4 Apr-18     11
## 3  8 Feb-01      2
## 4  8 Feb-11     12
## 5 50 Apr-17      4
## 6 50 Jun-17     14

Note that we also have RIGHT JOIN and INNER JOIN available to use as appropriate

Anti Joins

Recall that anti-joins are used to identify implicit missing values: which entries do I have that are missing values in a corresponding table?

For example, suppose we wanted to find all of the customers in our database who haven’t placed an order. To do this with dplyr, we would run something like

customers <- dbGetQuery(con, "SELECT * FROM customers")
orders <- dbGetQuery(con, "SELECT * FROM orders")

anti_join(customers, orders, by = "id")
##   row_names id   name
## 1         3 15  Mason
## 2         4 16 Jordan
## 3         5 23  Patil

Anti-joins don’t exist directly in SQL, but there is a way we can deploy them. Start by recognizing that what we are looking for is all customers without a corresponding ID in orders. A left join on customers will allow us to see which these are

dbGetQuery(con, "select * from customers as c
           LEFT JOIN orders as o ON o.id=c.id")
##   row_names id    name row_names order id   date
## 1         1  4   Tukey         1     1  4 Jan-01
## 2         2  8 Wickham         2     2  8 Feb-01
## 3         6 42     Cox         3     3 42 Apr-15
## 4         1  4   Tukey         5    11  4 Apr-18
## 5         2  8 Wickham         6    12  8 Feb-11
## 6         6 42     Cox         7    13 42 Jul-15
## 7         3 15   Mason      <NA>    NA NA   <NA>
## 8         4 16  Jordan      <NA>    NA NA   <NA>
## 9         5 23   Patil      <NA>    NA NA   <NA>

We see just as above, there are three names without corresponding orders: Mason, Jordan, and Patil. To select only those with a missing ID in orders, we can use a WHERE statement

dbGetQuery(con, "select * from customers as c
           LEFT JOIN orders as o ON o.id=c.id
           WHERE o.id IS NULL")
##   row_names id   name row_names order id date
## 1         3 15  Mason      <NA>    NA NA <NA>
## 2         4 16 Jordan      <NA>    NA NA <NA>
## 3         5 23  Patil      <NA>    NA NA <NA>

Then, to extract only the names, we need to modify our SELECT statement to only grab the variable name from the table customers (aliased as c)

## Ta-da!
dbGetQuery(con, "select c.name from customers as c
           LEFT JOIN orders as o ON o.id=c.id
           WHERE o.id IS NULL")
##     name
## 1  Mason
## 2 Jordan
## 3  Patil

Exercises

Here are a few exercises covering just what we’ve learned. For these, we will be using students, enrollments, and courses (not course)

dbListTables(con)
##  [1] "Iowa_homes_1"       "Iowa_homes_2"       "Police_shootings_1"
##  [4] "Police_shootings_2" "course"             "course_offering"   
##  [7] "courses"            "customers"          "enrollments"       
## [10] "office_hour"        "orders"             "professor_info"    
## [13] "students"           "syllabus_info"

Problem 3: Identify the all unique student IDs for students who are enrolled in at least one course

Problem 4: Identify all of the courses (by course name) that have at least one student enrolled

Problem 5: How do you think you join on multiple tables? Try to produce the following output showing each student and the name of the courses in which they’re enrolled. Note also the change in variable names

##   Student_Name                Course_Name
## 1        Alice        Intro to Statistics
## 2        Alice             Linear Algebra
## 3          Bob        Intro to Statistics
## 4          Bob               Econometrics
## 5      Charlie               Econometrics
## 6        Fiona Intro to Political Science
## 7       George             Linear Algebra

Problem 6: Return the names and majors of all students not enrolled in any classes.

Aggregation Functions

sorting and ordering

group_by, count, sum, avg, having

paste commands for convenience