This lab serves as an introduction to SQL (Structured Query Language) and how SQL is often implemented in R
Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases. A system that manages a relational database is known (surprisingly) as an relational database management system (RDBMS). I bring this up only because you will see the term RDBMS thrown around a lot in SQL-land.
The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.
The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.
In effect, this will work nearly identically to our joins
lab, where instead of having the data stored locally, it will be
kept (and processed) on an RDBMS server, and instead of
dplyr, we will be using syntax from SQL.
Copy and run the line below in your console (do not put in your Rmd file) to install the packages necessary for this lab:
install.packages(c("DBI", "odbc", "RMySQL", "dbplyr"))
library(DBI)
library(odbc)
library(RMySQL)
library(tidyverse)
library(dbplyr)
We start by establishing a connection to an active RDBMS server
hosted here on Grinnell’s campus (note: you must be on campus wi-fi to
do this). We can name this connection whatever we want (in this case, we
call it con), and all of our SQL operations will be
performed in the context of this connection.
con <- dbConnect(
MySQL(),
dbname = "STA230",
host = "aiken.cs.grinnell.edu",
port = 3306,
user = "student",
password = "password!"
)
Once we have finished working with our connection, it’s good practice to disconnect using
## Don't run this until you're finished
dbDisconnect(con)
There are only two functions we’ll need today:
dbListTables(), which will list all of the tables in our
relational database and dbGetQuery() which will execute a
query on the SQL server. We’ll start with the former, showing us which
tables we have to work with, passing in the connection we have already
established
dbListTables(con)
## [1] "Iowa_homes_1" "Iowa_homes_2" "Police_shootings_1"
## [4] "Police_shootings_2" "course" "course_offering"
## [7] "courses" "customers" "enrollments"
## [10] "office_hour" "orders" "professor_info"
## [13] "students" "syllabus_info"
This is a poor example of what a relational database may look like in
real life, as they contain a number of tables that have nothing to do
with each other. However, they’ll serve for the time being. For our
illustrations, we will continue using two of these tables,
customers and orders.
Here is a helpful reference sheet for common queries and commands in SQL. A short list (without examples) is also given below
Let’s begin by getting some basic information from one of our tables.
We do this with the SQL command DESCRIBE. All we need to do
this is the name of the connection and the table we wish to
describe:
dbGetQuery(con, "DESCRIBE orders")
## Field Type Null Key Default Extra
## 1 row_names text YES <NA>
## 2 order bigint(20) YES <NA>
## 3 id bigint(20) YES <NA>
## 4 date text YES <NA>
This returns a collection of column names (Field) and the type of value (Type). The (Null) column indicates if a column is allowed to contain null values (yes by default), (Key) if a key has been established in the RBDMS (it hasn’t), (Default) describes how new data is handled when adding rows (we don’t need this now), and (Extra) describes extra attributes a column may have (also not needed). In any event, we see we have four columns, two of type text and two of type integer.
We can run a query and return a data.frame using the syntax
SELECT [something] FROM [table name]. The
[something] can be a comma separated list of variables
names or we can use * to select them all
## Select all columns from orders table
dbGetQuery(con, "SELECT * FROM orders")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 3 3 42 Apr-15
## 4 4 4 50 Apr-17
## 5 5 11 4 Apr-18
## 6 6 12 8 Feb-11
## 7 7 13 42 Jul-15
## 8 8 14 50 Jun-17
## Select date and id from orders table
dbGetQuery(con, "SELECT date, id FROM orders")
## date id
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18 4
## 6 Feb-11 8
## 7 Jul-15 42
## 8 Jun-17 50
We can run into problems, though, if we have column names that are
associated with SQL commands. For example, ORDER is a
command in SQL as well as a name in this table. If I try to select it,
I’ll get an error
## Select order from orders table
dbGetQuery(con, "SELECT order FROM orders")
## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'order FROM orders' at line 1
Again, because SQL is case insensitive, the fact we have
order in lowercase doesn’t matter. We can get around this
by specifying from which table we want to grab the order
variable from the table orders using a . with
the syntax TABLE.VAR_NAME
## Select order from orders table
dbGetQuery(con, "SELECT orders.order FROM orders")
## order
## 1 1
## 2 2
## 3 3
## 4 4
## 5 11
## 6 12
## 7 13
## 8 14
I can combine queries with different filtering functions, though the order matters. For example, to get distinct IDs from the order list, I would use
dbGetQuery(con, "SELECT DISTINCT id FROM orders")
## id
## 1 4
## 2 8
## 3 42
## 4 50
whereas to put a proper filter on this, I would use
WHERE
dbGetQuery(con, "SELECT * FROM orders WHERE id<10")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 5 11 4 Apr-18
## 4 6 12 8 Feb-11
We can select from a list of values using IN syntax,
where the list is a parenthetical comma separated list
dbGetQuery(con, "SELECT * FROM orders WHERE id IN (4, 8, 42)")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 3 3 42 Apr-15
## 4 5 11 4 Apr-18
## 5 6 12 8 Feb-11
## 6 7 13 42 Jul-15
Problem 1: First, display the variables that are
available in the customers table. Return the
id and name columns in all instances where the
row names are less than 4
One of the most common operations when working with SQL and tables is
to perform a join. Joins are called using JOIN and
ON to tell us what type of join we wish and how we want to
match them
dbGetQuery(con, "SELECT * from orders
LEFT JOIN customers ON orders.id = customers.id")
## row_names order id date row_names id name
## 1 1 1 4 Jan-01 1 4 Tukey
## 2 5 11 4 Apr-18 1 4 Tukey
## 3 2 2 8 Feb-01 2 8 Wickham
## 4 6 12 8 Feb-11 2 8 Wickham
## 5 3 3 42 Apr-15 6 42 Cox
## 6 7 13 42 Jul-15 6 42 Cox
## 7 4 4 50 Apr-17 <NA> NA <NA>
## 8 8 14 50 Jun-17 <NA> NA <NA>
Ok, what’s happening here:
orders.id = customers.id tells us which variables from
which table with the table.varname syntaxLEFT JOIN tells us left joinSELECT * we are retaining all columns
from both tables. Probably not what we wantIf all I want is the customer name from the customers table, I need to specify that I want everything from orders and only the customer name form customers. That would look something like this
dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
LEFT JOIN customers ON orders.id = customers.id")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 7 13 42 Jul-15 Cox
## 7 4 4 50 Apr-17 <NA>
## 8 8 14 50 Jun-17 <NA>
It is no problem, then, to combine this with things we covered previously. For example, let’s take this join and only keep those with row_names less than 7
dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
LEFT JOIN customers ON orders.id = customers.id
WHERE orders.row_names<7")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 4 4 50 Apr-17 <NA>
Magic.
This brings us to the topic of aliasing. Aliasing involves assigning
another name to either a variable in a table or a table itself. Aliasing
can be done with AS
## Change column names to all caps
dbGetQuery(con, "SELECT date AS DATE, id AS ID FROM orders AS o")
## DATE ID
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18 4
## 6 Feb-11 8
## 7 Jul-15 42
## 8 Jun-17 50
In this case, we see the column names have changed, but there has
been no apparent use of the alias for orders. The utility
of this becomes clearer when we consider its use in joins
dbGetQuery(con, "SELECT * FROM orders AS o
LEFT JOIN customers AS c ON o.id = c.id")
## row_names order id date row_names id name
## 1 1 1 4 Jan-01 1 4 Tukey
## 2 5 11 4 Apr-18 1 4 Tukey
## 3 2 2 8 Feb-01 2 8 Wickham
## 4 6 12 8 Feb-11 2 8 Wickham
## 5 3 3 42 Apr-15 6 42 Cox
## 6 7 13 42 Jul-15 6 42 Cox
## 7 4 4 50 Apr-17 <NA> NA <NA>
## 8 8 14 50 Jun-17 <NA> NA <NA>
Here, by aliasing tables as o and c
respectively, we can call o.id = c.id rather than
order.id = customer.id. Whether this was useful in this
case is besides the point; aliasing is common in the wild and it’s
important to see its syntax. When we use it in conjunction with joins or
filters, the alias carries throughout. Here is the same query we ran
before
dbGetQuery(con, "SELECT o.*, c.name FROM orders AS o
LEFT JOIN customers AS c ON o.id = c.id
WHERE o.row_names<7")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 4 4 50 Apr-17 <NA>
Problem 2: Write a SQL query that performs the following:
Your output should look like this
## ID DATE ORDERS
## 1 4 Jan-01 1
## 2 4 Apr-18 11
## 3 8 Feb-01 2
## 4 8 Feb-11 12
## 5 50 Apr-17 4
## 6 50 Jun-17 14
Note that we also have RIGHT JOIN and
INNER JOIN available to use as appropriate
Recall that anti-joins are used to identify implicit missing values: which entries do I have that are missing values in a corresponding table?
For example, suppose we wanted to find all of the customers in our database who haven’t placed an order. To do this with dplyr, we would run something like
customers <- dbGetQuery(con, "SELECT * FROM customers")
orders <- dbGetQuery(con, "SELECT * FROM orders")
anti_join(customers, orders, by = "id")
## row_names id name
## 1 3 15 Mason
## 2 4 16 Jordan
## 3 5 23 Patil
Anti-joins don’t exist directly in SQL, but there is a way we can
deploy them. Start by recognizing that what we are looking for is all
customers without a corresponding ID in orders. A left join on
customers will allow us to see which these are
dbGetQuery(con, "select * from customers as c
LEFT JOIN orders as o ON o.id=c.id")
## row_names id name row_names order id date
## 1 1 4 Tukey 1 1 4 Jan-01
## 2 2 8 Wickham 2 2 8 Feb-01
## 3 6 42 Cox 3 3 42 Apr-15
## 4 1 4 Tukey 5 11 4 Apr-18
## 5 2 8 Wickham 6 12 8 Feb-11
## 6 6 42 Cox 7 13 42 Jul-15
## 7 3 15 Mason <NA> NA NA <NA>
## 8 4 16 Jordan <NA> NA NA <NA>
## 9 5 23 Patil <NA> NA NA <NA>
We see just as above, there are three names without corresponding
orders: Mason, Jordan, and Patil. To select only those with a missing ID
in orders, we can use a WHERE statement
dbGetQuery(con, "select * from customers as c
LEFT JOIN orders as o ON o.id=c.id
WHERE o.id IS NULL")
## row_names id name row_names order id date
## 1 3 15 Mason <NA> NA NA <NA>
## 2 4 16 Jordan <NA> NA NA <NA>
## 3 5 23 Patil <NA> NA NA <NA>
Then, to extract only the names, we need to modify our
SELECT statement to only grab the variable
name from the table customers (aliased as
c)
## Ta-da!
dbGetQuery(con, "select c.name from customers as c
LEFT JOIN orders as o ON o.id=c.id
WHERE o.id IS NULL")
## name
## 1 Mason
## 2 Jordan
## 3 Patil
Here are a few exercises covering just what we’ve learned. For these,
we will be using students, enrollments, and
courses (not course)
dbListTables(con)
## [1] "Iowa_homes_1" "Iowa_homes_2" "Police_shootings_1"
## [4] "Police_shootings_2" "course" "course_offering"
## [7] "courses" "customers" "enrollments"
## [10] "office_hour" "orders" "professor_info"
## [13] "students" "syllabus_info"
Problem 3: Identify the all unique student IDs for students who are enrolled in at least one course
Problem 4: Identify all of the courses (by course name) that have at least one student enrolled
Problem 5: How do you think you join on multiple tables? Try to produce the following output showing each student and the name of the courses in which they’re enrolled. Note also the change in variable names
## Student_Name Course_Name
## 1 Alice Intro to Statistics
## 2 Alice Linear Algebra
## 3 Bob Intro to Statistics
## 4 Bob Econometrics
## 5 Charlie Econometrics
## 6 Fiona Intro to Political Science
## 7 George Linear Algebra
Problem 6: Return the names and majors of all students not enrolled in any classes.
sorting and ordering
group_by, count, sum, avg, having
paste commands for convenience