Introduction

This lab serves as an introduction to SQL (Structured Query Language) and how SQL is often implemented in R

Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases. A system that manages a relational database is known (surprisingly) as an relational database management system (RDBMS). I bring this up only because you will see the term RDBMS thrown around a lot in SQL-land.

The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.

The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.

In effect, this will work nearly identically to our joins lab, where instead of having the data stored locally, it will be kept (and processed) on an RDBMS server, and instead of dplyr, we will be using syntax from SQL.

Copy and run the line below in your console (do not put in your Rmd file) to install the packages necessary for this lab:

install.packages(c("DBI", "odbc", "RMySQL", "dbplyr"))

Lab

library(DBI)
library(odbc)
library(RMySQL)
library(dbplyr)

We start by establishing a connection to an active RDBMS server hosted here on Grinnell’s campus (note: you must be on campus wi-fi to do this). We can name this connection whatever we want (in this case, we call it con), and all of our SQL operations will be performed in the context of this connection.

con <- dbConnect(
  MySQL(),
  dbname   = "STA230",
  host     = "aiken.cs.grinnell.edu",
  port     = 3306,
  user     = "student",
  password = "password!"
)

Once we have finished working with our connection, it’s good practice to disconnect using

## Don't run this until you're finished
dbDisconnect(con)

There are only two functions we’ll need today: dbListTables(), which will list all of the tables in our relational database and dbGetQuery() which will execute a query on the SQL server. We’ll start with the former, showing us which tables we have to work with, passing in the connection we have already established

dbListTables(con)

##  [1] "Batting"            "Iowa_homes_1"       "Iowa_homes_2"      
##  [4] "MLBPlayers"         "MLBTeams"           "Police_shootings_1"
##  [7] "Police_shootings_2" "batting2"           "course"            
## [10] "course_offering"    "courses"            "customers"         
## [13] "enrollments"        "office_hour"        "orders"            
## [16] "players2"           "professor_info"     "students"          
## [19] "syllabus_info"      "teams2"

This is a poor example of what a relational database may look like in real life, as they contain a number of tables that have nothing to do with each other. However, they’ll serve for the time being. For our illustrations, we will continue using two of these tables, customers and orders.

Cheat Sheet

Here is a helpful reference sheet for common queries and commands in SQL. A short list (without examples) is also given below

SQL Queries:
- SELECT: retrieve data
- *: Wildcard, everything
  - Example: SELECT * FROM employees;
  - This returns all columns and all values from the employees table
- DISTINCT: unique values from a column
  - Example: SELECT DISTINCT department FROM employees;
  - This returns the distinct departments in the employees table
- WHERE: filter on conditions
- LIMIT: limit return size
- CASE: conditional logic
Filters:
- WHERE: filter rows on conditions
- LIKE: Match a pattern in a column
- IN: match any value in a list
- BETWEEN: Values in a specified range
- IS NULL: null values
- ORDER BY: Sort
- AND/OR/NOT can be used in Where
Aggregation:
- GROUP BY: group rows with same values
- COUNT: count the values in a column
- SUM: sum the values of a column
- AVG: average the values
- MIN: find minimum value
- MAX: find max value
- HAVING: filter on conditions
Sorting:
- ORDER BY: sorts the output by the column(s) of interest.
- default is ascending, use DESC to descend.

A quick note on computation:

In dplyr, we understand that piped functions are executed in the order they are written. That is, if we wish to mutate, group by then summarize, we would write them in that order

## Fake data, don't run this it won't work
df %>% mutate(newvar = var) %>% 
  group_by(grp) %>% 
  summarize(mean(newvar))

SQL doesn’t work like this. We will always write our SQL commands in this order:

## Written order
"SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ..."

However, it will always be executed in this order:

\[ \text{FROM} \rightarrow \text{WHERE} \rightarrow \text{GROUP BY} \rightarrow \text{HAVING} \rightarrow \text{SELECT} \rightarrow \text{ORDER BY} \] From a computational aspect, we can sort of reason it out. FROM (including joins) and WHERE start by selecting the tables we need and filtering them accordingly. This also includes any aliasing created with AS. The ensures that we are working on the smallest dataset we need. HAVING is a filter like WHERE that works on data that has already been summarized (more on this shortly), SELECT allows you to select whichever columns we need after the rest of the query has been resolved. Finally, ORDER BY takes the final query and sorts it.

Do not memorize this. But, if you run into issues below, it may be a helpful thing to come back to.

Basic Queries

Let’s begin by getting some basic information from one of our tables. We do this with the SQL command DESCRIBE. All we need to do this is the name of the connection and the table we wish to describe:

dbGetQuery(con, "DESCRIBE orders")

##       Field       Type Null Key Default Extra
## 1 row_names       text  YES        <NA>      
## 2     order bigint(20)  YES        <NA>      
## 3        id bigint(20)  YES        <NA>      
## 4      date       text  YES        <NA>

This returns a collection of column names (Field) and the type of value (Type). The (Null) column indicates if a column is allowed to contain null values (yes by default), (Key) if a key has been established in the RBDMS (it hasn’t), (Default) describes how new data is handled when adding rows (we don’t need this now), and (Extra) describes extra attributes a column may have (also not needed). In any event, we see we have four columns, two of type text and two of type integer.

We can run a query and return a data.frame using the syntax SELECT [something] FROM [table name]. The [something] can be a comma separated list of variables names or we can use * to select them all

## Select all columns from orders table
dbGetQuery(con, "SELECT * FROM orders")

##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         3     3 42 Apr-15
## 4         4     4 50 Apr-17
## 5         5    11  4 Apr-18
## 6         6    12  8 Feb-11
## 7         7    13 42 Jul-15
## 8         8    14 50 Jun-17

## Select date and id from orders table
dbGetQuery(con, "SELECT date, id FROM orders")

##     date id
## 1 Jan-01  4
## 2 Feb-01  8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18  4
## 6 Feb-11  8
## 7 Jul-15 42
## 8 Jun-17 50

We can run into problems, though, if we have column names that are associated with SQL commands. For example, ORDER is a command in SQL as well as a name in this table. If I try to select it, I’ll get an error

## Select order from orders table
dbGetQuery(con, "SELECT order FROM orders")

## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'order FROM orders' at line 1

Again, because SQL is case insensitive, the fact we have order in lowercase doesn’t matter. We can get around this by specifying from which table we want to grab the order variable from the table orders using a . with the syntax TABLE.VAR_NAME

## Select order from orders table
dbGetQuery(con, "SELECT orders.order FROM orders")

##   order
## 1     1
## 2     2
## 3     3
## 4     4
## 5    11
## 6    12
## 7    13
## 8    14

I can combine queries with different filtering functions, though the order matters. For example, to get distinct IDs from the order list, I would use

dbGetQuery(con, "SELECT DISTINCT id FROM orders")

##   id
## 1  4
## 2  8
## 3 42
## 4 50

whereas to put a proper filter on this, I would use WHERE

dbGetQuery(con, "SELECT * FROM orders WHERE id<10")

##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         5    11  4 Apr-18
## 4         6    12  8 Feb-11

We can select from a list of values using IN syntax, where the list is a parenthetical comma separated list

dbGetQuery(con, "SELECT * FROM orders WHERE id IN (4, 8, 42)")

##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         3     3 42 Apr-15
## 4         5    11  4 Apr-18
## 5         6    12  8 Feb-11
## 6         7    13 42 Jul-15

Problem 1: First, display the variables that are available in the customers table. Return the id and name columns in all instances where the row names are less than 4

Joins and Aliasing

One of the most common operations when working with SQL and tables is to perform a join. Joins are called using JOIN and ON to tell us what type of join we wish and how we want to match them

dbGetQuery(con, "SELECT * from orders
           LEFT JOIN customers ON orders.id = customers.id")

##   row_names order id   date row_names id    name
## 1         1     1  4 Jan-01         1  4   Tukey
## 2         5    11  4 Apr-18         1  4   Tukey
## 3         2     2  8 Feb-01         2  8 Wickham
## 4         6    12  8 Feb-11         2  8 Wickham
## 5         3     3 42 Apr-15         6 42     Cox
## 6         7    13 42 Jul-15         6 42     Cox
## 7         4     4 50 Apr-17      <NA> NA    <NA>
## 8         8    14 50 Jun-17      <NA> NA    <NA>

Ok, what’s happening here:

orders.id = customers.id tells us which variables from which table with the table.varname syntax
LEFT JOIN tells us left join
Because we used SELECT * we are retaining all columns from both tables. Probably not what we want

If all I want is the customer name from the customers table, I need to specify that I want everything from orders and only the customer name form customers. That would look something like this

dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
           LEFT JOIN customers ON orders.id = customers.id")

##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         7    13 42 Jul-15     Cox
## 7         4     4 50 Apr-17    <NA>
## 8         8    14 50 Jun-17    <NA>

It is no problem, then, to combine this with things we covered previously. For example, let’s take this join and only keep those with row_names less than 7

dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
           LEFT JOIN customers ON orders.id = customers.id
           WHERE orders.row_names<7")

##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         4     4 50 Apr-17    <NA>

Magic.

This brings us to the topic of aliasing. Aliasing involves assigning another name to either a variable in a table or a table itself. Aliasing can be done with AS

## Change column names to all caps
dbGetQuery(con, "SELECT date AS DATE, id AS ID FROM orders AS o")

##     DATE ID
## 1 Jan-01  4
## 2 Feb-01  8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18  4
## 6 Feb-11  8
## 7 Jul-15 42
## 8 Jun-17 50

In this case, we see the column names have changed, but there has been no apparent use of the alias for orders. The utility of this becomes clearer when we consider its use in joins

dbGetQuery(con, "SELECT * FROM orders AS o
           LEFT JOIN customers AS c ON o.id = c.id")

##   row_names order id   date row_names id    name
## 1         1     1  4 Jan-01         1  4   Tukey
## 2         5    11  4 Apr-18         1  4   Tukey
## 3         2     2  8 Feb-01         2  8 Wickham
## 4         6    12  8 Feb-11         2  8 Wickham
## 5         3     3 42 Apr-15         6 42     Cox
## 6         7    13 42 Jul-15         6 42     Cox
## 7         4     4 50 Apr-17      <NA> NA    <NA>
## 8         8    14 50 Jun-17      <NA> NA    <NA>

Here, by aliasing tables as o and c respectively, we can call o.id = c.id rather than order.id = customer.id. Whether this was useful in this case is besides the point; aliasing is common in the wild and it’s important to see its syntax. When we use it in conjunction with joins or filters, the alias carries throughout. Here is the same query we ran before

dbGetQuery(con, "SELECT o.*, c.name FROM orders AS o
           LEFT JOIN customers AS c ON o.id = c.id
           WHERE o.row_names<7")

##   row_names order id   date    name
## 1         1     1  4 Jan-01   Tukey
## 2         5    11  4 Apr-18   Tukey
## 3         2     2  8 Feb-01 Wickham
## 4         6    12  8 Feb-11 Wickham
## 5         3     3 42 Apr-15     Cox
## 6         4     4 50 Apr-17    <NA>

Problem 2: Write a SQL query that performs the following:

Selects only order, id, and date from orders table, but change column names to be all caps
Left join on customers by id
Filter to only include ID values of 4, 8, and 50
(Hint: If something is ambiguous, it helps to specify which table it comes from)

Your output should look like this

##   ID   DATE ORDERS
## 1  4 Jan-01      1
## 2  4 Apr-18     11
## 3  8 Feb-01      2
## 4  8 Feb-11     12
## 5 50 Apr-17      4
## 6 50 Jun-17     14

Note that we also have RIGHT JOIN and INNER JOIN available to use as appropriate

Anti Joins

Recall that anti-joins are used to identify implicit missing values: which entries do I have that are missing values in a corresponding table?

For example, suppose we wanted to find all of the customers in our database who haven’t placed an order. To do this with dplyr, we would run something like

customers <- dbGetQuery(con, "SELECT * FROM customers")
orders <- dbGetQuery(con, "SELECT * FROM orders")

anti_join(customers, orders, by = "id")

##   row_names id   name
## 1         3 15  Mason
## 2         4 16 Jordan
## 3         5 23  Patil

Anti-joins don’t exist directly in SQL, but there is a way we can deploy them. Start by recognizing that what we are looking for is all customers without a corresponding ID in orders. A left join on customers will allow us to see which these are

dbGetQuery(con, "select * from customers as c
           LEFT JOIN orders as o ON o.id=c.id")

##   row_names id    name row_names order id   date
## 1         1  4   Tukey         1     1  4 Jan-01
## 2         2  8 Wickham         2     2  8 Feb-01
## 3         6 42     Cox         3     3 42 Apr-15
## 4         1  4   Tukey         5    11  4 Apr-18
## 5         2  8 Wickham         6    12  8 Feb-11
## 6         6 42     Cox         7    13 42 Jul-15
## 7         3 15   Mason      <NA>    NA NA   <NA>
## 8         4 16  Jordan      <NA>    NA NA   <NA>
## 9         5 23   Patil      <NA>    NA NA   <NA>

We see just as above, there are three names without corresponding orders: Mason, Jordan, and Patil. To select only those with a missing ID in orders, we can use a WHERE statement

dbGetQuery(con, "select * from customers as c
           LEFT JOIN orders as o ON o.id=c.id
           WHERE o.id IS NULL")

##   row_names id   name row_names order id date
## 1         3 15  Mason      <NA>    NA NA <NA>
## 2         4 16 Jordan      <NA>    NA NA <NA>
## 3         5 23  Patil      <NA>    NA NA <NA>

Then, to extract only the names, we need to modify our SELECT statement to only grab the variable name from the table customers (aliased as c)

## Ta-da!
dbGetQuery(con, "select c.name from customers as c
           LEFT JOIN orders as o ON o.id=c.id
           WHERE o.id IS NULL")

##     name
## 1  Mason
## 2 Jordan
## 3  Patil

Joins Exercises

Here are a few exercises covering just what we’ve learned. For these, we will be using students, enrollments, and courses (not course)

dbListTables(con)

##  [1] "Batting"            "Iowa_homes_1"       "Iowa_homes_2"      
##  [4] "MLBPlayers"         "MLBTeams"           "Police_shootings_1"
##  [7] "Police_shootings_2" "batting2"           "course"            
## [10] "course_offering"    "courses"            "customers"         
## [13] "enrollments"        "office_hour"        "orders"            
## [16] "players2"           "professor_info"     "students"          
## [19] "syllabus_info"      "teams2"

Problem 3: Identify the all unique student IDs for students who are enrolled in at least one course

Problem 4: Identify all of the courses (by course name) that have at least one student enrolled

Problem 5: How do you think you join on multiple tables? Try to produce the following output showing each student and the name of the courses in which they’re enrolled. Note also the change in variable names

##   Student_Name                Course_Name
## 1        Alice        Intro to Statistics
## 2        Alice             Linear Algebra
## 3          Bob        Intro to Statistics
## 4          Bob               Econometrics
## 5      Charlie               Econometrics
## 6        Fiona Intro to Political Science
## 7       George             Linear Algebra

Problem 6: Return the names and majors of all students not enrolled in any classes.

A Quick Note on Computation

This snippet should (and will be) placed at the top of this document, but as you have already read past that, I will include it here.

In dplyr, we understand that piped functions are executed in the order they are written. That is, if we wish to mutate, group by then summarize, we would write them in that order

## Fake data, don't run this it won't work
df %>% mutate(newvar = var) %>% 
  group_by(grp) %>% 
  summarize(mean(newvar))

SQL doesn’t work like this. We will always write our SQL commands in this order:

## Written order
"SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ..."

However, it will always be executed in this order:

Do not memorize this. But, if you run into issues below, it may be a helpful thing to come back to.

Aggregation Functions

Just as is the case with dplyr functions, in addition to joining and filtering data, we often want to aggregate or summarize our data in some way. When searching for these types of functions online or with the reference sheet, these functions are typically listed under aggregation functions.

Aggregation functions are included after the SELECT command, specifying that we want to select or return an aggregated column from our query. For example, the count the number of orders from the orders dataset, we would use COUNT(*), which will count the total number of rows in our table. AS, as before, allows us to rename the column before returning.

dbGetQuery(con, "SELECT COUNT(*) AS total FROM orders")

##   total
## 1     8

When using COUNT with a specific column or variable name, it will instead return the number of non-empty counts

## Table of joins
dbGetQuery(con, "select * from customers as c
           LEFT JOIN orders as o ON o.id=c.id")

##   row_names id    name row_names order id   date
## 1         1  4   Tukey         1     1  4 Jan-01
## 2         2  8 Wickham         2     2  8 Feb-01
## 3         6 42     Cox         3     3 42 Apr-15
## 4         1  4   Tukey         5    11  4 Apr-18
## 5         2  8 Wickham         6    12  8 Feb-11
## 6         6 42     Cox         7    13 42 Jul-15
## 7         3 15   Mason      <NA>    NA NA   <NA>
## 8         4 16  Jordan      <NA>    NA NA   <NA>
## 9         5 23   Patil      <NA>    NA NA   <NA>

## How many non-empty order numbers
dbGetQuery(con, "select COUNT(o.order) as num_orders from customers as c
           LEFT JOIN orders as o ON o.id=c.id")

##   num_orders
## 1          6

Other functions work as you might expect. Here we use AVG to find the average of the variable row_names (admittedly, this is rather unmotivated)

dbGetQuery(con, "SELECT AVG(row_names) as avg_row FROM orders")

##   avg_row
## 1     4.5

Like dplyr, these commands are typically associated with a GROUP BY operation, allowing us to perform aggregating functions according to a group. We simply specify the name of the group on which we wish to aggregate.

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders FROM orders
           GROUP BY id")

##   total_orders
## 1            2
## 2            2
## 3            2
## 4            2

Somewhat unconventionally (for us), GROUP BY is included after filtering a dataset. From a query standpoint, this makes sense: aggregating data can be computationally expensive, and one can greatly reduce the time of long queries by subsetting the data first, then performing functions on it.

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders FROM orders
           WHERE row_names > 2
           GROUP BY id")

##   total_orders
## 1            1
## 2            1
## 3            2
## 4            2

Depending on the size of our query, we may want to grab only the top N of a dataset. We can do this by first ordering the data then specifying how much of it we want the query to return. This is done with ORDER BY and LIMIT. ORDER BY is always the last command in a SQL query.

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2
           GROUP BY id
           ORDER BY total_orders")

##   total_orders id
## 1            1  4
## 2            1  8
## 3            2 42
## 4            2 50

Just as sort() in R takes a descending argument, we can change the order in a SQL query with DESC

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2
           GROUP BY id
           ORDER BY total_orders DESC")

##   total_orders id
## 1            2 42
## 2            2 50
## 3            1  4
## 4            1  8

Finally, if we wish to only retain the first N rows, we can do so with LIMIT

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2
           GROUP BY id
           ORDER BY total_orders DESC
           LIMIT 2")

##   total_orders id
## 1            2 42
## 2            2 50

Problem 7: Suppose we wanted to retain the IDs with the largest two orders (like we did just above) without also returning the total_orders column (that is, we just want to return the ID). How might we do this?

Problem 8: Revisit your code from Problem 3, and expand upon this to return the number of students who are enrolled in at least one course.

Problem 9: The tables batting2 and players2 include MLB batting and player information from the 2010 season or later, respectively. Only considering the 2020 season and later, return the given name of the top 10 MLB players with the most homeruns. It should look something like this: (your query may take a few seconds to run. Also, you may ignore warning about importing as numeric)

##    total_hr           name
## 1       196    Aaron James
## 2       178         Shohei
## 3       163    Kyle Joseph
## 4       157   Peter Morgan
## 5       156   Matthew Kent
## 6       136   Yordan Ruben
## 7       136       Vladimir
## 8       132      Juan Jose
## 9       128   Jose Enrique
## 10      127 Michael Austin

HAVING and Subqueries

There are two other topics to help us round off our SQL journey. The first of these introduces HAVING, a filter similar to WHERE that happens after aggregation has taken place.

For example, consider the situation above where we found the total number of orders from each ID

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2 
           GROUP BY id
           ORDER BY total_orders")

##   total_orders id
## 1            1  4
## 2            1  8
## 3            2 42
## 4            2 50

If we only wanted to retain IDs with more than 1 order, we would want to filter the total_orders variable. However WHERE executes before COUNMT(orders.order) is evaluated. Trying to filter on this results in an error

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2 AND total_orders > 1
           GROUP BY id
           ORDER BY total_orders")

## Error in dbSendQuery(conn, statement, ...): could not run statement: Unknown column 'total_orders' in 'WHERE'

HAVING is the second to last SQL command, a final filter on any query prior to being sorted. As such, we include it prior to ORDER BY

dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
           WHERE row_names > 2
           GROUP BY id
           HAVING total_orders > 1
           ORDER BY total_orders")

##   total_orders id
## 1            2 42
## 2            2 50

Like WHERE, having can use a number of logical qualifiers. We have already introduced AND, NOT, and IN. Also helpful are OR and BETWEEN (observe that when no aggregation takes place, HAVING functions identically to WHERE)

dbGetQuery(con, "SELECT * from customers
           HAVING id BETWEEN 15 AND 23 OR row_names = 6")

##   row_names id   name
## 1         3 15  Mason
## 2         4 16 Jordan
## 3         5 23  Patil
## 4         6 42    Cox

Problem 10: Why does the following query fail? What can you do to fix it?

dbGetQuery(con, "SELECT COUNT(orders.order), id from orders
           GROUP BY id
           HAVING row_names BETWEEN 2 AND 6 OR id = 50")

## Error in dbSendQuery(conn, statement, ...): could not run statement: Unknown column 'row_names' in 'HAVING'

Finally, we consider the topic of subqueries. A subquery is a summarized result set that is used as a part of a larger query. For example, suppose using our courses database we wish to find all of the students enrolled in at least two coures. Up to this point, we would first join students with enrollment data then keep those with two or more entries

## This shows us all the students from the join
dbGetQuery(con, "SELECT * FROM students AS s
           LEFT JOIN enrollments AS e on e.student_id = s.student_id")

##    row_names student_id    name            major row_names student_id course_id
## 1          1       S001   Alice       Statistics         1       S001      C101
## 2          1       S001   Alice       Statistics         2       S001      C102
## 3          2       S002     Bob      Mathematics         3       S002      C101
## 4          2       S002     Bob      Mathematics         4       S002      C103
## 5          3       S003 Charlie Computer Science         5       S003      C103
## 6          6       S007   Fiona          Biology         7       S007      C104
## 7          7       S008  George      Mathematics         8       S008      C102
## 8          7       S008  George      Mathematics         9       S008      C105
## 9          4       S004   Diana       Statistics      <NA>       <NA>      <NA>
## 10         5       S005    Evan          Physics      <NA>       <NA>      <NA>
##    semester
## 1      Fall
## 2      Fall
## 3      Fall
## 4      Fall
## 5      Fall
## 6      Fall
## 7      Fall
## 8      Fall
## 9      <NA>
## 10     <NA>

## Here we use HAVING to aggregate on the resulting set
dbGetQuery(con, "SELECT s.* FROM students AS s
           LEFT JOIN enrollments AS e on e.student_id = s.student_id
           GROUP BY s.student_id
           HAVING COUNT(*) > 1")

##   row_names student_id   name       major
## 1         1       S001  Alice  Statistics
## 2         2       S002    Bob Mathematics
## 3         7       S008 George Mathematics

This works, but we can make our intent more clear by creating a query more consistent with our goal as we would typically express it in English: “select all rows from students table that are in the enrollment table more than twice”

dbGetQuery(con, "SELECT * from students as s
           WHERE student_id IN (
              SELECT student_id
              FROM enrollments
              GROUP BY student_id
              HAVING COUNT(*) > 1
           )")

##   row_names student_id   name       major
## 1         1       S001  Alice  Statistics
## 2         2       S002    Bob Mathematics
## 3         7       S008 George Mathematics

This syntax of “find students IN some set” makes clear what our goal is. Note also that while the indenting here is optional, the parenthesis are not.

There is one oddity when using subqueries to be aware of: if I am pulling a variable from a subquery, I need to give that subquery an alias since it does not exist as a table. SQL insists that this be true. For this contrived example, suppose I want to take the average (by ID) of 2 times the row names from a subquery. This won’t work:

dbGetQuery(con, "SELECT AVG(double_row) FROM (
            SELECT 2*row_names AS double_row, id FROM orders
            )
           GROUP BY id")

## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'GROUP BY id' at line 4

What I need to do is give an alias to the table generated between the parenthesis

dbGetQuery(con, "SELECT AVG(double_row), id FROM (
            SELECT 2*row_names AS double_row, id FROM orders
            ) AS new_table
           GROUP BY id")

##   AVG(double_row) id
## 1               6  4
## 2               8  8
## 3              10 42
## 4              12 50

This appears to only be the case when using FROM; you can SELECT all you wish from subqueries without ever having to give them a name.

Arithmetic

One final thing of note is that basic arithmetic in the construction of new variables works very similar to R

## This shows us everything in orders
dbGetQuery(con, "SELECT * from orders")

##   row_names order id   date
## 1         1     1  4 Jan-01
## 2         2     2  8 Feb-01
## 3         3     3 42 Apr-15
## 4         4     4 50 Apr-17
## 5         5    11  4 Apr-18
## 6         6    12  8 Feb-11
## 7         7    13 42 Jul-15
## 8         8    14 50 Jun-17

## Here is how we would do arithmetic
dbGetQuery(con, "SELECT
           row_names + 3*orders.order / id AS new_val, orders.*
           FROM orders")

##   new_val row_names order id   date
## 1  1.7500         1     1  4 Jan-01
## 2  2.7500         2     2  8 Feb-01
## 3  3.2143         3     3 42 Apr-15
## 4  4.2400         4     4 50 Apr-17
## 5 13.2500         5    11  4 Apr-18
## 6 10.5000         6    12  8 Feb-11
## 7  7.9286         7    13 42 Jul-15
## 8  8.8400         8    14 50 Jun-17

There are a few functions built into SQL, but more typically we will reserve complex data analysis and manipulation to R, leaving SQL to handle large-scale queries on the server. Here are a few others

dbGetQuery(con, "SELECT
           POWER(row_names, 2) AS powers, 
           ABS(orders.order - id) AS abs_val,
           MOD(row_names, orders.order) AS mod_remainders
           FROM orders")

##   powers abs_val mod_remainders
## 1      1       3              0
## 2      4       6              0
## 3      9      39              0
## 4     16      46              0
## 5     25       7              5
## 6     36       4              6
## 7     49      29              7
## 8     64      36              8

Extracises

Problem 11: In Homework 2, Question 3, you were asked to find the slugging percentage for each team from 1969 and return the top 10. Rewrite that query in SQL syntax, pulling instead from the MLBTeams table on the database. You should end up with the following query

##    yearID teamID    SLG
## 1    2023    ATL 0.5008
## 2    2019    HOU 0.4955
## 3    2019    MIN 0.4941
## 4    2003    BOS 0.4909
## 5    2019    NYA 0.4899
## 6    1997    SEA 0.4845
## 7    1994    CLE 0.4838
## 8    1996    SEA 0.4836
## 9    2001    COL 0.4830
## 10   2020    LAN 0.4829

Problem 12: Run a query to find the ten players with at least 5 career at-bats (AB) (that is, not 5 AB per season) with the highest home run percentage (home runs divided by at bats) from the Batting table

Problem 13: Run a query to determine how many players with at least 5 at-bats (AB) have a career hit percentage (total hits (H) divided by at bats) of at least 0.3. Confirm that it is 647

Problem 14: Run a query to find all players whose total career hits exceeds the average total career hitsk across all players since 1980

Problem 15: Determine the number of players who have at least 5 seasons in which they hit 20 or more home runs since 1980

Lab 10 – Structured Query Langauge (SQL)