This lab serves as an introduction to SQL (Structured Query Language) and how SQL is often implemented in R
Data are often stored in databases, or organized collections of structured information (data). Most databases are a collection of tables of data like those we’ve been working with so far in class. For this class, and most applications in general, we will work with Relational Databases. A system that manages a relational database is known (surprisingly) as an relational database management system (RDBMS). I bring this up only because you will see the term RDBMS thrown around a lot in SQL-land.
The essential component of a Relational Database is its relational structure. A Relational Database consists of a number of tables. Each Table consists of columns of attributes and unique rows of information. Tables may have a “Primary Key” which is a unique identifier. They may also have “Foreign Keys” which are keys that can be used to match to other tables. It is best practice that the first column be the Primary Key if it exists.
The combination of a Primary and Foreign Key is a relation and where the name comes from. The tables can be related to each other to store, access, and modify data more efficiently.
In effect, this will work nearly identically to our joins
lab, where instead of having the data stored locally, it will be
kept (and processed) on an RDBMS server, and instead of
dplyr, we will be using syntax from SQL.
Copy and run the line below in your console (do not put in your Rmd file) to install the packages necessary for this lab:
install.packages(c("DBI", "odbc", "RMySQL", "dbplyr"))
library(DBI)
library(odbc)
library(RMySQL)
library(dbplyr)
We start by establishing a connection to an active RDBMS server
hosted here on Grinnell’s campus (note: you must be on campus wi-fi to
do this). We can name this connection whatever we want (in this case, we
call it con), and all of our SQL operations will be
performed in the context of this connection.
con <- dbConnect(
MySQL(),
dbname = "STA230",
host = "aiken.cs.grinnell.edu",
port = 3306,
user = "student",
password = "password!"
)
Once we have finished working with our connection, it’s good practice to disconnect using
## Don't run this until you're finished
dbDisconnect(con)
There are only two functions we’ll need today:
dbListTables(), which will list all of the tables in our
relational database and dbGetQuery() which will execute a
query on the SQL server. We’ll start with the former, showing us which
tables we have to work with, passing in the connection we have already
established
dbListTables(con)
## [1] "Batting" "Iowa_homes_1" "Iowa_homes_2"
## [4] "MLBPlayers" "MLBTeams" "Police_shootings_1"
## [7] "Police_shootings_2" "batting2" "course"
## [10] "course_offering" "courses" "customers"
## [13] "enrollments" "office_hour" "orders"
## [16] "players2" "professor_info" "students"
## [19] "syllabus_info" "teams2"
This is a poor example of what a relational database may look like in
real life, as they contain a number of tables that have nothing to do
with each other. However, they’ll serve for the time being. For our
illustrations, we will continue using two of these tables,
customers and orders.
Here is a helpful reference sheet for common queries and commands in SQL. A short list (without examples) is also given below
A quick note on computation:
In dplyr, we understand that piped functions are
executed in the order they are written. That is, if we wish to mutate,
group by then summarize, we would write them in that order
## Fake data, don't run this it won't work
df %>% mutate(newvar = var) %>%
group_by(grp) %>%
summarize(mean(newvar))
SQL doesn’t work like this. We will always write our SQL commands in this order:
## Written order
"SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ..."
However, it will always be executed in this order:
\[
\text{FROM} \rightarrow \text{WHERE} \rightarrow \text{GROUP BY}
\rightarrow \text{HAVING} \rightarrow \text{SELECT} \rightarrow
\text{ORDER BY}
\] From a computational aspect, we can sort of reason it out.
FROM (including joins) and WHERE start by
selecting the tables we need and filtering them accordingly. This also
includes any aliasing created with AS. The ensures that we
are working on the smallest dataset we need. HAVING is a
filter like WHERE that works on data that has already been
summarized (more on this shortly), SELECT allows you to
select whichever columns we need after the rest of the query has been
resolved. Finally, ORDER BY takes the final query and sorts
it.
Do not memorize this. But, if you run into issues below, it may be a helpful thing to come back to.
Let’s begin by getting some basic information from one of our tables.
We do this with the SQL command DESCRIBE. All we need to do
this is the name of the connection and the table we wish to
describe:
dbGetQuery(con, "DESCRIBE orders")
## Field Type Null Key Default Extra
## 1 row_names text YES <NA>
## 2 order bigint(20) YES <NA>
## 3 id bigint(20) YES <NA>
## 4 date text YES <NA>
This returns a collection of column names (Field) and the type of value (Type). The (Null) column indicates if a column is allowed to contain null values (yes by default), (Key) if a key has been established in the RBDMS (it hasn’t), (Default) describes how new data is handled when adding rows (we don’t need this now), and (Extra) describes extra attributes a column may have (also not needed). In any event, we see we have four columns, two of type text and two of type integer.
We can run a query and return a data.frame using the syntax
SELECT [something] FROM [table name]. The
[something] can be a comma separated list of variables
names or we can use * to select them all
## Select all columns from orders table
dbGetQuery(con, "SELECT * FROM orders")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 3 3 42 Apr-15
## 4 4 4 50 Apr-17
## 5 5 11 4 Apr-18
## 6 6 12 8 Feb-11
## 7 7 13 42 Jul-15
## 8 8 14 50 Jun-17
## Select date and id from orders table
dbGetQuery(con, "SELECT date, id FROM orders")
## date id
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18 4
## 6 Feb-11 8
## 7 Jul-15 42
## 8 Jun-17 50
We can run into problems, though, if we have column names that are
associated with SQL commands. For example, ORDER is a
command in SQL as well as a name in this table. If I try to select it,
I’ll get an error
## Select order from orders table
dbGetQuery(con, "SELECT order FROM orders")
## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'order FROM orders' at line 1
Again, because SQL is case insensitive, the fact we have
order in lowercase doesn’t matter. We can get around this
by specifying from which table we want to grab the order
variable from the table orders using a . with
the syntax TABLE.VAR_NAME
## Select order from orders table
dbGetQuery(con, "SELECT orders.order FROM orders")
## order
## 1 1
## 2 2
## 3 3
## 4 4
## 5 11
## 6 12
## 7 13
## 8 14
I can combine queries with different filtering functions, though the order matters. For example, to get distinct IDs from the order list, I would use
dbGetQuery(con, "SELECT DISTINCT id FROM orders")
## id
## 1 4
## 2 8
## 3 42
## 4 50
whereas to put a proper filter on this, I would use
WHERE
dbGetQuery(con, "SELECT * FROM orders WHERE id<10")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 5 11 4 Apr-18
## 4 6 12 8 Feb-11
We can select from a list of values using IN syntax,
where the list is a parenthetical comma separated list
dbGetQuery(con, "SELECT * FROM orders WHERE id IN (4, 8, 42)")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 3 3 42 Apr-15
## 4 5 11 4 Apr-18
## 5 6 12 8 Feb-11
## 6 7 13 42 Jul-15
Problem 1: First, display the variables that are
available in the customers table. Return the
id and name columns in all instances where the
row names are less than 4
One of the most common operations when working with SQL and tables is
to perform a join. Joins are called using JOIN and
ON to tell us what type of join we wish and how we want to
match them
dbGetQuery(con, "SELECT * from orders
LEFT JOIN customers ON orders.id = customers.id")
## row_names order id date row_names id name
## 1 1 1 4 Jan-01 1 4 Tukey
## 2 5 11 4 Apr-18 1 4 Tukey
## 3 2 2 8 Feb-01 2 8 Wickham
## 4 6 12 8 Feb-11 2 8 Wickham
## 5 3 3 42 Apr-15 6 42 Cox
## 6 7 13 42 Jul-15 6 42 Cox
## 7 4 4 50 Apr-17 <NA> NA <NA>
## 8 8 14 50 Jun-17 <NA> NA <NA>
Ok, what’s happening here:
orders.id = customers.id tells us which variables from
which table with the table.varname syntaxLEFT JOIN tells us left joinSELECT * we are retaining all columns
from both tables. Probably not what we wantIf all I want is the customer name from the customers table, I need to specify that I want everything from orders and only the customer name form customers. That would look something like this
dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
LEFT JOIN customers ON orders.id = customers.id")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 7 13 42 Jul-15 Cox
## 7 4 4 50 Apr-17 <NA>
## 8 8 14 50 Jun-17 <NA>
It is no problem, then, to combine this with things we covered previously. For example, let’s take this join and only keep those with row_names less than 7
dbGetQuery(con, "SELECT orders.*, customers.name FROM orders
LEFT JOIN customers ON orders.id = customers.id
WHERE orders.row_names<7")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 4 4 50 Apr-17 <NA>
Magic.
This brings us to the topic of aliasing. Aliasing involves assigning
another name to either a variable in a table or a table itself. Aliasing
can be done with AS
## Change column names to all caps
dbGetQuery(con, "SELECT date AS DATE, id AS ID FROM orders AS o")
## DATE ID
## 1 Jan-01 4
## 2 Feb-01 8
## 3 Apr-15 42
## 4 Apr-17 50
## 5 Apr-18 4
## 6 Feb-11 8
## 7 Jul-15 42
## 8 Jun-17 50
In this case, we see the column names have changed, but there has
been no apparent use of the alias for orders. The utility
of this becomes clearer when we consider its use in joins
dbGetQuery(con, "SELECT * FROM orders AS o
LEFT JOIN customers AS c ON o.id = c.id")
## row_names order id date row_names id name
## 1 1 1 4 Jan-01 1 4 Tukey
## 2 5 11 4 Apr-18 1 4 Tukey
## 3 2 2 8 Feb-01 2 8 Wickham
## 4 6 12 8 Feb-11 2 8 Wickham
## 5 3 3 42 Apr-15 6 42 Cox
## 6 7 13 42 Jul-15 6 42 Cox
## 7 4 4 50 Apr-17 <NA> NA <NA>
## 8 8 14 50 Jun-17 <NA> NA <NA>
Here, by aliasing tables as o and c
respectively, we can call o.id = c.id rather than
order.id = customer.id. Whether this was useful in this
case is besides the point; aliasing is common in the wild and it’s
important to see its syntax. When we use it in conjunction with joins or
filters, the alias carries throughout. Here is the same query we ran
before
dbGetQuery(con, "SELECT o.*, c.name FROM orders AS o
LEFT JOIN customers AS c ON o.id = c.id
WHERE o.row_names<7")
## row_names order id date name
## 1 1 1 4 Jan-01 Tukey
## 2 5 11 4 Apr-18 Tukey
## 3 2 2 8 Feb-01 Wickham
## 4 6 12 8 Feb-11 Wickham
## 5 3 3 42 Apr-15 Cox
## 6 4 4 50 Apr-17 <NA>
Problem 2: Write a SQL query that performs the following:
Your output should look like this
## ID DATE ORDERS
## 1 4 Jan-01 1
## 2 4 Apr-18 11
## 3 8 Feb-01 2
## 4 8 Feb-11 12
## 5 50 Apr-17 4
## 6 50 Jun-17 14
Note that we also have RIGHT JOIN and
INNER JOIN available to use as appropriate
Recall that anti-joins are used to identify implicit missing values: which entries do I have that are missing values in a corresponding table?
For example, suppose we wanted to find all of the customers in our database who haven’t placed an order. To do this with dplyr, we would run something like
customers <- dbGetQuery(con, "SELECT * FROM customers")
orders <- dbGetQuery(con, "SELECT * FROM orders")
anti_join(customers, orders, by = "id")
## row_names id name
## 1 3 15 Mason
## 2 4 16 Jordan
## 3 5 23 Patil
Anti-joins don’t exist directly in SQL, but there is a way we can
deploy them. Start by recognizing that what we are looking for is all
customers without a corresponding ID in orders. A left join on
customers will allow us to see which these are
dbGetQuery(con, "select * from customers as c
LEFT JOIN orders as o ON o.id=c.id")
## row_names id name row_names order id date
## 1 1 4 Tukey 1 1 4 Jan-01
## 2 2 8 Wickham 2 2 8 Feb-01
## 3 6 42 Cox 3 3 42 Apr-15
## 4 1 4 Tukey 5 11 4 Apr-18
## 5 2 8 Wickham 6 12 8 Feb-11
## 6 6 42 Cox 7 13 42 Jul-15
## 7 3 15 Mason <NA> NA NA <NA>
## 8 4 16 Jordan <NA> NA NA <NA>
## 9 5 23 Patil <NA> NA NA <NA>
We see just as above, there are three names without corresponding
orders: Mason, Jordan, and Patil. To select only those with a missing ID
in orders, we can use a WHERE statement
dbGetQuery(con, "select * from customers as c
LEFT JOIN orders as o ON o.id=c.id
WHERE o.id IS NULL")
## row_names id name row_names order id date
## 1 3 15 Mason <NA> NA NA <NA>
## 2 4 16 Jordan <NA> NA NA <NA>
## 3 5 23 Patil <NA> NA NA <NA>
Then, to extract only the names, we need to modify our
SELECT statement to only grab the variable
name from the table customers (aliased as
c)
## Ta-da!
dbGetQuery(con, "select c.name from customers as c
LEFT JOIN orders as o ON o.id=c.id
WHERE o.id IS NULL")
## name
## 1 Mason
## 2 Jordan
## 3 Patil
Here are a few exercises covering just what we’ve learned. For these,
we will be using students, enrollments, and
courses (not course)
dbListTables(con)
## [1] "Batting" "Iowa_homes_1" "Iowa_homes_2"
## [4] "MLBPlayers" "MLBTeams" "Police_shootings_1"
## [7] "Police_shootings_2" "batting2" "course"
## [10] "course_offering" "courses" "customers"
## [13] "enrollments" "office_hour" "orders"
## [16] "players2" "professor_info" "students"
## [19] "syllabus_info" "teams2"
Problem 3: Identify the all unique student IDs for students who are enrolled in at least one course
Problem 4: Identify all of the courses (by course name) that have at least one student enrolled
Problem 5: How do you think you join on multiple tables? Try to produce the following output showing each student and the name of the courses in which they’re enrolled. Note also the change in variable names
## Student_Name Course_Name
## 1 Alice Intro to Statistics
## 2 Alice Linear Algebra
## 3 Bob Intro to Statistics
## 4 Bob Econometrics
## 5 Charlie Econometrics
## 6 Fiona Intro to Political Science
## 7 George Linear Algebra
Problem 6: Return the names and majors of all students not enrolled in any classes.
This snippet should (and will be) placed at the top of this document, but as you have already read past that, I will include it here.
In dplyr, we understand that piped functions are
executed in the order they are written. That is, if we wish to mutate,
group by then summarize, we would write them in that order
## Fake data, don't run this it won't work
df %>% mutate(newvar = var) %>%
group_by(grp) %>%
summarize(mean(newvar))
SQL doesn’t work like this. We will always write our SQL commands in this order:
## Written order
"SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ..."
However, it will always be executed in this order:
\[
\text{FROM} \rightarrow \text{WHERE} \rightarrow \text{GROUP BY}
\rightarrow \text{HAVING} \rightarrow \text{SELECT} \rightarrow
\text{ORDER BY}
\] From a computational aspect, we can sort of reason it out.
FROM (including joins) and WHERE start by
selecting the tables we need and filtering them accordingly. This also
includes any aliasing created with AS. The ensures that we
are working on the smallest dataset we need. HAVING is a
filter like WHERE that works on data that has already been
summarized (more on this shortly), SELECT allows you to
select whichever columns we need after the rest of the query has been
resolved. Finally, ORDER BY takes the final query and sorts
it.
Do not memorize this. But, if you run into issues below, it may be a helpful thing to come back to.
Just as is the case with dplyr functions, in addition to
joining and filtering data, we often want to aggregate or summarize our
data in some way. When searching for these types of functions online or
with the reference
sheet, these functions are typically listed under aggregation
functions.
Aggregation functions are included after the SELECT
command, specifying that we want to select or return an aggregated
column from our query. For example, the count the number of orders from
the orders dataset, we would use COUNT(*),
which will count the total number of rows in our table. AS,
as before, allows us to rename the column before returning.
dbGetQuery(con, "SELECT COUNT(*) AS total FROM orders")
## total
## 1 8
When using COUNT with a specific column or variable
name, it will instead return the number of non-empty counts
## Table of joins
dbGetQuery(con, "select * from customers as c
LEFT JOIN orders as o ON o.id=c.id")
## row_names id name row_names order id date
## 1 1 4 Tukey 1 1 4 Jan-01
## 2 2 8 Wickham 2 2 8 Feb-01
## 3 6 42 Cox 3 3 42 Apr-15
## 4 1 4 Tukey 5 11 4 Apr-18
## 5 2 8 Wickham 6 12 8 Feb-11
## 6 6 42 Cox 7 13 42 Jul-15
## 7 3 15 Mason <NA> NA NA <NA>
## 8 4 16 Jordan <NA> NA NA <NA>
## 9 5 23 Patil <NA> NA NA <NA>
## How many non-empty order numbers
dbGetQuery(con, "select COUNT(o.order) as num_orders from customers as c
LEFT JOIN orders as o ON o.id=c.id")
## num_orders
## 1 6
Other functions work as you might expect. Here we use
AVG to find the average of the variable
row_names (admittedly, this is rather unmotivated)
dbGetQuery(con, "SELECT AVG(row_names) as avg_row FROM orders")
## avg_row
## 1 4.5
Like dplyr, these commands are typically associated with a
GROUP BY operation, allowing us to perform aggregating
functions according to a group. We simply specify the name of the group
on which we wish to aggregate.
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders FROM orders
GROUP BY id")
## total_orders
## 1 2
## 2 2
## 3 2
## 4 2
Somewhat unconventionally (for us), GROUP BY is included
after filtering a dataset. From a query standpoint, this makes
sense: aggregating data can be computationally expensive, and one can
greatly reduce the time of long queries by subsetting the data first,
then performing functions on it.
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders FROM orders
WHERE row_names > 2
GROUP BY id")
## total_orders
## 1 1
## 2 1
## 3 2
## 4 2
Depending on the size of our query, we may want to grab only the top
N of a dataset. We can do this by first ordering the data then
specifying how much of it we want the query to return. This is done with
ORDER BY and LIMIT. ORDER BY is
always the last command in a SQL query.
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2
GROUP BY id
ORDER BY total_orders")
## total_orders id
## 1 1 4
## 2 1 8
## 3 2 42
## 4 2 50
Just as sort() in R takes a descending argument, we can
change the order in a SQL query with DESC
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2
GROUP BY id
ORDER BY total_orders DESC")
## total_orders id
## 1 2 42
## 2 2 50
## 3 1 4
## 4 1 8
Finally, if we wish to only retain the first N rows, we can do so
with LIMIT
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2
GROUP BY id
ORDER BY total_orders DESC
LIMIT 2")
## total_orders id
## 1 2 42
## 2 2 50
Problem 7: Suppose we wanted to retain the IDs with
the largest two orders (like we did just above) without also returning
the total_orders column (that is, we just want to
return the ID). How might we do this?
Problem 8: Revisit your code from Problem 3, and expand upon this to return the number of students who are enrolled in at least one course.
Problem 9: The tables batting2 and
players2 include MLB batting and player information from
the 2010 season or later, respectively. Only considering the 2020 season
and later, return the given name of the top 10 MLB players with
the most homeruns. It should look something like this: (your query may
take a few seconds to run. Also, you may ignore warning about importing
as numeric)
## total_hr name
## 1 196 Aaron James
## 2 178 Shohei
## 3 163 Kyle Joseph
## 4 157 Peter Morgan
## 5 156 Matthew Kent
## 6 136 Yordan Ruben
## 7 136 Vladimir
## 8 132 Juan Jose
## 9 128 Jose Enrique
## 10 127 Michael Austin
There are two other topics to help us round off our SQL journey. The
first of these introduces HAVING, a filter similar to
WHERE that happens after aggregation has taken
place.
For example, consider the situation above where we found the total number of orders from each ID
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2
GROUP BY id
ORDER BY total_orders")
## total_orders id
## 1 1 4
## 2 1 8
## 3 2 42
## 4 2 50
If we only wanted to retain IDs with more than 1 order, we would want
to filter the total_orders variable. However
WHERE executes before COUNMT(orders.order) is
evaluated. Trying to filter on this results in an error
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2 AND total_orders > 1
GROUP BY id
ORDER BY total_orders")
## Error in dbSendQuery(conn, statement, ...): could not run statement: Unknown column 'total_orders' in 'WHERE'
HAVING is the second to last SQL command, a final filter
on any query prior to being sorted. As such, we include it prior to
ORDER BY
dbGetQuery(con, "SELECT COUNT(orders.order) as total_orders, id FROM orders
WHERE row_names > 2
GROUP BY id
HAVING total_orders > 1
ORDER BY total_orders")
## total_orders id
## 1 2 42
## 2 2 50
Like WHERE, having can use a number of logical
qualifiers. We have already introduced AND,
NOT, and IN. Also helpful are OR
and BETWEEN (observe that when no aggregation takes place,
HAVING functions identically to WHERE)
dbGetQuery(con, "SELECT * from customers
HAVING id BETWEEN 15 AND 23 OR row_names = 6")
## row_names id name
## 1 3 15 Mason
## 2 4 16 Jordan
## 3 5 23 Patil
## 4 6 42 Cox
Problem 10: Why does the following query fail? What can you do to fix it?
dbGetQuery(con, "SELECT COUNT(orders.order), id from orders
GROUP BY id
HAVING row_names BETWEEN 2 AND 6 OR id = 50")
## Error in dbSendQuery(conn, statement, ...): could not run statement: Unknown column 'row_names' in 'HAVING'
Finally, we consider the topic of subqueries. A subquery is a summarized result set that is used as a part of a larger query. For example, suppose using our courses database we wish to find all of the students enrolled in at least two coures. Up to this point, we would first join students with enrollment data then keep those with two or more entries
## This shows us all the students from the join
dbGetQuery(con, "SELECT * FROM students AS s
LEFT JOIN enrollments AS e on e.student_id = s.student_id")
## row_names student_id name major row_names student_id course_id
## 1 1 S001 Alice Statistics 1 S001 C101
## 2 1 S001 Alice Statistics 2 S001 C102
## 3 2 S002 Bob Mathematics 3 S002 C101
## 4 2 S002 Bob Mathematics 4 S002 C103
## 5 3 S003 Charlie Computer Science 5 S003 C103
## 6 6 S007 Fiona Biology 7 S007 C104
## 7 7 S008 George Mathematics 8 S008 C102
## 8 7 S008 George Mathematics 9 S008 C105
## 9 4 S004 Diana Statistics <NA> <NA> <NA>
## 10 5 S005 Evan Physics <NA> <NA> <NA>
## semester
## 1 Fall
## 2 Fall
## 3 Fall
## 4 Fall
## 5 Fall
## 6 Fall
## 7 Fall
## 8 Fall
## 9 <NA>
## 10 <NA>
## Here we use HAVING to aggregate on the resulting set
dbGetQuery(con, "SELECT s.* FROM students AS s
LEFT JOIN enrollments AS e on e.student_id = s.student_id
GROUP BY s.student_id
HAVING COUNT(*) > 1")
## row_names student_id name major
## 1 1 S001 Alice Statistics
## 2 2 S002 Bob Mathematics
## 3 7 S008 George Mathematics
This works, but we can make our intent more clear by creating a query more consistent with our goal as we would typically express it in English: “select all rows from students table that are in the enrollment table more than twice”
dbGetQuery(con, "SELECT * from students as s
WHERE student_id IN (
SELECT student_id
FROM enrollments
GROUP BY student_id
HAVING COUNT(*) > 1
)")
## row_names student_id name major
## 1 1 S001 Alice Statistics
## 2 2 S002 Bob Mathematics
## 3 7 S008 George Mathematics
This syntax of “find students IN some set” makes clear what our goal is. Note also that while the indenting here is optional, the parenthesis are not.
There is one oddity when using subqueries to be aware of: if I am pulling a variable from a subquery, I need to give that subquery an alias since it does not exist as a table. SQL insists that this be true. For this contrived example, suppose I want to take the average (by ID) of 2 times the row names from a subquery. This won’t work:
dbGetQuery(con, "SELECT AVG(double_row) FROM (
SELECT 2*row_names AS double_row, id FROM orders
)
GROUP BY id")
## Error in dbSendQuery(conn, statement, ...): could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'GROUP BY id' at line 4
What I need to do is give an alias to the table generated between the parenthesis
dbGetQuery(con, "SELECT AVG(double_row), id FROM (
SELECT 2*row_names AS double_row, id FROM orders
) AS new_table
GROUP BY id")
## AVG(double_row) id
## 1 6 4
## 2 8 8
## 3 10 42
## 4 12 50
This appears to only be the case when using FROM; you
can SELECT all you wish from subqueries without ever having
to give them a name.
One final thing of note is that basic arithmetic in the construction of new variables works very similar to R
## This shows us everything in orders
dbGetQuery(con, "SELECT * from orders")
## row_names order id date
## 1 1 1 4 Jan-01
## 2 2 2 8 Feb-01
## 3 3 3 42 Apr-15
## 4 4 4 50 Apr-17
## 5 5 11 4 Apr-18
## 6 6 12 8 Feb-11
## 7 7 13 42 Jul-15
## 8 8 14 50 Jun-17
## Here is how we would do arithmetic
dbGetQuery(con, "SELECT
row_names + 3*orders.order / id AS new_val, orders.*
FROM orders")
## new_val row_names order id date
## 1 1.7500 1 1 4 Jan-01
## 2 2.7500 2 2 8 Feb-01
## 3 3.2143 3 3 42 Apr-15
## 4 4.2400 4 4 50 Apr-17
## 5 13.2500 5 11 4 Apr-18
## 6 10.5000 6 12 8 Feb-11
## 7 7.9286 7 13 42 Jul-15
## 8 8.8400 8 14 50 Jun-17
There are a few functions built into SQL, but more typically we will reserve complex data analysis and manipulation to R, leaving SQL to handle large-scale queries on the server. Here are a few others
dbGetQuery(con, "SELECT
POWER(row_names, 2) AS powers,
ABS(orders.order - id) AS abs_val,
MOD(row_names, orders.order) AS mod_remainders
FROM orders")
## powers abs_val mod_remainders
## 1 1 3 0
## 2 4 6 0
## 3 9 39 0
## 4 16 46 0
## 5 25 7 5
## 6 36 4 6
## 7 49 29 7
## 8 64 36 8
Problem 11: In Homework 2,
Question 3, you were asked to find the slugging percentage for each
team from 1969 and return the top 10. Rewrite that query in SQL syntax,
pulling instead from the MLBTeams table on the database.
You should end up with the following query
## yearID teamID SLG
## 1 2023 ATL 0.5008
## 2 2019 HOU 0.4955
## 3 2019 MIN 0.4941
## 4 2003 BOS 0.4909
## 5 2019 NYA 0.4899
## 6 1997 SEA 0.4845
## 7 1994 CLE 0.4838
## 8 1996 SEA 0.4836
## 9 2001 COL 0.4830
## 10 2020 LAN 0.4829
Problem 12: Run a query to find the ten players with
at least 5 career at-bats (AB) (that is, not 5 AB per season) with the
highest home run percentage (home runs divided by at bats) from the
Batting table
Problem 13: Run a query to determine how many
players with at least 5 at-bats (AB) have a career hit percentage (total
hits (H) divided by at bats) of at least 0.3. Confirm that
it is 647
Problem 14: Run a query to find all players whose total career hits exceeds the average total career hitsk across all players since 1980
Problem 15: Determine the number of players who have at least 5 seasons in which they hit 20 or more home runs since 1980