Lecture 23: Databases

12 November 2014


  • What databases are, and why
  • SQL
  • Interfacing R and SQL

Reading: Spector, chapter 3; handout on class website


  • A record is a collection of fields
  • A table is a collection of records which all have the same fields (with different values)
  • A database is a collection of tables

Databases vs. Dataframes

  • R's dataframes are actually tables
R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
bunch of related dataframes database

So, Why Do We Need Database Software?

  • Size
    • R keeps its dataframes in memory
    • Industrial databases can be much bigger
    • Work with selected subsets
  • Speed
    • Clever people have worked very hard on getting just what you want fast
  • Concurrency
    • Many users accessing the same database simultaneously)
    • Lots of potential for trouble (two users want to change the same record at once)

The Client-Server Model

  • Databases live on a server, which manages them
  • Users interact with the server through a client program
  • Lets multiple users access the same database simultaneously


  • SQL (structured query language) is the standard for database software
  • Mostly about queries, which are like doing a selection in R
  • Let's look at how SQL does stuff like this


SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC|DESC]
  LIMIT offset,count;
  • SELECT is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables

  • The final semi-colon is obligatory


SELECT PlayerID,yearID,AB,H FROM Batting;

Four columns from table Batting

SELECT * FROM Salaries;

All columns from table Salaries

SELECT * FROM Salaries ORDER BY Salary;

As above, but by ascending value of Salary


Descending order


top 10 salaries


Picking out rows meeting a condition

SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0;


Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")]

Calculated Columns

  • SQL knows about some simple summary statistics:
  • It can do arithmetic
<small>Because `AB` and `H` are integers, and it won't give you a fractional part by default</small>
  • Calculated columns can get names:
SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting
  ORDER BY BattingAvg DESC LIMIT 10;


We can do calculations on value-grouped subsets, like in aggregate or d*ply

SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID

Selecting Again

  • First cut of records is with WHERE
  • Aggregation of recordw with GROUP BY
  • Post-aggregation selection with HAVING
SELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID
  HAVING totalSalary > 200000000


  • So far FROM has just been one table
  • Sometimes we need to combine information from many tables
patient_last patient_first physician_id complaint
Morgan Dexter 37010 insomnia
Soprano Anthony 79676 malaise
Swearengen Albert NA healthy as a goddam horse
Garrett Alma 90091 nerves
Holmes Sherlock 43675 nicotine-patch addiction
physician_last physician_first physicianID plan
Meridian Emmett 37010 UPMC
Melfi Jennifer 79676 BCBS
Cochran Amos 90091 UPMC
Watson John 43675 VA


  • Suppose we want to know which doctors are treating patients for insomnia
    • Complaints are in one table
    • Physicians are in the other
  • In R, we'd use merge to link the tables up by physicianID
  • Here, physician_id or physicianID is acting as the key or unique identifier


  • SQL doesn't have merge, it has JOIN as a modifier to FROM
SELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia"

Creates a (virtual) table linking records where physician_id in one table matches physicianID in the other

  • If the names were the same in the two tables, we could write (e.g.)
SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting

INNER JOIN ... USING links records with the same value of playerID

  • There are some syntax variants here; see the handout


  • LEFT OUTER JOIN includes records from the first table which don't match any record in the 2nd
    • The “extra” records get NA in the 2nd table's fields
  • RIGHT OUTER JOIN is just what you'd think
    • so is FULL OUTER JOIN

Updated Translation Table

R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
bunch of dataframes database
selections, subset SELECT ... FROM ... WHERE ... HAVING
aggregate, d*ply GROUP BY
merge JOIN
order ORDER BY

Connecting R to SQL

  • SQL is a language; database management systems (DMBS) actually implement it and do the work
    • MySQL, SQLite, etc., etc.
  • They all have somewhat different conventions
  • The R package DBI is a unified interface to them
  • Need a separate “driver” for each DBMS

Connecting R to SQL

install.packages("DBI", dependencies = TRUE) # Install DBI
install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite
drv <- dbDriver('SQLite')
con <- dbConnect(drv, dbname="baseball.db")

con is now a persistent connection to the database baseball.db

Connecting R to SQL

dbListTables(con)         # Get tables in the database (returns vector)
dbListFields(con, name)  # List fields in a table
dbReadTable(con, name)   # Import a table as a data frame

Connecting R to SQL

dbGetQuery(conn, statement)
df <- dbGetQuery(con, paste(
  "SELECT nameLast,nameFirst,yearID,salary",
  "FROM Master NATURAL JOIN Salaries"))

Connecting R to SQL

Usual workflow:

  • Load the driver, connect to the right database
  • R sends an SQL query to the DBMS
  • SQL executes the query, sending back a manageably small dataframe
  • R does the actual statistics
  • Close the connection when you're done

Going the Other Way

  • The sqldf package lets you use SQL commands on dataframes
  • Mostly useful if you already know SQL better than R…


  • A database is basically a way of dealing efficiently with lots of potentially huge dataframes
  • SQL is the standard language for telling databases what to do, especially what queries to run
  • Everything in an SQL query is something we've practiced already in R
    • subsetting/selection, aggregation, merging, ordering
  • Connect R to the database, send it an SQL query, analyse the returned dataframe