Chapter 4 Relational databases
4.1 Why use a database?
There are limitations on the types of data that R handles well. Since all data being manipulated by R are resident in memory, and several copies of the data can be created during execution of a function, R is not well suited to extremely large data sets. Data objects that are more than a (few) hundred megabytes in size can cause R to run out of memory, particularly on a 32-bit operating system.
R does not easily support concurrent access to data. That is, if more than one user is accessing, and perhaps updating, the same data, the changes made by one user will not be visible to the others.
R does support persistence of data, in that you can save a data object or an entire worksheet from one session and restore it at the subsequent session, but the format of the stored data is specific to R and not easily manipulated by other systems.
Database management systems (DBMSs) and, in particular, relational DBMSs (RDBMSs) are designed to do all of these things well. Their strengths are
- To provide fast access to selected parts of large databases.
- Powerful ways to summarize and cross-tabulate columns in databases.
- Store data in more organized ways than the rectangular grid model of spreadsheets and R data frames.
- Concurrent access from multiple clients running on multiple hosts while enforcing security constraints on access to the data.
- Ability to act as a server to a wide range of clients.
The sort of statistical applications for which DBMS might be used are to extract a 10% sample of the data, to cross-tabulate data to produce a multi-dimensional contingency table, and to extract data group by group from a database for separate analysis.
Increasingly OSes are themselves making use of DBMSs for these reasons, so it is nowadays likely that one will be already installed on your (non-Windows) OS. Akonadi is used by KDE4 to store personal information. Several macOS applications, including Mail and Address Book, use SQLite.
4.2 Overview of RDBMSs
Traditionally there had been large (and expensive) commercial RDBMSs (Informix; Oracle; Sybase; IBM’s DB2; Microsoft SQL Server on Windows) and academic and small-system databases (such as MySQL4, PostgreSQL, Microsoft Access, …), the former marked out by much greater emphasis on data security features. The line is blurring, with MySQL and PostgreSQL having more and more high-end features, and free ‘express’ versions being made available for the commercial DBMSs.
There are other commonly used data sources, including spreadsheets, non-relational databases and even text files (possibly compressed). Open Database Connectivity (ODBC) is a standard to use all of these data sources. It originated on Windows (see https://msdn.microsoft.com/en-us/library/ms710252%28v=vs.85%29.aspx) but is also implemented on Linux/Unix/macOS.
All of the packages described later in this chapter provide clients to client/server databases. The database can reside on the same machine or (more often) remotely. There is an ISO standard (in fact several: SQL92 is ISO/IEC 9075, also known as ANSI X3.135-1992, and SQL99 is coming into use) for an interface language called SQL (Structured Query Language, sometimes pronounced ‘sequel’: see Bowman et al. 1996 and Kline and Kline 2001) which these DBMSs support to varying degrees.
4.2.1 SQL queries
The more comprehensive R interfaces generate SQL behind the scenes for common operations, but direct use of SQL is needed for complex operations in all. Conventionally SQL is written in upper case, but many users will find it more convenient to use lower case in the R interface functions.
A relational DBMS stores data as a database of tables (or relations) which are rather similar to R data frames, in that they are made up of columns or fields of one type (numeric, character, date, currency, …) and rows or records containing the observations for one entity.
SQL ‘queries’ are quite general operations on a relational database. The classical query is a SELECT statement of the type
SELECT State, Murder FROM USArrests WHERE Rape > 30 ORDER BY Murder
SELECT t.sch, c.meanses, t.sex, t.achieve
FROM student as t, school as c WHERE t.sch = c.id
SELECT sex, COUNT(*) FROM student GROUP BY sex
SELECT sch, AVG(sestat) FROM student GROUP BY sch LIMIT 10
The first of these selects two columns from the R data frame USArrests
that has been copied across to a database table, subsets on a third column and asks the results be sorted. The second performs a database join on two tables student
and school
and returns four columns. The third and fourth queries do some cross-tabulation and return counts or averages. (The five aggregation functions are COUNT(*) and SUM, MAX, MIN and AVG, each applied to a single column.)
SELECT queries use FROM to select the table, WHERE to specify a condition for inclusion (or more than one condition separated by AND or OR), and ORDER BY to sort the result. Unlike data frames, rows in RDBMS tables are best thought of as unordered, and without an ORDER BY statement the ordering is indeterminate. You can sort (in lexicographical order) on more than one column by separating them by commas. Placing DESC after an ORDER BY puts the sort in descending order.
SELECT DISTINCT queries will only return one copy of each distinct row in the selected table.
The GROUP BY clause selects subgroups of the rows according to the criterion. If more than one column is specified (separated by commas) then multi-way cross-classifications can be summarized by one of the five aggregation functions. A HAVING clause allows the select to include or exclude groups depending on the aggregated value.
If the SELECT statement contains an ORDER BY statement that produces a unique ordering, a LIMIT clause can be added to select (by number) a contiguous block of output rows. This can be useful to retrieve rows a block at a time. (It may not be reliable unless the ordering is unique, as the LIMIT clause can be used to optimize the query.)
There are queries to create a table (CREATE TABLE, but usually one copies a data frame to the database in these interfaces), INSERT or DELETE or UPDATE data. A table is destroyed by a DROP TABLE ‘query’.
Kline and Kline (2001) discuss the details of the implementation of SQL in Microsoft SQL Server 2000, Oracle, MySQL and PostgreSQL.
4.2.2 Data types
Data can be stored in a database in various data types. The range of data types is DBMS-specific, but the SQL standard defines many types, including the following that are widely implemented (often not by the SQL name).
-
float(p)
-
Real number, with optional precision. Often called
real
ordouble
ordouble precision
. -
integer
-
32-bit integer. Often called
int
. -
smallint
-
16-bit integer
-
character(n)
-
fixed-length character string. Often called
char
. -
character varying(n)
-
variable-length character string. Often called
varchar
. Almost always has a limit of 255 chars. -
boolean
-
true or false. Sometimes called
bool
orbit
. -
date
-
calendar date
-
time
-
time of day
-
timestamp
-
date and time
There are variants on time
and timestamp
, with timezone
. Other types widely implemented are text
and blob
, for large blocks of text and binary data, respectively.
The more comprehensive of the R interface packages hide the type conversion issues from the user.
4.3 R interface packages
There are several packages available on CRAN to help R communicate with DBMSs. They provide different levels of abstraction. Some provide means to copy whole data frames to and from databases. All have functions to select data within the database via SQL queries, and to retrieve the result as a whole as a data frame or in pieces (usually as groups of rows).
All except RODBC are tied to one DBMS, but there has been a proposal for a unified ‘front-end’ package DBI (https://developer.r-project.org/db) in conjunction with a ‘back-end’, the most developed of which is RMySQL. Also on CRAN are the back-ends ROracle, RPostgreSQL and RSQLite (which works with the bundled DBMS SQLite
, https://www.sqlite.org) and RJDBC (which uses Java and can connect to any DBMS that has a JDBC driver).
PL/R (http://www.joeconway.com/plr/
) is a project to embed R into PostgreSQL.
Package RMongo provides an R interface to a Java client for ‘MongoDB’ (https://en.wikipedia.org/wiki/MongoDB) databases, which are queried using JavaScript rather than SQL. Package mongolite is another client using mongodb’s C driver.
4.3.1 Packages using DBI
Package RMySQL on CRAN provides an interface to the MySQL database system (see https://www.mysql.com and Dubois, 2000) or its fork MariaDB (see https://mariadb.org/). The description here applies to versions 0.5-0
and later: earlier versions had a substantially different interface. The current version requires the DBI package, and this description will apply with minor changes to all the other back-ends to DBI.
MySQL exists on Unix/Linux/macOS and Windows: there is a ‘Community Edition’ released under GPL but commercial licenses are also available. MySQL was originally a ‘light and lean’ database. (It preserves the case of names where the operating file system is case-sensitive, so not on Windows.)
The call dbDriver(“MySQL”)
returns a database connection manager object, and then a call to dbConnect
opens a database connection which can subsequently be closed by a call to the generic function dbDisconnect
. Use dbDriver(“Oracle”)
, dbDriver(“PostgreSQL”)
or dbDriver(“SQLite”)
with those DBMSs and packages ROracle, RPostgreSQL or RSQLite respectively.
SQL queries can be sent by either dbSendQuery
or dbGetQuery
. dbGetquery
sends the query and retrieves the results as a data frame. dbSendQuery
sends the query and returns an object of class inheriting from “DBIResult”
which can be used to retrieve the results, and subsequently used in a call to dbClearResult
to remove the result.
Function fetch
is used to retrieve some or all of the rows in the query result, as a list. The function dbHasCompleted
indicates if all the rows have been fetched, and dbGetRowCount
returns the number of rows in the result.
These are convenient interfaces to read/write/test/delete tables in the database. dbReadTable
and dbWriteTable
copy to and from an R data frame, mapping the row names of the data frame to the field row_names
in the MySQL
table.
> library(RMySQL) # will load DBI as well
## open a connection to a MySQL database
> con <- dbConnect(dbDriver("MySQL"), dbname = "test")
## list the tables in the database
> dbListTables(con)
## load a data frame into the database, deleting any existing copy
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests, overwrite = TRUE)
TRUE
> dbListTables(con)
[1] "arrests"
## get the whole table
> dbReadTable(con, "arrests")
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
...
## Select from the loaded table
> dbGetQuery(con, paste("select row_names, Murder from arrests",
"where Rape > 30 order by Murder"))
row_names Murder
1 Colorado 7.9
2 Arizona 8.1
3 California 9.0
4 Alaska 10.0
5 New Mexico 11.4
6 Michigan 12.1
7 Nevada 12.2
8 Florida 15.4
> dbRemoveTable(con, "arrests")
> dbDisconnect(con)
4.3.2 Package RODBC
Package RODBC on CRAN provides an interface to database sources supporting an ODBC interface. This is very widely available, and allows the same R code to access different database systems. RODBC runs on Unix/Linux, Windows and macOS, and almost all database systems provide support for ODBC. We have tested Microsoft SQL Server, Access, MySQL, PostgreSQL, Oracle and IBM DB2 on Windows and MySQL, MariaDB, Oracle, PostgreSQL and SQLite on Linux.
ODBC is a client-server system, and we have happily connected to a DBMS running on a Unix server from a Windows client, and vice versa.
On Windows ODBC support is part of the OS. On Unix/Linux you will need an ODBC Driver Manager such as unixODBC (http://www.unixODBC.org) or iOBDC (http://www.iODBC.org: this is pre-installed in macOS) and an installed driver for your database system.
Windows provides drivers not just for DBMSs but also for Excel (.xls) spreadsheets, DBase (.dbf) files and even text files. (The named applications do not need to be installed. Which file formats are supported depends on the versions of the drivers.) There are versions for Excel and Access 2007/2010 (go to https://www.microsoft.com/en-us/download/default.aspx, and search for ‘Office ODBC’, which will lead to AccessDatabaseEngine.exe), the ‘2007 Office System Driver’ (the latter has a version for 64-bit Windows, and that will also read earlier versions).
On macOS the Actual Technologies (https://www.actualtech.com/product_access.php) drivers provide ODBC interfaces to Access databases and to Excel spreadsheets (not including Excel 2007/2010).
Many simultaneous connections are possible. A connection is opened by a call to odbcConnect
or odbcDriverConnect
(which on the Windows GUI allows a database to be selected via dialog boxes) which returns a handle used for subsequent access to the database. Printing a connection will provide some details of the ODBC connection, and calling odbcGetInfo
will give details on the client and server.
A connection is closed by a call to close
or odbcClose
, and also (with a warning) when not R object refers to it and at the end of an R session.
Details of the tables on a connection can be found using sqlTables
.
Function sqlSave
copies an R data frame to a table in the database, and sqlFetch
copies a table in the database to an R data frame.
An SQL query can be sent to the database by a call to sqlQuery
. This returns the result in an R data frame. (sqlCopy
sends a query to the database and saves the result as a table in the database.) A finer level of control is attained by first calling odbcQuery
and then sqlGetResults
to fetch the results. The latter can be used within a loop to retrieve a limited number of rows at a time, as can function sqlFetchMore
.
Here is an example using PostgreSQL, for which the ODBC driver maps column and data frame names to lower case. We use a database testdb
we created earlier, and had the DSN (data source name) set up in ~/.odbc.ini under unixODBC
. Exactly the same code worked using MyODBC to access a MySQL database under Linux or Windows (where MySQL also maps names to lowercase). Under Windows, DSNs are set up in the ODBC applet in the Control Panel (‘Data Sources (ODBC)’ in the ‘Administrative Tools’ section).
> library(RODBC)
## tell it to map names to l/case
> channel <- odbcConnect("testdb", uid="ripley", case="tolower")
## load a data frame into the database
> data(USArrests)
> sqlSave(channel, USArrests, rownames = "state", addPK = TRUE)
> rm(USArrests)
## list the tables in the database
> sqlTables(channel)
TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
1 usarrests TABLE
## list it
> sqlFetch(channel, "USArrests", rownames = "state")
murder assault urbanpop rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
...
## an SQL query, originally on one line
> sqlQuery(channel, "select state, murder from USArrests
where rape > 30 order by murder")
state murder
1 Colorado 7.9
2 Arizona 8.1
3 California 9.0
4 Alaska 10.0
5 New Mexico 11.4
6 Michigan 12.1
7 Nevada 12.2
8 Florida 15.4
## remove the table
> sqlDrop(channel, "USArrests")
## close the connection
> odbcClose(channel)
As a simple example of using ODBC under Windows with a Excel spreadsheet, we can read from a spreadsheet by
> library(RODBC)
> channel <- odbcConnectExcel("bdr.xls")
## list the spreadsheets
> sqlTables(channel)
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 C:\\bdr NA Sheet1$ SYSTEM TABLE NA
2 C:\\bdr NA Sheet2$ SYSTEM TABLE NA
3 C:\\bdr NA Sheet3$ SYSTEM TABLE NA
4 C:\\bdr NA Sheet1$Print_Area TABLE NA
## retrieve the contents of sheet 1, by either of
> sh1 <- sqlFetch(channel, "Sheet1")
> sh1 <- sqlQuery(channel, "select * from [Sheet1$]")
Notice that the specification of the table is different from the name returned by sqlTables
: sqlFetch
is able to map the differences.