Appendix D — SQL primer

This brief appendix aims to serve two groups of users. The first group comprises those who have followed the material above and would like a quick introduction to SQL. The second group comprises those who know SQL (say, SAS’s PROC SQL) would like a quick introduction to the dplyr-based approach to R that we use in this book.

D.1 What is dplyr?

Throughout this course, we make use of the R package dplyr, which is a core part of the Tidyverse. From the Tidyverse website:

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • summarize() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows

Prior to the advent of the dplyr in 2014, most users of R would have used base R functions and operators, such as subset, $, and [. However, dplyr provides a much more consistent framework for manipulating data that is easier to learn, and especially so for users familiar with SQL. This last point makes sense, as SQL provides something like a “grammar of data manipulation” of its own. In fact, each dplyr verb has an SQL equivalent keyword or concept, as we see below.

D.2 A translation table

dplyr verb (R) SQL equivalent
|> FROM
select SELECT
filter WHERE
group_by GROUP BY
arrange ORDER BY
mutate used-defined columns
summarize used-defined aggregate columns

As in earlier chapters, let’s set up a database connection that we can use within R.

library(DBI)
library(dplyr, warn.conflicts =  FALSE)

pg <- dbConnect(RPostgres::Postgres())

D.3 SQL terms SELECT and FROM

Let’s begin with a basic SQL query.

SELECT date, vwretd, ewretd
FROM crsp.dsi
Displaying records 1 - 10
date vwretd ewretd
1925-12-31 NA NA
1926-01-02 0.0056892 0.0095159
1926-01-04 0.0007065 0.0057804
1926-01-05 -0.0048213 -0.0019270
1926-01-06 -0.0004233 0.0011816
1926-01-07 0.0049883 0.0084531
1926-01-08 -0.0032378 -0.0016890
1926-01-09 0.0022093 0.0033121
1926-01-11 -0.0085396 -0.0099432
1926-01-12 -0.0009290 -0.0036231

This query

  • extracts the data in three columns (date, vwretd, ewretd) (the first line)
  • from the table named crsp.dsi (the second line)

While the syntax is a bit different, SQL’s SELECT operates very much like select from dplyr.

Translating this into dplyr code using the pipe operator (|>), it’s easy to see that the order of presentation is one of the big differences between SQL and the dplyr approach.

While we first need to set up the table on which to operate, once we’ve done so we can see that the FROM is implicit in the |> operator.

crsp.dsi <- tbl(pg, Id(schema = "crsp", table = "dsi"))

crsp.dsi |>
  select(date, vwretd, ewretd)  
date vwretd ewretd
1925-12-31 NA NA
1926-01-02 0.0057 0.0095
1926-01-04 0.0007 0.0058
1926-01-05 -0.0048 -0.0019
1926-01-06 -0.0004 0.0012

D.4 SQL WHERE

The filter verb from dplyr corresponds to WHERE in SQL. Note that WHERE goes after the FROM clause in SQL, though in practice the query optimizer will option execute the filter implied by the WHERE clause before executing other elements of a query (it would be wasteful to perform calculations on data that are going to be filtered out later on).

SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
1 records
date vwretd ewretd
2015-01-02 -0.0002423 0.0014701

In dplyr, this query could be written as follows:

crsp.dsi |>
  select(date, vwretd, ewretd) |>
  filter(date == '2015-01-02')
date vwretd ewretd
2015-01-02 -2e-04 0.0015

But it could also be written with the filter coming first.

crsp.dsi |>
  filter(date == '2015-01-02') |>
  select(date, vwretd, ewretd) 
date vwretd ewretd
2015-01-02 -2e-04 0.0015

Note that each of these dplyr queries is implemented in the same way by the PostgreSQL query optimizer, as can be seen by examining the output from explain().

crsp.dsi |>
  select(date, vwretd, ewretd) |>
  filter(date == '2015-01-02') |>
  explain()
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')

<PLAN>
                                                               QUERY PLAN
1 Index Scan using dsi_date_idx on dsi  (cost=0.29..8.30 rows=1 width=20)
2                                 Index Cond: (date = '2015-01-02'::date)
crsp.dsi |>
  filter(date == '2015-01-02') |>
  select(date, vwretd, ewretd) |>
  explain()
<SQL>
SELECT "date", "vwretd", "ewretd"
FROM "crsp"."dsi"
WHERE ("date" = '2015-01-02')

<PLAN>
                                                               QUERY PLAN
1 Index Scan using dsi_date_idx on dsi  (cost=0.29..8.30 rows=1 width=20)
2                                 Index Cond: (date = '2015-01-02'::date)

We can see that these are in turn viewed by the query optimizer as equivalents of the SQL above.

EXPLAIN
SELECT date, vwretd, ewretd
FROM crsp.dsi
WHERE date = '2015-01-02'
2 records
QUERY PLAN
Index Scan using dsi_date_idx on dsi (cost=0.29..8.30 rows=1 width=20)
Index Cond: (date = ‘2015-01-02’::date)

This illustrates a nice feature of SQL engines, namely that you can leave some of the details of the query to the software and focus on the higher-level requirements of your query (i.e., which observations to keep, etc.).

D.5 SQL ORDER BY

A R data frame can be considered as a list of vectors, where each vector has a well-defined order. In contrast, SQL tables should be considered to have any particular order absent an ORDER BY clause (see here for discussion).

An ORDER BY clause is placed at the end of an SQL query (reflecting in this case the order of operations) followed by the columns that the query is to be sorted on.

SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd
Displaying records 1 - 10
date vwretd ewretd
1987-10-19 -0.1713457 -0.1038971
1929-10-29 -0.1195386 -0.1348561
2020-03-16 -0.1181675 -0.1076310
1929-10-28 -0.1130215 -0.0952016
1929-11-06 -0.0970203 -0.0838187
2020-03-12 -0.0968405 -0.1005971
1933-07-21 -0.0932724 -0.1354675
2008-10-15 -0.0898179 -0.0707809
2008-12-01 -0.0894055 -0.0782403
1933-07-20 -0.0849734 -0.0931411

Thus ORDER BY works very much like arrange.

crsp.dsi |>
  select(date, vwretd, ewretd) |>
  arrange(vwretd)
date vwretd ewretd
1987-10-19 -0.1713 -0.1039
1929-10-29 -0.1195 -0.1349
2020-03-16 -0.1182 -0.1076
1929-10-28 -0.1130 -0.0952
1929-11-06 -0.0970 -0.0838

To reverse the order, use the DESC keyword after the relevant variable.

SELECT date, vwretd, ewretd
FROM crsp.dsi
ORDER BY vwretd DESC, date
Displaying records 1 - 10
date vwretd ewretd
1925-12-31 NA NA
1933-03-15 0.1568376 0.2030201
1929-10-30 0.1221548 0.1257142
2008-10-13 0.1149183 0.1074221
1931-10-06 0.1112778 0.0887420
1932-09-21 0.1098460 0.1130083
2008-10-28 0.0953434 0.0503864
2020-03-24 0.0915562 0.0821751
1939-09-05 0.0891618 0.1474485
1987-10-21 0.0866138 0.0692920

Thus DESC is very much equivalent to desc in dplyr apart from the minor syntactical difference that desc() is written as a function of the relevant variable.

crsp.dsi |>
  select(date, vwretd, ewretd) |>
  arrange(desc(vwretd), date)
date vwretd ewretd
1925-12-31 NA NA
1933-03-15 0.1568 0.2030
1929-10-30 0.1222 0.1257
2008-10-13 0.1149 0.1074
1931-10-06 0.1113 0.0887

It is important to note that some differences in the implementation details between data frames in R and tables in SQL are seen with arrange. While a data frame in R is fundamentally a list of ordered vectors, SQL tables are best thought of a sets of rows without a natural order. As such, one needs to pay attention to when arrange() is implemented in a query that depends on it.

To illustrate this issue, suppose we get get stock returns for Apple and order them from largest to smallest returns.1

crsp.dsf <- tbl(pg, Id(schema = "crsp", table = "dsf"))

apple <-
  crsp.dsf |>
  filter(permno == 14593L) |>
  select(permno, date, ret) 

apple |>
  arrange(desc(ret)) 
permno date ret
14593 1980-12-12 NA
14593 1997-08-06 0.3323
14593 1998-01-02 0.2381
14593 1996-07-18 0.2370
14593 1998-01-06 0.1929

The following code illustrates that the order created above is lost when we merged with crsp.dsi. Fortunately, dplyr provides a warning when we use arrange() in such situations.

apple |>
  inner_join(crsp.dsi, by = "date") |>
  select(permno, date, ret, vwretd)
permno date ret vwretd
14593 1980-12-12 NA 0.0149
14593 1980-12-15 -0.0521 0.0016
14593 1980-12-16 -0.0732 0.0075
14593 1980-12-17 0.0247 0.0162
14593 1980-12-18 0.0289 0.0041

D.6 SQL approach to mutate()

Creating new variables in SQL is quite straightforward. In addition to basic mathematical operators, such as + and *, PostgreSQL has an extensive array of functions available for use in queries.

The names of calculated variables can be specified using AS, as can be seen in the following query.

SELECT date, 
  vwretd - ewretd AS ret_diff,
  ln(1 + vwretd) AS log_vwretd,
  date_part('year', date) AS year
FROM crsp.dsi
Displaying records 1 - 10
date ret_diff log_vwretd year
1925-12-31 NA NA 1925
1926-01-02 -0.0038267 0.0056731 1926
1926-01-04 -0.0050740 0.0007062 1926
1926-01-05 -0.0028943 -0.0048330 1926
1926-01-06 -0.0016049 -0.0004234 1926
1926-01-07 -0.0034649 0.0049759 1926
1926-01-08 -0.0015489 -0.0032431 1926
1926-01-09 -0.0011027 0.0022069 1926
1926-01-11 0.0014036 -0.0085763 1926
1926-01-12 0.0026942 -0.0009294 1926

The translation of this query into dplyr requires the use of the mutate verb, but this translation is straightforward, as can be seen below.

crsp.dsi |>
  mutate(ret_diff = vwretd - ewretd, 
         log_vwretd = ln(1 + vwretd),
         year = date_part('year', date)) |>
  select(date, ret_diff, log_vwretd, year) 
date ret_diff log_vwretd year
1925-12-31 NA NA 1925
1926-01-02 -0.0038 0.0057 1926
1926-01-04 -0.0051 0.0007 1926
1926-01-05 -0.0029 -0.0048 1926
1926-01-06 -0.0016 -0.0004 1926

Note that in the code above, we are using the function ln() to get the natural logarithm, rather than the R function log(). We will discuss this further below, but for now we just note that log() in PostgreSQL returns the base-10 logarithm and—because the calculations are being done by PostgreSQL, not R—we need to use the appropriate PostgreSQL function when using dplyr in this way.2 Similarly, we are using the date_part() function from PostgreSQL, but in this case we could have used year, which is translated by dbplyr to provide the nearest equivalent of the year function found in the lubridate package.

Note that the dplyr code is slightly more verbose than the SQL because the SQL equivalent of mutate is implicit and part of the SELECT portion of the statement.

D.7 SQL GROUP BY and aggregates

One use for group_by in R is to create summary statistics and the like for each of the groups implied by the group_by variables. In SQL, GROUP BY is used with aggregate functions, which are equivalent to functions used with summarize.

Suppose we want to get the maximum and minimum returns, as well as a measure of the volatility of the returns, for each year. In SQL, we could do the following:

SELECT date_part('year', date) AS year,
  max(vwretd) AS max_ret,
  min(vwretd) AS min_ret,
  stddev(ln(1 + vwretd)) AS sd_ret
FROM crsp.dsi
GROUP BY year
ORDER BY sd_ret DESC
Displaying records 1 - 10
year max_ret min_ret sd_ret
1925 NA NA NA
1932 0.1098460 -0.0731177 0.0274218
1933 0.1568376 -0.0932724 0.0256663
2008 0.1149183 -0.0898179 0.0253972
2020 0.0915562 -0.1181675 0.0213797
1929 0.1221548 -0.1195386 0.0207289
1931 0.1112778 -0.0680485 0.0205423
1987 0.0866138 -0.1713457 0.0182097
2009 0.0691054 -0.0553249 0.0176678
1938 0.0573017 -0.0516716 0.0164642

And the dplyr equivalent would look like this:

crsp.dsi |>
  mutate(year = date_part('year', date)) |>
  group_by(year) |>
  summarize(max_ret = max(vwretd, na.rm = TRUE),
            min_ret = min(vwretd, na.rm = TRUE),
            sd_ret = stddev(ln(1 + vwretd))) |>
  arrange(desc(sd_ret))
year max_ret min_ret sd_ret
1925 NA NA NA
1932 0.1098 -0.0731 0.0274
1933 0.1568 -0.0933 0.0257
2008 0.1149 -0.0898 0.0254
2020 0.0916 -0.1182 0.0214

Note that the na.rm = TRUE arguments could be omitted, as SQL always omits these values. The inclusion of this argument serves only as a reminder to the user of the code that these values are omitted.

D.8 Benefits of using relational databases

  1. Fast random access to on-disk data. Relational databases make it easy to index data, which makes it “alive” rather than “dead” in the sense we use here. We will see that accessing specific observations from large data sets is fast and easy with a relational database. Accessing stock-return data on crsp.dsf in the WRDS PostgreSQL database for Microsoft for a few dates is quite fast.3

  2. Data accessible by any software package. Data stored in a PostgreSQL database is accessible from R, Stata, Python, Perl, … pretty much any software. This has subtle but important advantages for data collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from Web data. Finding people with the skills to do this is going to be a lot easier if you don’t need them to have Stata skills, but can work with a Python guru. With a relational database, a Stata user is unlikely to care much if the data coming out of PostgreSQL was put there using Python. In another example, you may have written some complicated Perl code to generate some data sets, and occasionally need to tweak the code to accommodate changes in the source data. As long as the data sets are stored in a PostgreSQL database, you will have no need to translate the entire code into something you still know how to code in. This benefit is probably also the primary driver of WRDS’s decision to add a PostgreSQL database to its offerings. With more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was surely not viable.

  3. Data accessible from anywhere. Data on the WRDS PostgreSQL server can be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere) by anyone with a WRDS account. For example, data stored on a PostgreSQL server in Boston can be easily accessed by co-authors in California, Australia, or Singapore.4

  4. Centralised processing power. In this course, we will see a number of cases where the heavy part of data processing for some tasks is accomplished on the WRDS PostgreSQL server even when executing R code locally. While SAS has similar functionality with RSUBMIT, running queries on a remote PostgreSQL server from RStudio is much more seamless.

  5. Data can be (and should be) properly typed. PostgreSQL in particular has a much richer set of data types than a typical statistical programming package. For example, everything in SAS is fundamentally of either a floating-point numerical type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. (Stata is fairly similar.) In contrast, PostgreSQL offers standard types such as strings, floating-point, integers, large integers, along with timestamps with time zones, JSON, XML, and other types. While these data types are no doubt implemented in ways similar to how, say, Stata handles dates, the difference is that a user is never forced to be aware of these details. Most of the commonly used types are recognized fairly seamlessly and converted to appropriate types when data are brought into R (e.g., timestamps with time zones) or Python.

    We say “can be” in the heading to this section because whether the data actually have appropriate types is a function of how much care was used in getting the data into the database in the first place. For example, at the time of writing, WRDS is often fairly careless about data types. So variables that are naturally integers, such as firm identifiers like CIKs or PERMNOs, are double precision rather than integer types.5 Text variables are generally character varying, even though PostgreSQL’s text type is simpler and equally efficient. And timestamps are invariably of type timestamp without time zone when they arguably should always be timestamp with time zone, as a timestamp has no clear meaning without a time zone.

  6. Textual data are handled with ease. Stata says that “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Yet prior to Stata 14, Stata did not support it. SAS files provided by WRDS have traditionally been encoded in some kind of Latin encoding. And SAS and Stata have limitations with regard to the size of text fields.

    In contrast, PostgreSQL uses Unicode by default, as do Python and R, and there is effectively no constraint on the size of text fields in PostgreSQL.


  1. Note that the NA value from Apple’s first day of trading is viewed as being the largest value due to the way that SQL collates NULL values.↩︎

  2. See the PostgreSQL documentation for mathematical functions.↩︎

  3. To illustrate, pulling returns and prices for Microsoft using a remote query to the WRDS PostgreSQL takes 27 milliseconds. And most of this is due to remote access; a local query takes less than one millisecond.↩︎

  4. This book only requires PostgreSQL data stored in the the WRDS database, but we include a brief overview of the steps required to set up one’s own research server in Appendix C.↩︎

  5. This can have significant implications for performance.↩︎