9 min read

Accessing WRDS data

WRDS provides some sample code for accessing the WRDS PostgreSQL database here. The problem with this code examples is that they use dplyr, which offers a “lazy evaluation” approach that is vastly superior accessing remote data sources.

Another advantage of using a dplyr approach is that the code carries over to local data sources, whereas raw SQL would not.

Below, I translate each example provided by WRDS into equivalent dplyr.

Sys.setenv(PGUSER="iangow")
library(DBI)
Sys.setenv(PGHOST="wrds-pgdata.wharton.upenn.edu", 
           PGPORT=9737L, PGDATABASE="wrds")
wrds <- dbConnect(RPostgres::Postgres())

res <- dbSendQuery(wrds, "select * from crsp.dsf")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
##       cusip permno permco issuno hexcd hsiccd       date bidlo askhi     prc
## 1  68391610  10000   7952  10396     3   3990 1986-01-07 2.375 2.750 -2.5625
## 2  68391610  10000   7952  10396     3   3990 1986-01-08 2.375 2.625 -2.5000
## 3  68391610  10000   7952  10396     3   3990 1986-01-09 2.375 2.625 -2.5000
## 4  68391610  10000   7952  10396     3   3990 1986-01-10 2.375 2.625 -2.5000
## 5  68391610  10000   7952  10396     3   3990 1986-01-13 2.500 2.750 -2.6250
## 6  68391610  10000   7952  10396     3   3990 1986-01-14 2.625 2.875 -2.7500
## 7  68391610  10000   7952  10396     3   3990 1986-01-15 2.750 3.000 -2.8750
## 8  68391610  10000   7952  10396     3   3990 1986-01-16 2.875 3.125 -3.0000
## 9  68391610  10000   7952  10396     3   3990 1986-01-17 2.875 3.125 -3.0000
## 10 68391610  10000   7952  10396     3   3990 1986-01-20 2.875 3.125 -3.0000
##      vol         ret bid ask shrout cfacpr cfacshr openprc numtrd        retx
## 1   1000          NA  NA  NA   3680      1       1      NA     NA          NA
## 2  12800 -0.02439024  NA  NA   3680      1       1      NA     NA -0.02439024
## 3   1400  0.00000000  NA  NA   3680      1       1      NA     NA  0.00000000
## 4   8500  0.00000000  NA  NA   3680      1       1      NA     NA  0.00000000
## 5   5450  0.05000000  NA  NA   3680      1       1      NA     NA  0.05000000
## 6   2075  0.04761905  NA  NA   3680      1       1      NA     NA  0.04761905
## 7  22490  0.04545455  NA  NA   3680      1       1      NA     NA  0.04545455
## 8  10900  0.04347826  NA  NA   3680      1       1      NA     NA  0.04347826
## 9   8470  0.00000000  NA  NA   3680      1       1      NA     NA  0.00000000
## 10  1000  0.00000000  NA  NA   3680      1       1      NA     NA  0.00000000

The WRDS page remarks here that

IMPORTANT: Setting n=10 limits the results to 10 records (also called observations). The table crsp.dsf is CRSP’s Daily Stock File and tracks thousands of stocks over almost a hundred years - such a query that returns all records would take a very long time. In reality, most queries are far more specific, as shown in the examples below. It is highly recommended to develop your code using such a limit, then simply remove that limit (by by setting n =-1) when you are ready to run your final code.

But that is precisely where dplyr shines.

Here’s the dplyr approach.

library(dplyr, warn.conflicts = FALSE)

dsf <- tbl(wrds, sql("SELECT * FROM crsp.dsf"))
dsf
## # Source:   SQL [?? x 20]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
##    cusip permno permco issuno hexcd hsiccd date       bidlo askhi   prc   vol
##    <chr>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <date>     <dbl> <dbl> <dbl> <dbl>
##  1 3672…  10001   7953  10398     2   4925 1992-08-20  12    12.5 -12.2     0
##  2 3672…  10001   7953  10398     2   4925 1992-08-21  12    12.5 -12.2     0
##  3 3672…  10001   7953  10398     2   4925 1992-08-24  12.5  12.5  12.5   100
##  4 3672…  10001   7953  10398     2   4925 1992-08-25  12    12.5 -12.2     0
##  5 3672…  10001   7953  10398     2   4925 1992-08-26  12.5  13    12.5  6267
##  6 3672…  10001   7953  10398     2   4925 1992-08-27  12.5  12.8  12.5  3600
##  7 3672…  10001   7953  10398     2   4925 1992-08-28  13    13    13     200
##  8 3672…  10001   7953  10398     2   4925 1992-08-31  13    13    13    1000
##  9 3672…  10001   7953  10398     2   4925 1992-09-01  12.5  13    13     720
## 10 3672…  10001   7953  10398     2   4925 1992-09-02  13    13.2  13.2  1600
## # … with more rows, and 9 more variables: ret <dbl>, bid <dbl>, ask <dbl>,
## #   shrout <dbl>, cfacpr <dbl>, cfacshr <dbl>, openprc <dbl>, numtrd <dbl>,
## #   retx <dbl>

The WRDS page continues:

Datasets often contain a large number of variables (column headers) such as date, ticker, cusip, price, or a host of other values depending on the dataset. Limiting the number of variables returned in queries speeds up the execution time and decreases the size of the returned data. Once you looked at the metadata and see the available variables, you probably want to specify only those you are interested in. You can do this by specifying each variable to query explicitly using the select statement, instead of selecting all (using the asterisk * which matches all variables).

Here is the WRDS code:

res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
##       cusip permno       date bidlo askhi
## 1  36720410  10001 1992-08-20  12.0 12.50
## 2  36720410  10001 1992-08-21  12.0 12.50
## 3  36720410  10001 1992-08-24  12.5 12.50
## 4  36720410  10001 1992-08-25  12.0 12.50
## 5  36720410  10001 1992-08-26  12.5 13.00
## 6  36720410  10001 1992-08-27  12.5 12.75
## 7  36720410  10001 1992-08-28  13.0 13.00
## 8  36720410  10001 1992-08-31  13.0 13.00
## 9  36720410  10001 1992-09-01  12.5 13.00
## 10 36720410  10001 1992-09-02  13.0 13.25

But we already have dsf defined above, so we can do this using dplyr.

dsf %>%
  select(cusip, permno, date, bidlo, askhi)
## # Source:   lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
##    cusip    permno date       bidlo askhi
##    <chr>     <dbl> <date>     <dbl> <dbl>
##  1 36720410  10001 2008-04-07  8.65  8.99
##  2 36720410  10001 2008-04-08  8.58  9.45
##  3 36720410  10001 2008-04-09  8.75  8.90
##  4 36720410  10001 2008-04-10  8.75  9.45
##  5 36720410  10001 2008-04-11  8.90  9.09
##  6 36720410  10001 2008-04-14  9.10  9.40
##  7 36720410  10001 2008-04-15  8.56  9.05
##  8 36720410  10001 2008-04-16  9.17  9.19
##  9 36720410  10001 2008-04-17  9     9.10
## 10 36720410  10001 2008-04-18  8.93  9.05
## # … with more rows

WRDS then discussing WHERE clauses.

You can also further refine your query by selecting data that meets a certain criteria for one or more of these variables. You can limit the returned results to data which has an askhi value above 2500 and a bidlo value under 2000 as follows.

res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where askhi > 2500
                   and bidlo < 2000")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
## [1] cusip  permno date   bidlo  askhi 
## <0 rows> (or 0-length row.names)

Again, dplyr is more elegant.

dsf %>%
  select(cusip, permno, date, bidlo, askhi) %>%
  filter(askhi > 2500, bidlo < 2000)
## # Source:   lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## # … with 5 variables: cusip <chr>, permno <dbl>, date <date>, bidlo <dbl>,
## #   askhi <dbl>

It turns out that (as of 2017) there are only 27 records that match the above criteria.

Actually, it seems there are none!

Searching by Date

One of the more common methods of querying data is by date. WRDS uses the date notation convention of yyyy-mm-dd, so January 4th, 2013, the first trading day of 2013, would be formatted as 2013-01-04. Dates in your SQL queries must be surrounded by single quotes.

Well, this is ISO 8601 and PostgreSQL, not WRDS. But, fine.

To query by date:

res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date = '2013-01-04'")
data <- dbFetch(res, n=-1)
dbClearResult(res)

And the dplyr version:

dsf %>%
  select(cusip, permno, date, bidlo, askhi) %>%
  filter(date == '2013-01-04')
## # Source:   lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
##    cusip    permno date        bidlo  askhi
##    <chr>     <dbl> <date>      <dbl>  <dbl>
##  1 36720410  10001 2013-01-04  9.28   9.48 
##  2 05978R10  10002 2013-01-04  2.92   2.92 
##  3 00103110  10025 2013-01-04 59.2   61.0  
##  4 46603210  10026 2013-01-04 64.5   65.8  
##  5 29402E10  10028 2013-01-04  5.11   5.37 
##  6 72913210  10032 2013-01-04 25.9   26.5  
##  7 77467X10  10044 2013-01-04 10.9   11.2  
##  8 41043F20  10051 2013-01-04 28.4   29.2  
##  9 00621210  10065 2013-01-04 10.8   10.9  
## 10 02713510  10100 2013-01-04  0.400  0.470
## # … with more rows

Searching by date ranges is similar, as shown in the following example.

To query by date range:

res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date between '2013-01-07'
                   and '2013-01-08'")
data <- dbFetch(res, n=-1)
dbClearResult(res)

And with dplyr:

dsf %>%
  select(cusip, permno, date, bidlo, askhi) %>%
  filter(between(date, '2013-01-07', '2013-01-08'))
## # Source:   lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
##    cusip    permno date        bidlo  askhi
##    <chr>     <dbl> <date>      <dbl>  <dbl>
##  1 36720410  10001 2013-01-07  9.30   9.48 
##  2 05978R10  10002 2013-01-07  2.92   2.98 
##  3 00103110  10025 2013-01-07 59.4   61.5  
##  4 46603210  10026 2013-01-07 63.4   64.8  
##  5 29402E10  10028 2013-01-07  5.09   5.38 
##  6 72913210  10032 2013-01-07 25.7   26.3  
##  7 77467X10  10044 2013-01-07 11.1   11.2  
##  8 41043F20  10051 2013-01-07 28.2   28.8  
##  9 00621210  10065 2013-01-07 10.9   10.9  
## 10 02713510  10100 2013-01-07  0.400  0.420
## # … with more rows

Joining Data from Separate Datasets

Data from separate datasets can be joined and analyzed together. The following example will join the Compustat Fundamentals data set (comp.funda) with Compustat’s pricing dataset (comp.secm), and then query for total assets and liabilities mixed with monthly close price and shares outstanding.

To join and query two Compustat datasets:

res <- dbSendQuery(wrds, "select a.gvkey, a.datadate, a.at, a.lt, b.prccm, b.cshoq
                   from comp.funda a join comp.secm b
                   on a.gvkey = b.gvkey
                   and a.datadate = b.datadate
                   where a.tic = 'IBM'
                   and a.datafmt = 'STD'
                   and a.consol = 'C'
                   and a.indfmt = 'INDL'")
data <- dbFetch(res, n = -1)
dbClearResult(res)
head(data)
##    gvkey   datadate       at       lt    prccm cshoq
## 1 006066 1962-12-31 2112.301  731.700 389.9996    NA
## 2 006066 1963-12-31 2373.857  782.119 506.9994    NA
## 3 006066 1964-12-31 3309.152 1055.072 409.4995    NA
## 4 006066 1965-12-31 3744.917 1166.771 498.9991    NA
## 5 006066 1966-12-31 4660.777 1338.149 371.4997    NA
## 6 006066 1967-12-31 5598.668 1767.067 626.9995    NA

But dplyr does joins (actually, almost anything that can be written in SQL can be done elegantly using dplyr). In this case, I think the code can be rewritten to better reflect the flow of thinking, which is "get some fundamental data and also pull in a couple of elements of data from secm.

funda <- tbl(wrds, sql("SELECT * FROM comp.funda"))
secm <- tbl(wrds, sql("SELECT * FROM comp.secm"))

prices <-
  secm %>%
  select(gvkey, datadate, prccm, cshoq)

data <-
  funda %>%
  filter(datafmt == 'STD', consol == 'C', indfmt == 'INDL') %>%
  filter(tic == 'IBM') %>%
  inner_join(prices, by=c("gvkey", "datadate")) %>%
  select(gvkey, datadate, at, lt, prccm, cshoq)
data
## # Source:   lazy query [?? x 6]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
##    gvkey  datadate      at    lt prccm cshoq
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>
##  1 006066 1962-12-31 2112.  732.  390.   NA 
##  2 006066 1963-12-31 2374.  782.  507.   NA 
##  3 006066 1964-12-31 3309. 1055.  409.   NA 
##  4 006066 1965-12-31 3745. 1167.  499.   NA 
##  5 006066 1966-12-31 4661. 1338.  371.   NA 
##  6 006066 1967-12-31 5599. 1767.  627.   NA 
##  7 006066 1968-12-31 6743. 2174.  315.   NA 
##  8 006066 1969-12-31 7390. 2113.  364.  114.
##  9 006066 1970-12-31 8539. 2592.  318.  115.
## 10 006066 1971-12-31 9576. 2934.  336.  116.
## # … with more rows