WRDS provides some sample code for accessing the WRDS PostgreSQL database here.
The problem with this code examples is that they use dplyr
, which offers a “lazy evaluation” approach
that is vastly superior accessing remote data sources.
Another advantage of using a dplyr
approach is that the code carries over to local data sources, whereas raw SQL would not.
Below, I translate each example provided by WRDS into equivalent dplyr
.
Sys.setenv(PGUSER="iangow")
library(DBI)
Sys.setenv(PGHOST="wrds-pgdata.wharton.upenn.edu",
PGPORT=9737L, PGDATABASE="wrds")
wrds <- dbConnect(RPostgres::Postgres())
res <- dbSendQuery(wrds, "select * from crsp.dsf")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
## cusip permno permco issuno hexcd hsiccd date bidlo askhi prc
## 1 68391610 10000 7952 10396 3 3990 1986-01-07 2.375 2.750 -2.5625
## 2 68391610 10000 7952 10396 3 3990 1986-01-08 2.375 2.625 -2.5000
## 3 68391610 10000 7952 10396 3 3990 1986-01-09 2.375 2.625 -2.5000
## 4 68391610 10000 7952 10396 3 3990 1986-01-10 2.375 2.625 -2.5000
## 5 68391610 10000 7952 10396 3 3990 1986-01-13 2.500 2.750 -2.6250
## 6 68391610 10000 7952 10396 3 3990 1986-01-14 2.625 2.875 -2.7500
## 7 68391610 10000 7952 10396 3 3990 1986-01-15 2.750 3.000 -2.8750
## 8 68391610 10000 7952 10396 3 3990 1986-01-16 2.875 3.125 -3.0000
## 9 68391610 10000 7952 10396 3 3990 1986-01-17 2.875 3.125 -3.0000
## 10 68391610 10000 7952 10396 3 3990 1986-01-20 2.875 3.125 -3.0000
## vol ret bid ask shrout cfacpr cfacshr openprc numtrd retx
## 1 1000 NA NA NA 3680 1 1 NA NA NA
## 2 12800 -0.02439024 NA NA 3680 1 1 NA NA -0.02439024
## 3 1400 0.00000000 NA NA 3680 1 1 NA NA 0.00000000
## 4 8500 0.00000000 NA NA 3680 1 1 NA NA 0.00000000
## 5 5450 0.05000000 NA NA 3680 1 1 NA NA 0.05000000
## 6 2075 0.04761905 NA NA 3680 1 1 NA NA 0.04761905
## 7 22490 0.04545455 NA NA 3680 1 1 NA NA 0.04545455
## 8 10900 0.04347826 NA NA 3680 1 1 NA NA 0.04347826
## 9 8470 0.00000000 NA NA 3680 1 1 NA NA 0.00000000
## 10 1000 0.00000000 NA NA 3680 1 1 NA NA 0.00000000
The WRDS page remarks here that
IMPORTANT: Setting
n=10
limits the results to 10 records (also called observations). The tablecrsp.dsf
is CRSP’s Daily Stock File and tracks thousands of stocks over almost a hundred years - such a query that returns all records would take a very long time. In reality, most queries are far more specific, as shown in the examples below. It is highly recommended to develop your code using such a limit, then simply remove that limit (by by settingn =-1
) when you are ready to run your final code.
But that is precisely where dplyr
shines.
Here’s the dplyr
approach.
library(dplyr, warn.conflicts = FALSE)
dsf <- tbl(wrds, sql("SELECT * FROM crsp.dsf"))
dsf
## # Source: SQL [?? x 20]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## cusip permno permco issuno hexcd hsiccd date bidlo askhi prc vol
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 3672… 10001 7953 10398 2 4925 1992-08-20 12 12.5 -12.2 0
## 2 3672… 10001 7953 10398 2 4925 1992-08-21 12 12.5 -12.2 0
## 3 3672… 10001 7953 10398 2 4925 1992-08-24 12.5 12.5 12.5 100
## 4 3672… 10001 7953 10398 2 4925 1992-08-25 12 12.5 -12.2 0
## 5 3672… 10001 7953 10398 2 4925 1992-08-26 12.5 13 12.5 6267
## 6 3672… 10001 7953 10398 2 4925 1992-08-27 12.5 12.8 12.5 3600
## 7 3672… 10001 7953 10398 2 4925 1992-08-28 13 13 13 200
## 8 3672… 10001 7953 10398 2 4925 1992-08-31 13 13 13 1000
## 9 3672… 10001 7953 10398 2 4925 1992-09-01 12.5 13 13 720
## 10 3672… 10001 7953 10398 2 4925 1992-09-02 13 13.2 13.2 1600
## # … with more rows, and 9 more variables: ret <dbl>, bid <dbl>, ask <dbl>,
## # shrout <dbl>, cfacpr <dbl>, cfacshr <dbl>, openprc <dbl>, numtrd <dbl>,
## # retx <dbl>
The WRDS page continues:
Datasets often contain a large number of variables (column headers) such as
date
,ticker
,cusip
,price
, or a host of other values depending on the dataset. Limiting the number of variables returned in queries speeds up the execution time and decreases the size of the returned data. Once you looked at the metadata and see the available variables, you probably want to specify only those you are interested in. You can do this by specifying each variable to query explicitly using the select statement, instead of selecting all (using the asterisk*
which matches all variables).
Here is the WRDS code:
res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
from crsp.dsf")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
## cusip permno date bidlo askhi
## 1 36720410 10001 1992-08-20 12.0 12.50
## 2 36720410 10001 1992-08-21 12.0 12.50
## 3 36720410 10001 1992-08-24 12.5 12.50
## 4 36720410 10001 1992-08-25 12.0 12.50
## 5 36720410 10001 1992-08-26 12.5 13.00
## 6 36720410 10001 1992-08-27 12.5 12.75
## 7 36720410 10001 1992-08-28 13.0 13.00
## 8 36720410 10001 1992-08-31 13.0 13.00
## 9 36720410 10001 1992-09-01 12.5 13.00
## 10 36720410 10001 1992-09-02 13.0 13.25
But we already have dsf
defined above, so we can do this using dplyr
.
dsf %>%
select(cusip, permno, date, bidlo, askhi)
## # Source: lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## cusip permno date bidlo askhi
## <chr> <dbl> <date> <dbl> <dbl>
## 1 36720410 10001 2008-04-07 8.65 8.99
## 2 36720410 10001 2008-04-08 8.58 9.45
## 3 36720410 10001 2008-04-09 8.75 8.90
## 4 36720410 10001 2008-04-10 8.75 9.45
## 5 36720410 10001 2008-04-11 8.90 9.09
## 6 36720410 10001 2008-04-14 9.10 9.40
## 7 36720410 10001 2008-04-15 8.56 9.05
## 8 36720410 10001 2008-04-16 9.17 9.19
## 9 36720410 10001 2008-04-17 9 9.10
## 10 36720410 10001 2008-04-18 8.93 9.05
## # … with more rows
WRDS then discussing WHERE
clauses.
You can also further refine your query by selecting data that meets a certain criteria for one or more of these variables. You can limit the returned results to data which has an
askhi
value above 2500 and abidlo
value under 2000 as follows.
res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
from crsp.dsf
where askhi > 2500
and bidlo < 2000")
data <- dbFetch(res, n=10)
dbClearResult(res)
data
## [1] cusip permno date bidlo askhi
## <0 rows> (or 0-length row.names)
Again, dplyr
is more elegant.
dsf %>%
select(cusip, permno, date, bidlo, askhi) %>%
filter(askhi > 2500, bidlo < 2000)
## # Source: lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## # … with 5 variables: cusip <chr>, permno <dbl>, date <date>, bidlo <dbl>,
## # askhi <dbl>
It turns out that (as of 2017) there are only 27 records that match the above criteria.
Actually, it seems there are none!
Searching by Date
One of the more common methods of querying data is by date. WRDS uses the date notation convention of
yyyy-mm-dd
, so January 4th, 2013, the first trading day of 2013, would be formatted as2013-01-04
. Dates in your SQL queries must be surrounded by single quotes.
Well, this is ISO 8601 and PostgreSQL, not WRDS. But, fine.
To query by date:
res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
from crsp.dsf
where date = '2013-01-04'")
data <- dbFetch(res, n=-1)
dbClearResult(res)
And the dplyr
version:
dsf %>%
select(cusip, permno, date, bidlo, askhi) %>%
filter(date == '2013-01-04')
## # Source: lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## cusip permno date bidlo askhi
## <chr> <dbl> <date> <dbl> <dbl>
## 1 36720410 10001 2013-01-04 9.28 9.48
## 2 05978R10 10002 2013-01-04 2.92 2.92
## 3 00103110 10025 2013-01-04 59.2 61.0
## 4 46603210 10026 2013-01-04 64.5 65.8
## 5 29402E10 10028 2013-01-04 5.11 5.37
## 6 72913210 10032 2013-01-04 25.9 26.5
## 7 77467X10 10044 2013-01-04 10.9 11.2
## 8 41043F20 10051 2013-01-04 28.4 29.2
## 9 00621210 10065 2013-01-04 10.8 10.9
## 10 02713510 10100 2013-01-04 0.400 0.470
## # … with more rows
Searching by date ranges is similar, as shown in the following example.
To query by date range:
res <- dbSendQuery(wrds, "select cusip,permno,date,bidlo,askhi
from crsp.dsf
where date between '2013-01-07'
and '2013-01-08'")
data <- dbFetch(res, n=-1)
dbClearResult(res)
And with dplyr
:
dsf %>%
select(cusip, permno, date, bidlo, askhi) %>%
filter(between(date, '2013-01-07', '2013-01-08'))
## # Source: lazy query [?? x 5]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## cusip permno date bidlo askhi
## <chr> <dbl> <date> <dbl> <dbl>
## 1 36720410 10001 2013-01-07 9.30 9.48
## 2 05978R10 10002 2013-01-07 2.92 2.98
## 3 00103110 10025 2013-01-07 59.4 61.5
## 4 46603210 10026 2013-01-07 63.4 64.8
## 5 29402E10 10028 2013-01-07 5.09 5.38
## 6 72913210 10032 2013-01-07 25.7 26.3
## 7 77467X10 10044 2013-01-07 11.1 11.2
## 8 41043F20 10051 2013-01-07 28.2 28.8
## 9 00621210 10065 2013-01-07 10.9 10.9
## 10 02713510 10100 2013-01-07 0.400 0.420
## # … with more rows
Joining Data from Separate Datasets
Data from separate datasets can be joined and analyzed together. The following example will join the Compustat Fundamentals data set (
comp.funda
) with Compustat’s pricing dataset (comp.secm
), and then query for total assets and liabilities mixed with monthly close price and shares outstanding.To join and query two Compustat datasets:
res <- dbSendQuery(wrds, "select a.gvkey, a.datadate, a.at, a.lt, b.prccm, b.cshoq
from comp.funda a join comp.secm b
on a.gvkey = b.gvkey
and a.datadate = b.datadate
where a.tic = 'IBM'
and a.datafmt = 'STD'
and a.consol = 'C'
and a.indfmt = 'INDL'")
data <- dbFetch(res, n = -1)
dbClearResult(res)
head(data)
## gvkey datadate at lt prccm cshoq
## 1 006066 1962-12-31 2112.301 731.700 389.9996 NA
## 2 006066 1963-12-31 2373.857 782.119 506.9994 NA
## 3 006066 1964-12-31 3309.152 1055.072 409.4995 NA
## 4 006066 1965-12-31 3744.917 1166.771 498.9991 NA
## 5 006066 1966-12-31 4660.777 1338.149 371.4997 NA
## 6 006066 1967-12-31 5598.668 1767.067 626.9995 NA
But dplyr
does joins (actually, almost anything that can be written in SQL can be done elegantly using dplyr
).
In this case, I think the code can be rewritten to better reflect the flow of thinking, which is "get some fundamental data and also pull in a couple of elements of data from secm
.
funda <- tbl(wrds, sql("SELECT * FROM comp.funda"))
secm <- tbl(wrds, sql("SELECT * FROM comp.secm"))
prices <-
secm %>%
select(gvkey, datadate, prccm, cshoq)
data <-
funda %>%
filter(datafmt == 'STD', consol == 'C', indfmt == 'INDL') %>%
filter(tic == 'IBM') %>%
inner_join(prices, by=c("gvkey", "datadate")) %>%
select(gvkey, datadate, at, lt, prccm, cshoq)
data
## # Source: lazy query [?? x 6]
## # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
## gvkey datadate at lt prccm cshoq
## <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 006066 1962-12-31 2112. 732. 390. NA
## 2 006066 1963-12-31 2374. 782. 507. NA
## 3 006066 1964-12-31 3309. 1055. 409. NA
## 4 006066 1965-12-31 3745. 1167. 499. NA
## 5 006066 1966-12-31 4661. 1338. 371. NA
## 6 006066 1967-12-31 5599. 1767. 627. NA
## 7 006066 1968-12-31 6743. 2174. 315. NA
## 8 006066 1969-12-31 7390. 2113. 364. 114.
## 9 006066 1970-12-31 8539. 2592. 318. 115.
## 10 006066 1971-12-31 9576. 2934. 336. 116.
## # … with more rows