Appendix C — PostgreSQL

Throughout the book, we have used the WRDS PostgreSQL server. While the WRDS PostgreSQL server offers access to a large menagerie of data with very little set-up, there are limitations to using the WRDS server alone for research. The purpose of this appendix is to provide some ideas for how you might run your own PostgreSQL server.

C.1 Benefits of using relational databases

  1. Fast random access to on-disk data. Relational databases make it easy to index data, which makes it “alive” rather than “dead” in the sense we use here. We will see that accessing specific observations from large data sets is fast and easy with a relational database. Accessing stock-return data on crsp.dsf in the WRDS PostgreSQL database for Microsoft for a few dates is quite fast.1
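Indexes are what make this kind of random access fast. As a small illustration of the general principle, the following sketch uses SQLite (from the Python standard library) rather than PostgreSQL, with a made-up table loosely modelled on crsp.dsf; the query planner reports that it can seek directly on the index rather than scan the whole table.

```python
import sqlite3

# Illustration only: SQLite stands in for PostgreSQL here, and the
# table and values are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dsf (permno INTEGER, date TEXT, ret REAL)")
con.executemany("INSERT INTO dsf VALUES (?, ?, ?)",
                [(i % 1000, f"2020-01-{i % 28 + 1:02d}", 0.01)
                 for i in range(10000)])

# Without an index, the query below would scan all 10,000 rows;
# with one, the database seeks directly to the matching entries.
con.execute("CREATE INDEX dsf_idx ON dsf (permno, date)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM dsf WHERE permno = 42"
).fetchall()
print(plan[0][-1])  # the plan mentions dsf_idx, not a full table scan
```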

  2. Data accessible by any software package. Data stored in a PostgreSQL database are accessible from R, Stata, Python, Perl, … pretty much any software. This has subtle but important advantages for collaboration and multilingual programming. For example, you may be a Stata devotee who needs help preparing a data set from web data. Finding people with the skills to do this will be a lot easier if you don’t need them to have Stata skills, but can work with a Python guru. With a relational database, a Stata user is unlikely to care much whether the data coming out of PostgreSQL were put there using Python. As another example, you may have written some complicated Perl code to generate some data sets, and occasionally need to tweak the code to accommodate changes in the source data. As long as the data sets are stored in a PostgreSQL database, you will have no need to translate the entire codebase into a language you still use. This benefit is probably also the primary driver of WRDS’s decision to add a PostgreSQL database to its offerings. With more users wanting to use Python or R for analysis, sticking to a SAS-format-only approach was surely not viable.

  3. Data accessible from anywhere. Data on the WRDS PostgreSQL server can be accessed from anywhere with an internet connection (in the 2020s, this essentially means anywhere) by anyone with a WRDS account. For example, data stored on a PostgreSQL server in Boston can be easily accessed by coauthors in California, Australia, or Singapore.2

  4. Centralised processing power. In this book, we will see a number of cases where the heavy part of data processing for some tasks is accomplished on the WRDS PostgreSQL server even when executing R code locally. While SAS offers similar functionality with RSUBMIT, running queries on a remote PostgreSQL server from RStudio is much more seamless.

  5. Data can be (and should be) properly typed. PostgreSQL has a much richer set of data types than a typical statistical programming package. For example, everything in SAS is fundamentally of either a floating-point numeric type or a fixed-width character type. Things such as dates and times are essentially formatting overlays over these types, and such formatting is easy to break, sometimes causing serious issues in practice. (Stata is fairly similar.) In contrast, PostgreSQL offers standard types such as strings, floating-point numbers, integers, and large integers, along with timestamps with time zones, JSON, XML, and other types. Most of the commonly used types are recognized fairly seamlessly and converted to appropriate types when data are brought into R (e.g., timestamps with time zones).

    We say “can be” in the heading to this section because whether the data actually have appropriate types is a function of how much care was used in getting the data into the database in the first place. For example, at the time of writing, WRDS is often fairly careless about data types. So variables that are naturally integers, such as firm identifiers like CIKs or PERMNOs, are double precision rather than integer types.3 Text variables are generally character varying, even though PostgreSQL’s text type is simpler and equally efficient. And timestamps are invariably of type timestamp without time zone when they arguably should always be timestamp with time zone, as a timestamp has no clear meaning without a time zone.
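To see why a timestamp without a time zone has no clear meaning, consider the same wall-clock time read in two different zones. The following Python sketch uses a made-up date and fixed UTC offsets for illustration:

```python
from datetime import datetime, timezone, timedelta

# The same "timestamp without time zone" ...
naive = datetime(2023, 6, 30, 16, 0)

# ... interpreted in New York (UTC-4 in summer) and Sydney (UTC+10)
ny = naive.replace(tzinfo=timezone(timedelta(hours=-4)))
sydney = naive.replace(tzinfo=timezone(timedelta(hours=10)))

# The two readings refer to instants 14 hours apart
gap_hours = (ny - sydney).total_seconds() / 3600
print(gap_hours)  # 14.0
```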

  6. Textual data are handled with ease. Stata says that “Unicode is the modern way that computers encode characters such as the letters in the words you are now reading.” Yet prior to Stata 14, Stata did not support it. SAS files provided by WRDS have traditionally been encoded in some kind of Latin encoding. And SAS and Stata have limitations with regard to the size of text fields.

    In contrast, PostgreSQL uses Unicode by default, as do Python and R, and there is effectively no constraint on the size of text fields in PostgreSQL.
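A quick Python illustration of the difference: a UTF-8 round-trip is lossless, while a Latin-1 encoding simply cannot represent many characters (the example string is made up):

```python
name = "Škoda 中文"   # made-up string mixing Latin-2 and CJK characters

# UTF-8 round-trips without loss
assert name.encode("utf-8").decode("utf-8") == name

# Latin-1 cannot represent these characters at all
try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```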

C.2 Setting up a personal server

Setting up your own PostgreSQL server overcomes some of the disadvantages of using the WRDS server.

One disadvantage is that, because WRDS only allows read-only connections, one cannot store computed results on the WRDS PostgreSQL server. There are two types of computed tables that one can create. Temporary tables, which are created using the compute() function from dplyr, can dramatically simplify queries and increase performance, but disappear once you disconnect from the database. Permanent tables, which can be created using the compute() function with the temporary = FALSE argument, can be useful to break the research process into several discrete steps.

Another disadvantage of the WRDS server is that it limits our ability to use our own data on the WRDS server. While we can use the copy_inline() function from the dbplyr package to create something similar to a temporary table on the server using local data, this will not be the best approach if our local data sets are large.4 And while we could download crsp.dsf and do the analysis on our local computer, this is likely to be painful even if we have enough RAM to store crsp.dsf.

An alternative approach that builds on the framework used in this book is to create your own PostgreSQL database and use that. While there are some set-up and maintenance costs associated with this, it’s actually fairly manageable using the steps we describe below.

C.2.1 Installation steps

  1. Install Python. You may already have Python installed. If not, Miniconda is a good option.
  2. Install the wrds2pg package.
pip install wrds2pg --upgrade
  3. Install PostgreSQL and initialize the server.

Installation of PostgreSQL varies by platform (and on some platforms there are multiple approaches).5 A good place to start is the PostgreSQL webpage.

  4. Create a database.

You may have already done this in the previous step. If not, and you have the server running, you could do this using SQL:

CREATE DATABASE wrds;
  5. Set up environment variables and so on for wrds2pg.

See the wrds2pg documentation for details.

export PGHOST="localhost"
export PGDATABASE="wrds"
export WRDS_ID="iangow"
export PGUSER="igow"

You might also use the environment variable PGPASSWORD:

export PGPASSWORD="password"

But it is probably better to use a password file.
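A password file is a plain-text file at ~/.pgpass (it must have permissions 0600 on Linux and macOS), with one line per connection in the format hostname:port:database:username:password. For the settings above, it might look like this:

```
localhost:5432:wrds:igow:password
```

psql and other clients built on libpq will then supply the password automatically, so it never needs to appear in scripts or environment variables.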

  6. Run Python.

  7. Within Python, use the wrds2pg module to get data.

from wrds2pg import wrds_update
wrds_update("dsi", "crsp", dbname="wrds", host="localhost")

C.2.2 Getting the tables used here

Following the steps above, the script below can be used to get the tables used in this book. Note that a few tables require special handling.

Audit Analytics tables are merged (seemingly by WRDS) to create very large tables in which most of the variables relate to financial statement items. Most researchers do not use these variables, so the code below uses a SAS wildcard to drop them (note that wrds2pg actually runs SAS code on the WRDS server behind the scenes). While modifying these tables in this way breaks the idea of the local tables closely reflecting their WRDS counterparts, the benefit of having a cleaner, leaner table seems worthwhile in this case.

Some CRSP and Compustat tables have special missing values that SAS’s PROC EXPORT procedure turns into simple character values, which PostgreSQL cannot accept in non-text fields. The fix_missing = True argument to wrds_update converts such special missing values into regular missing values.
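Conceptually, the clean-up works something like the sketch below. This is only an illustration of the idea, not the actual wrds2pg implementation, and the function name clean_missing is made up:

```python
# SAS special missing values (.A through .Z and ._) can be exported
# as bare letters or an underscore in otherwise numeric columns.
SPECIAL_MISSING = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ_")

def clean_missing(value):
    """Map SAS special missing codes to an ordinary missing value (None)."""
    if isinstance(value, str) and value in SPECIAL_MISSING:
        return None
    return value

print([clean_missing(v) for v in [0.05, "A", -0.01, "_"]])
# [0.05, None, -0.01, None]
```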

The script also fixes variable types that are not well-formatted in the original SAS files (e.g., permno should be integer).

Finally, the script creates indexes, which dramatically increase the performance of table joins and filter() operations.

Note that the script below likely takes a few hours to run, primarily because of crsp.dsf, which is about 20GB of data. However, subsequent runs of the script will only download data if the SAS data file on WRDS has been updated and so will usually run much faster.

#!/usr/bin/env python3
from wrds2pg import wrds_update, make_engine, process_sql

engine = make_engine()

# Audit Analytics
updated = wrds_update("auditnonreli", "audit", 
                      drop="prior: match: closest: disc_text:",
                      col_types = {"res_accounting": "boolean",
                                   "res_fraud": "boolean", 
                                   "res_cler_err": "boolean",
                                   "res_adverse": "boolean", 
                                   "res_improves": "boolean", 
                                   "res_other": "boolean",
                                   "res_sec_invest": "boolean",
                                   "res_begin_aud_fkey": "integer", 
                                   "res_notif_key": "integer", 
                                   "current_aud_fkey": "integer", 
                                   "res_end_aud_fkey": "integer", 
                                   "file_date_aud_fkey": "integer"})

if updated:
    process_sql("CREATE INDEX ON audit.auditnonreli (res_notif_key)", engine)

# CRSP
updated = wrds_update("ccmxpf_lnkhist", "crsp", fix_missing=True,
                      col_types = {'lpermno': 'integer', 
                                   'lpermco': 'integer'})
if updated:
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (gvkey)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermno)", engine)
    process_sql("CREATE INDEX ON crsp.ccmxpf_lnkhist (lpermco)", engine)

dsf = wrds_update("dsf", "crsp", fix_missing=True,
                  col_types = {"permno": "integer",
                               "permco": "integer"})
if dsf:
    process_sql("CREATE INDEX ON crsp.dsf (permno, date)", engine)
    process_sql("CREATE INDEX ON crsp.dsf (permco)", engine)

dsi = wrds_update("dsi", "crsp")
if dsi:
    process_sql("CREATE INDEX ON crsp.dsi (date)", engine)

wrds_update("comphist", "crsp", fix_missing=True)

dsedelist = wrds_update("dsedelist", "crsp", fix_missing=True,
                        col_types = {'permno':'integer', 'permco': 'integer'})
                        
if dsedelist:
    process_sql("CREATE INDEX ON crsp.dsedelist (permno)", engine)
    
dseexchdates = wrds_update("dseexchdates", "crsp",
                           col_types = {'permno':'integer', 
                                        'permco': 'integer'})
if dseexchdates:
    process_sql("CREATE INDEX ON crsp.dseexchdates (permno)", engine)
    
erdport1 = wrds_update("erdport1", "crsp", fix_missing = True,
                       col_types = {'permno':'integer', 'capn': 'integer'})

if erdport1:
    process_sql("CREATE INDEX ON crsp.erdport1 (permno, date)", engine)

updated = wrds_update("msf", "crsp", fix_missing=True,
                      col_types = {"permno": "integer",
                                   "permco": "integer"})
if updated:
    process_sql("CREATE INDEX ON crsp.msf (permno, date)", engine)

updated = wrds_update("msi", "crsp")
if updated:
    process_sql("CREATE INDEX ON crsp.msi (date)", engine)

mse = wrds_update("mse", "crsp", fix_missing=True,
                  col_types = {"permno": "integer",
                               "permco": "integer"})

stocknames = wrds_update("stocknames", "crsp",
                         col_types = {"permno": "integer",
                                      "permco": "integer"})
dsedist = wrds_update("dsedist", "crsp", fix_missing=True,
                      col_types = {'permno':'integer',
                                   'permco':'integer'})
if dsedist:
    process_sql("CREATE INDEX ON crsp.dsedist (permno)", engine)
    
updated = wrds_update("factors_daily", "ff")

# Compustat
updated = wrds_update("company", "comp")
if updated:
    process_sql("CREATE INDEX ON comp.company (gvkey)", engine)

updated = wrds_update("funda", "comp", fix_missing = True)
if updated:
    process_sql("CREATE INDEX ON comp.funda (gvkey)", engine)

updated = wrds_update("fundq", "comp", fix_missing = True)
if updated:
    process_sql("CREATE INDEX ON comp.fundq (gvkey, datadate)", engine)
    
updated = wrds_update("r_auditors", "comp")

updated = wrds_update("idx_daily", "comp")
if updated:
    process_sql("CREATE INDEX ON comp.idx_daily (gvkeyx)", engine)

updated = wrds_update("aco_pnfnda", "comp")
if updated:
    process_sql("CREATE INDEX ON comp.aco_pnfnda (gvkey, datadate)", engine)

# The segment data is in comp_segments_hist_daily in PostgreSQL,
# but in compsegd in SAS, so we need to specify sas_schema.
wrds_update("seg_customer", "comp_segments_hist_daily",
            sas_schema="compsegd")
wrds_update("names_seg", "comp_segments_hist_daily",
            sas_schema="compsegd")

C.3 Setting up a shared server

While setting up a personal PostgreSQL server gives you more flexibility, greater potential can be realized if you can share your server with collaborators. For example, you might write Python code to perform textual analysis on SEC filings, then have a co-author pull the data from the database directly into Stata for regression analysis. Having data in a single database is in many ways superior to solutions such as Dropbox, which are not really set up for sharing multi-gigabyte data files. This section describes some details of sharing a database.

C.3.1 Opening up the database

By default, a PostgreSQL cluster is not open to connections from other computers, and we need to edit two configuration files to change this.

  1. Issue two SQL commands to locate the two configuration files (pg_hba.conf and postgresql.conf) that we need to edit. Here we use the command-line program for accessing PostgreSQL (psql), but anything that allows you to execute SQL against the server should work.
(base) brontegow@brontegow-ubuntu-mate:~$ psql
psql (13.2)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# SHOW hba_file ;
              hba_file               
-------------------------------------
 /etc/postgresql/13/main/pg_hba.conf
(1 row)

crsp=# SHOW config_file ;
               config_file               
-----------------------------------------
 /etc/postgresql/13/main/postgresql.conf
(1 row)
  2. Add the following line to the end of pg_hba.conf:
host all all 0.0.0.0/0 md5

You may need to edit the file as a super user (e.g., sudo vim /etc/postgresql/13/main/pg_hba.conf).
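The fields in this line are the connection type, database, user, client address range, and authentication method, so the entry allows password-authenticated connections to all databases, by all users, from any IPv4 address:

```
# TYPE  DATABASE  USER  ADDRESS    METHOD
host    all       all   0.0.0.0/0  md5
```

On recent PostgreSQL versions, scram-sha-256 is generally preferred to md5 as the password authentication method.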

  3. Change listen_addresses in postgresql.conf.

There should already be a line related to listen_addresses in the file, but it is likely commented out (i.e., it starts with #). Uncomment the line and edit it so that PostgreSQL listens on all addresses. After editing, the line should look like this:

listen_addresses = '*'

Again, you may need to edit the file as a super user (e.g., sudo vim /etc/postgresql/13/main/postgresql.conf).

After making these two edits, it will be necessary to restart the PostgreSQL server. The details of how to do this depend on how you installed the server, so consult instructions specific to your installation (in the worst case, restarting the computer should do the trick).

C.3.2 Creating user logins

Suppose that Bronte has a co-author named Gemma to whom she wants to grant access to the database. Note that the wrds_update function automatically creates two roles for each schema in which it creates tables. For example, in creating a local copy of crsp.dsf, which is stored in the schema crsp, it creates the role crsp, which owns the data (and can therefore delete or modify it), and the role crsp_access, which has only read access to the data. Similar roles are created for data in the audit and comp schemas.

In this case, we want to create the role gemma and give her read-only access to the data in the three schemas: crsp, comp, and audit. The following commands achieve this:

(base) brontegow@brontegow-ubuntu-mate:~$ psql
psql (13.2 (Ubuntu 13.2-1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# CREATE USER gemma ENCRYPTED PASSWORD 'bronte2006';
CREATE ROLE
crsp=# GRANT crsp_access TO gemma;
GRANT ROLE
crsp=# GRANT comp_access TO gemma;
GRANT ROLE
crsp=# GRANT audit_access TO gemma;
GRANT ROLE

C.3.3 Testing access

Now Gemma should have access to the database. We need to give Gemma the IP address for the server and with that information, she should be able to access the data. In the following example, the IP address for the server is assumed to be 108.26.187.8.

psql -h 108.26.187.8 -d wrds -U gemma
Password for user gemma: 
psql (13.2 (Ubuntu 13.2-1.pgdg20.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=> SELECT date, vwretd FROM crsp.dsi LIMIT 10;
    date    |  vwretd   
------------+-----------
 1925-12-31 |          
 1926-01-02 |  0.005689
 1926-01-04 |  0.000706
 1926-01-05 | -0.004821
 1926-01-06 | -0.000423
 1926-01-07 |  0.004988
 1926-01-08 | -0.003238
 1926-01-09 |  0.002209
 1926-01-11 |  -0.00854
 1926-01-12 | -0.000929
(10 rows)

Note that the above assumes that the machine on which the database is hosted can be accessed via an IP address (or a URL). If you are using a home computer, there is a good chance that your IP address is not fixed, so the following steps may be necessary to give access to others outside your home network:

  1. Fix the local IP address of your server. For example, I have one computer set to 192.168.1.3.
  2. Direct all traffic on the port for your database to this computer. This is accomplished via the port-forwarding settings of your internet router. For example, the default port for PostgreSQL is 5432 and we might forward all traffic on this port to 192.168.1.3.
  3. Get a domain name. For example, a service such as GoDaddy can tell us what domains are available to acquire. At the time of writing brontesdb.com is available and would cost $12 for the first year.
  4. Use a dynamic DNS service (such as dyn.com) to make sure that your chosen domain name (brontesdb.com in our example) always points to your IP address.

After taking the steps above, Gemma could now connect to our database like this:

psql -h brontesdb.com -d wrds -U gemma

Note that Gemma could change her password to something more to her liking using SQL such as the following:

ALTER ROLE gemma PASSWORD 'gemma2007';

Because the password is stored encrypted on the database server, Bronte cannot see the password that Gemma has chosen.


  1. To illustrate, pulling returns and prices for Microsoft using a remote query to the WRDS PostgreSQL server takes 27 milliseconds. Most of this is due to remote access; a local query takes less than one millisecond.↩︎

  2. This book only requires PostgreSQL data stored in the WRDS database, but we include a brief overview of the steps required to set up one’s own research server in Appendix C.↩︎

  3. This can have significant implications for performance.↩︎

  4. As can be seen here, this function was added to dbplyr after a request related to the creation of this book.↩︎

  5. On Ubuntu, we install using, say, sudo apt install postgresql-14; on macOS, Postgres.app provides a very easy-to-manage PostgreSQL implementation.↩︎