Appendix E — Making a parquet repository

Throughout the book we have relied on the WRDS PostgreSQL server as our main source of larger data sets, primarily from CRSP and Compustat. However, there are downsides to relying on the WRDS PostgreSQL server to this degree. One (minor) downside is that one needs to be online to get data. Another is that we quickly run into limits when we have large amounts of non-WRDS data that need to be combined with WRDS data for analysis. Finally, many researchers would prefer to have greater clarity on the underlying data sets than is provided by an approach that relies on the ever-changing data sets on WRDS.

In Appendix D, we considered an approach that uses a user-created PostgreSQL server. While this has the merit of using the same underlying approach used by WRDS and the rock-solid backbone provided by PostgreSQL, it does require the user to invest in understanding a technology at a deeper level than many researchers would be comfortable with.

In Chapter 23, we introduced an approach that used parquet data files as a foundation for analysis. This approach, based on individual data files, is likely to be more transparent to many researchers accustomed to storing data in individual data files. This may lead many researchers to ask if such an approach could be extended to handle data from CRSP and Compustat more generally.

The answer is “yes” and this appendix outlines some options for researchers interested in pursuing such an approach.

The code in this appendix uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2.
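
These are the packages whose functions appear in the code below; DuckDB and RPostgres are used via ::, so they need to be installed but are not attached here.

library(DBI)
library(dplyr)
library(dbplyr)     # window_order()
library(stringr)    # str_c(), str_detect()
library(lubridate)  # year()
library(ggplot2)
library(farr)       # pg_to_parquet(), load_parquet()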

E.1 Data management approaches

Once upon a time, research computing was largely done on shared computers managed by others. Data of the kind provided by CRSP and Compustat would have resided on these shared computers and often researchers would have had limited allocations for data storage. For example, many academics would have relied on the servers provided by WRDS to conduct research. If WRDS updated Compustat or CRSP, then the results from running a researcher’s code might change.

As computing power and storage became greater and cheaper, many researchers came to conduct research on their personal computers. It is quite feasible to maintain local copies of major CRSP and Compustat data tables on the hard drive of one’s laptop. In fact, one could keep multiple copies, say one per project.

E.2 Organizing data

For brevity, we assume that you have elected to maintain a core repository of data files (e.g., Compustat, CRSP) shared across projects, along with project-specific data files. The shared data might also include data from non-WRDS sources, such as conference call files from StreetEvents or FactSet or data obtained from SEC EDGAR. The project-specific data files might be hand-collected or scraped data relevant to a specific project.

This approach to data organization is easily implemented using schemas in a relational database, such as PostgreSQL or MariaDB. An obvious choice for schema names for WRDS-derived data is the name of the corresponding schema in the WRDS PostgreSQL database (e.g., crsp for CRSP data and comp for Compustat data).

But if we want to maintain a file system–based approach rather than setting up a relational database, we might choose to organize our data files into directories. With such an approach, we could mimic the schemas discussed above: tables from CRSP go in crsp, tables from Compustat in comp, and so on. If we wanted a project-specific version of crsp.dsf for a project stored in mega_paper, we could just put that in the mega_paper directory.
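
For example, a repository organized along these lines might look like the following (a purely illustrative layout, using tables that appear later in this appendix):

pq_data/
    comp/
        funda.parquet
    crsp/
        dsf.parquet
        msi.parquet
    mega_paper/
        dsf.parquet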

Having elected to choose a data file approach to storing data, a natural question arises about the format to use. Many Stata users would use Stata data files (.dta). R users might use RDS files (.rds). SAS users typically keep SAS data files (.sas7bdat). However, .dta and .rds files are a bad choice for large files.1 Loading these files into Stata or R means reading the whole file off disk into RAM. This takes time and uses up memory. Storing crsp.dsf as dsf.dta or dsf.rds is likely to result in a file that is several gigabytes on disk and even more in memory.

The approach we take in this appendix uses parquet files, which we first saw in Chapter 23. Parquet files use a format optimized for data analysis, have a rich type system, and are created in a way that makes it easier to get just the data we need. Additionally, parquet files are column-oriented, which typically leads to better performance for data analysis than row-oriented formats, such as CSV files or database tables. Finally, parquet files are typically much smaller than the equivalent CSV file.

Relative to text files or most package-specific storage formats, parquet files offer most of the benefits of relational databases:

  1. Fast random access to on-disk data.
  2. Data accessible by any software package. This is almost true for parquet files, but not quite. Stata cannot really work with on-disk data and has no native support for parquet data.
  3. Data can be (and should be) properly typed.
  4. Textual data are handled with ease.
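
To get a sense of the first two points, here is a minimal sketch that pulls two columns and a handful of rows straight off disk with DuckDB. It assumes a crsp/msi.parquet file like the one we create in Section E.4 below, that the DATA_DIR environment variable introduced there is set, and that vwretd (the value-weighted index return) is among the columns of crsp.msi.

library(DBI)
library(stringr)

db <- dbConnect(duckdb::duckdb())
# Query the parquet file directly; only the needed columns and rows are read
msi_path <- file.path(Sys.getenv("DATA_DIR"), "crsp/msi.parquet")
dbGetQuery(db, str_c("SELECT date, vwretd FROM '", msi_path, "' ",
                     "WHERE date >= '2020-01-01'"))
dbDisconnect(db, shutdown = TRUE)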

A couple of potential advantages of a PostgreSQL server are lost with this approach. First, it’s slightly more difficult to centralize computing in the way that can often be achieved with a database server. Second, data are not “accessible from anywhere” in quite the same way. However, parquet files offer benefits without requiring a database server to be set up, a complex step for many users.

E.3 Canonical WRDS data

There are two underlying sources of WRDS data:

  1. The SAS files on the WRDS cloud server
  2. The WRDS PostgreSQL database

For many years (or even decades), the canonical WRDS data were those in SAS files on the WRDS servers. These SAS files formed the basis of the web queries familiar to many researchers.2

In recent years, WRDS has offered a PostgreSQL server that provides a more platform-neutral offering for researchers using more modern languages, such as Python or R. While the PostgreSQL server was initially constructed from the SAS data files, it seems possible that some data are ingested by the PostgreSQL server directly without an intermediate step using SAS data formats.

WRDS highlights support for four languages: R, Python, Stata, and SAS. For R, Python, and Stata, it is assumed that users will access WRDS data via the PostgreSQL server. Only for SAS is it assumed that the SAS data files will be used directly.

Initially, the SAS data files appeared to be the canonical data sources. Some data sets were not covered by the PostgreSQL database (e.g., TAQ data). Other data sets were in the PostgreSQL server, but with some data loss. For example, non-ASCII text in SAS data files was lost in early versions of the PostgreSQL database; this appears no longer to be the case.

At this stage, it seems that both the WRDS SAS data files and the WRDS PostgreSQL database can be viewed as canonical sources for WRDS data. Nonetheless there are minor differences between these data sources. First, the underlying schema names can differ in some cases.

Second, there may be differences in the underlying data types. As discussed in Appendix C, SAS has only two underlying data types and is therefore much less strict about what data goes into each column. The more specific data types in SAS are actually implemented as presentations of the underlying data. Nonetheless some data may be presented as one type in SAS and another type in PostgreSQL. For example, permno is presented as a whole number in SAS, but as a floating-point value in PostgreSQL. Practically, this makes little difference.3

One data source that is perhaps incorrectly viewed as a canonical data source is the web query interface offered by WRDS. The web query interface leaves much to be desired from a reproducibility perspective as manual steps are needed to extract data and often it is necessary to eliminate variables or periods to manage the size of the resulting files. Additionally, the relation between the data returned by some queries and the underlying tables in SAS or PostgreSQL is not always clear. Some web queries appear to join data from multiple underlying tables in ways that are not always transparent.

E.4 Converting WRDS data to parquet

In this section we discuss two practical approaches that readers could use to create their own parquet-based repositories of WRDS data.
To keep the repository we create organized, we will put it under a single directory, which we flag using the DATA_DIR environment variable. As we need to interact with WRDS, we need to communicate our WRDS ID to the code, which we can also do using an environment variable.

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/",
           WRDS_ID = "iangow")

E.4.1 Approach 1: Get WRDS PostgreSQL data using DuckDB

One approach to getting WRDS data is to connect to the PostgreSQL server much as we have done throughout the book. Having made this connection, we can easily copy the data to a local DuckDB database, and from there it is easy to write the data to parquet files on our local storage.

For the reasons discussed in Section 6.1, we use environment variables to give DuckDB the information it needs to connect to the WRDS PostgreSQL server.

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGDATABASE = "wrds", 
           PGUSER = "iangow", 
           PGPORT = 9737)

We can now connect to the WRDS PostgreSQL server (pg here) and to a DuckDB instance (db). If we want to copy the crsp.msi table from WRDS to our local machine, we simply create msi, a remote data table, then copy this to db using the copy_to() function provided by dplyr. We name the table (name = "msi") so that we can refer to it in subsequent queries. Then, we have DuckDB save the data to a file named msi.parquet (DuckDB will infer that we want a parquet file from the .parquet file extension). Finally, we disconnect from both databases (pg and db).

pg <- dbConnect(RPostgres::Postgres())
db <- dbConnect(duckdb::duckdb())
msi <- tbl(pg, Id(schema = "crsp", table = "msi"))
copy_to(db, df = msi, name = "msi", overwrite = TRUE)

if (!dir.exists(file.path(Sys.getenv("DATA_DIR"), "crsp"))) {
  dir.create(file.path(Sys.getenv("DATA_DIR"), "crsp"))
}
file_name <- file.path(Sys.getenv("DATA_DIR"),
                       "crsp/msi.parquet")
dbExecute(db, str_c("COPY msi TO '", file_name, "'"))
[1] 1177
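
Finally, as described above, we disconnect from both databases:

dbDisconnect(pg)
dbDisconnect(db, shutdown = TRUE)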

This basic approach is taken by the pg_to_parquet() function provided as part of the farr package. As we have the PostgreSQL connection details for WRDS and the DATA_DIR environment variable set up, we can issue a single line of code to get a local copy of crsp.msi.

pg_to_parquet("msi", "crsp")

The main downside of this approach is that it is very memory-intensive with large tables.

E.4.2 Approach 2: Get WRDS SAS data using Python

In Appendix D, we saw how we can use the wrds2pg Python library to import WRDS SAS data into a local PostgreSQL server with the wrds_update() function. The wrds2pg library also offers wrds_update_pq(), a function modelled after wrds_update() that uses SAS code to generate data that are saved as a parquet file.

To install the required version of wrds2pg use the following command from the command line.4

pip install wrds2pg --upgrade

Assuming we have the necessary environment variables set up (e.g., WRDS_ID and DATA_DIR), creating a parquet file from a WRDS SAS file is straightforward. (Note that the following lines are Python code.)

from wrds2pg import wrds_update_pq
wrds_update_pq("msi", "crsp")

If you don’t have the DATA_DIR environment variable set, you can supply it to the wrds_update_pq() function directly:

wrds_update_pq("msi", "crsp", data_dir = "~/Dropbox/pq_data/")

One benefit of this Python-based approach over the first approach is that we can leverage the features of wrds2pg, such as the ability to specify variable types, rename and drop variables, and so on. Additionally, importing from WRDS SAS data is fast and often results in better typed data than that provided by the WRDS PostgreSQL database. Finally, while wrds_update_pq() does require a significant amount of RAM, it places less of a burden than the R approach using pg_to_parquet() above. The main downside is the need to use Python.
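
For example, the following call (a sketch drawn from the full script in Section E.6.1 below) has wrds_update_pq() store permno and permco as integers and convert SAS’s special missing-value codes:

from wrds2pg import wrds_update_pq

# Store PERMNO and PERMCO as integers; convert special missing values
wrds_update_pq("msf", "crsp", fix_missing=True,
               col_types={"permno": "integer", "permco": "integer"})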

Scripts to obtain all tables needed to run the code in this book are provided below in Section E.6.1 and Section E.6.2.

E.4.3 Approach 3: Get WRDS PostgreSQL data using Python
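
A third approach uses the db2pq Python library, which provides a wrds_update_pq() function analogous to the one offered by wrds2pg, but which gets its data from the WRDS PostgreSQL server rather than from the SAS data files. To install the required version of db2pq, use the following command from the command line.

pip install db2pq --upgrade

Assuming the WRDS_ID and DATA_DIR environment variables are set, getting a table again takes a single line, as in the following Python sketch. (Section E.6.2 provides a script covering all the tables used in this book.)

from db2pq import wrds_update_pq
wrds_update_pq("msi", "crsp")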

E.5 Working with parquet files

To illustrate how we can work with these parquet files, we retrace some of the steps we took in Chapter 21. Here we assume that you have parquet versions of the three data tables in your repository, perhaps as a result of running either the Python code or the R code below. (Downloading comp.funda will take a few minutes.)

Python code

As the segment data sets are in the compsegd library in SAS, but we want to store them under compseg (the schema used by the R code and by load_parquet() below), we specify sas_schema="compsegd" so that the function can find the SAS data. As the comp.funda SAS data file includes special missing value codes that are exported by SAS as text values, we need to use fix_missing = True to convert these to missing values.

from wrds2pg import wrds_update_pq

wrds_update_pq("seg_customer", "comp_segments_hist_daily", 
                sas_schema="compsegd")
wrds_update_pq("names_seg", "comp_segments_hist_daily",
                sas_schema="compsegd")
wrds_update_pq("funda", "comp", fix_missing = True)

R code

The R code is simpler, but pg_to_parquet("funda", "comp") will take longer, require more RAM, and be less robust than the wrds_update_pq() approach.

pg_to_parquet("seg_customer", "compseg")
pg_to_parquet("names_seg", "compseg")
pg_to_parquet("funda", "comp")

E.5.1 Replicating analysis using parquet files

Assuming you now have the three data tables, we can proceed with some analysis from Chapter 21. We start by creating an in-memory DuckDB database.

db <- dbConnect(duckdb::duckdb())

We next use load_parquet() from the farr package to load the tables we created above into our DuckDB database, db. By default, load_parquet() will use the environment variable DATA_DIR to locate the repository of parquet files. If DATA_DIR has not been set, we can do that now.

Sys.setenv(DATA_DIR = "~/Dropbox/pq_data/")

seg_customer <- load_parquet(db, "seg_customer", "compseg")
names_seg <- load_parquet(db, "names_seg", "compseg")
funda <- load_parquet(db, "funda", "comp")

As an alternative to setting the DATA_DIR environment variable, you can supply the location of the repository of parquet files to the load_parquet() function directly as the data_dir argument.

load_parquet(db, "funda", "comp", data_dir = "~/Dropbox/pq_data/")

The next two tables come with the farr package, but need to be copied to db.

undisclosed_names <- copy_to(db, undisclosed_names)
llz_2018 <- copy_to(db, llz_2018)

The next block of code is essentially unchanged from the code in Chapter 21. The two changes are:

  • We do not use collect(), as we want the data to remain in our database as long as possible. In the original code, we needed to collect() data to merge with undisclosed_names and llz_2018. These are now all in a single database.
  • We add na.rm = TRUE in a couple of places to suppress warnings; SQL aggregate functions always behave as though na.rm = TRUE.

customers <-
  seg_customer |>
  filter(ctype == "COMPANY") 

disclosure_raw <-
  customers |>
  filter(between(datadate, "1994-01-01", "2010-12-31")) |>
  left_join(undisclosed_names, by = "cnms") |>
  mutate(disclosed = coalesce(disclosed, TRUE)) |>
  select(gvkey, datadate, cnms, salecs, disclosed) 

sales <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  select(gvkey, datadate, sale) 

prin_cust_df <-
  disclosure_raw |> 
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |> 
  filter(!is.na(salecs), sale > 0) |>
  summarize(prin_cust = max(salecs / sale, na.rm = TRUE), 
            .groups = 'drop') |>
  mutate(has_prin_cust = prin_cust >= 0.1)

disclosure <-
  disclosure_raw |> 
  inner_join(sales, by = c("gvkey", "datadate")) |>
  semi_join(llz_2018, by = "gvkey") |>
  group_by(gvkey, datadate) |>
  summarize(ratio = mean(as.double(!disclosed), na.rm = TRUE),
            ratio_sale = sum(as.double(!disclosed) * salecs, na.rm = TRUE) / 
              sum(salecs, na.rm = TRUE),
            .groups = "drop") |>
  mutate(year = year(datadate))
disclosure |> 
  summarize(across(c(ratio, ratio_sale), 
                   \(x) mean(x, na.rm = TRUE))) |>
  collect()
# A tibble: 1 × 2
  ratio ratio_sale
  <dbl>      <dbl>
1 0.459      0.464

E.5.2 Working with larger parquet files

The power of parquet files becomes even more apparent when working with larger data sets, such as crsp.dsf. Here we assume that you have a copy of crsp.dsf in your repository, perhaps as a result of running Python code like the following. Downloading crsp.dsf will take 30 minutes or more. The resulting data file will be over 3 GB (much smaller than the original 16 GB SAS data file).

from wrds2pg import wrds_update_pq
wrds_update_pq("dsf", "crsp")

The following code performs several analytical tasks. First, we count the number of rows in the table. Second, we count the number of rows by year. Third, we extract the data for a particular date. Next, we identify the PERMNO for Apple, then use it to plot Apple’s cumulative returns over time.

Completing all these tasks takes just a few seconds!

now <- Sys.time()
db <- dbConnect(duckdb::duckdb())

dsf <- load_parquet(db, "dsf", "crsp")
stocknames <- load_parquet(db, "stocknames", "crsp")

dsf |> 
  count() |> 
  collect()
# A tibble: 1 × 1
          n
      <dbl>
1 105258380
dsf |> 
  mutate(year = year(date)) |>
  count(year) |>
  arrange(desc(year)) |>
  collect()
# A tibble: 99 × 2
    year       n
   <dbl>   <dbl>
 1  2023 2353668
 2  2022 2389874
 3  2021 2187044
 4  2020 1948489
 5  2019 1911581
 6  2018 1869102
 7  2017 1827464
 8  2016 1828303
 9  2015 1820106
10  2014 1770410
# ℹ 89 more rows
dsf_subset <-
  dsf |>
  filter(date == "1986-01-07") |>
  collect()

apple_permno <-
  stocknames |> 
  filter(str_detect(comnam, "^APPLE COM")) |> 
  pull(permno)

dsf |> 
  filter(permno == apple_permno) |>
  group_by(permno) |>
  window_order(date) |>
  mutate(cumret = exp(cumsum(log(1 + coalesce(ret, 0))))) |>
  ggplot(aes(x = date, y = cumret)) +
  geom_line()

Sys.time() - now
Time difference of 7.264827 secs
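
Once we are done, we can disconnect from the database, shutting down the DuckDB instance:

dbDisconnect(db, shutdown = TRUE)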

E.6 Create a parquet library

E.6.1 Using wrds2pg

  1. Install Python. You may already have Python installed. If not, Miniconda is one option.

  2. Install the wrds2pg package.

pip install wrds2pg --upgrade

  3. Set up environment variables for wrds2pg (see the wrds2pg documentation for details).

export WRDS_ID="iangow"
export DATA_DIR="~/Dropbox/pq_data"

  4. Run Python.

  5. Within Python, use the wrds2pg module to get data. Running the following script creates a repository comprising all the data sets used in this book. (If you have already downloaded some tables, you might want to comment those out so that you don’t download them again.)

#!/usr/bin/env python3
from wrds2pg import wrds_update_pq

# CRSP
wrds_update_pq('ccmxpf_lnkhist', 'crsp', fix_missing=True,
               col_types={'lpermno': 'integer',
                          'lpermco': 'integer'})
wrds_update_pq('dsf', 'crsp', fix_missing=True,
               col_types={'permno': 'integer',
                          'permco': 'integer'})
wrds_update_pq('dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp', fix_missing=True)
wrds_update_pq('comphist', 'crsp', fix_missing=True)
wrds_update_pq('dsedelist', 'crsp', fix_missing=True,
               col_types={'permno': 'integer', 
                          'permco': 'integer'})
wrds_update_pq('dseexchdates', 'crsp',
               col_types={'permno': 'integer',
                          'permco': 'integer'})
wrds_update_pq('msf', 'crsp', fix_missing=True,
               col_types={'permno': 'integer',
                          'permco': 'integer'})
wrds_update_pq('msi', 'crsp')
wrds_update_pq('mse', 'crsp', fix_missing=True,
               col_types={'permno': 'integer',
                          'permco': 'integer'})
wrds_update_pq('stocknames', 'crsp',
               col_types={'permno': 'integer',
                          'permco': 'integer'})
wrds_update_pq('dsedist', 'crsp', fix_missing=True,
               col_types={'permno': 'integer',
                          'permco': 'integer'})

# Fama-French library
wrds_update_pq('factors_daily', 'ff')

# Compustat
wrds_update_pq('company', 'comp')
wrds_update_pq('funda', 'comp', fix_missing=True)
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp', fix_missing=True)
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')

# The segment data sets are in the compsegd library in SAS, so we use
# sas_schema to find the SAS data; the output goes under compseg.
wrds_update_pq('seg_customer', 'compseg', sas_schema='compsegd')
wrds_update_pq('names_seg', 'compseg', sas_schema='compsegd')

E.6.2 Using db2pq

  1. Install Python. You may already have Python installed. If not, Miniconda is one option.

  2. Install the db2pq package.

pip install db2pq --upgrade

  3. Set up environment variables for db2pq (see the db2pq documentation for details).

export WRDS_ID="iangow"
export DATA_DIR="~/Dropbox/pq_data"

  4. Run Python.

  5. Within Python, use the db2pq module to get data. Running the following script creates a repository comprising all the data sets used in this book. (If you have already downloaded some tables, you might want to comment those out so that you don’t download them again.)

#!/usr/bin/env python3
from db2pq import wrds_update_pq

# CRSP
wrds_update_pq('ccmxpf_lnkhist', 'crsp', 
               col_types={'lpermno': 'int32',
                          'lpermco': 'int32'})
wrds_update_pq('dsf', 'crsp', 
               col_types={'permno': 'int32',
                          'permco': 'int32'})
wrds_update_pq('dsi', 'crsp')
wrds_update_pq('erdport1', 'crsp')
wrds_update_pq('comphist', 'crsp')
wrds_update_pq('dsedelist', 'crsp', 
               col_types={'permno': 'int32', 
                          'permco': 'int32'})
wrds_update_pq('dseexchdates', 'crsp',
               col_types={'permno': 'int32',
                          'permco': 'int32'})
wrds_update_pq('msf', 'crsp', 
               col_types={'permno': 'int32',
                          'permco': 'int32'})
wrds_update_pq('msi', 'crsp')
wrds_update_pq('mse', 'crsp', 
               col_types={'permno': 'int32',
                          'permco': 'int32'})
wrds_update_pq('stocknames', 'crsp',
               col_types={'permno': 'int32',
                          'permco': 'int32'})
wrds_update_pq('dsedist', 'crsp', 
               col_types={'permno': 'int32',
                          'permco': 'int32'})

# Fama-French library
wrds_update_pq('factors_daily', 'ff')

# Compustat
wrds_update_pq('company', 'comp')
wrds_update_pq('funda', 'comp')
wrds_update_pq('funda_fncd', 'comp')
wrds_update_pq('fundq', 'comp')
wrds_update_pq('r_auditors', 'comp')
wrds_update_pq('idx_daily', 'comp')
wrds_update_pq('aco_pnfnda', 'comp')

# Compustat segment data (compseg schema in PostgreSQL)
wrds_update_pq('seg_customer', 'compseg')
wrds_update_pq('names_seg', 'compseg')

  1. SAS works in a way that makes SAS data files a reasonable choice.

  2. It seems plausible that these web queries still draw on SAS data files to this day.

  3. SAS stores the data as floating-point values in any case.

  4. If you installed wrds2pg before December 2023, then you will need to update to the current version.