8 Financial statements: A second look
In this chapter, we will dive a bit more deeply into financial statements than we did in Chapter 6. The focus of this chapter is an exploration of core attributes of financial accounting data (e.g., that balance sheets balance). After exploring how financial statement data are represented in Compustat, we dive into our first detailed exploration of an accounting research paper, Koh and Reeb (2015), which explores issues regarding how R&D spending is reported by firms and coded in Compustat.
This chapters is optional in the sense that subsequent chapters do not depend on an understanding of the content of this chapter. However, the material of this chapter provides a good opportunity for readers to better understand the messiness of representing seemingly tidy relationships in accounting data in a database. Some of the exercises help readers to understand better where data in Compustat come from. Working through this chapter also serves to dispel any notion that commercial databases are error-free, as we find discrepancies that are apparent even from just the information in Compustat.
In this chapter, we will use the R libraries below. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available on GitHub.
8.1 Core attributes of financial statements
Three core attributes of financial statements that should hold are the following:
- Balance-sheet balance. Balance sheets should balance (i.e., for total assets should equal the sum of total liabilities and shareholders’ equity). This is a sine qua non of double-entry bookkeeping.
- Within-statement articulation. The various sub-totals within a financial statement make sense with regard to the items they comprise. For example, the amount in total assets should equal the sum of the component assets.
- Across-statement articulation. Financial statements should articulate. For example, the amounts reported for cash on the beginning and ending balance sheets should be explained by the statement of cash flows.
Below we explore how these attributes are reflected in Compustat. As in Chapter 6, we will focus on two tables from Compustat’s North American database: comp.funda
and comp.company
and we will construct a table funda_mod
that contains the “standard” set of observations for Compustat.
We begin by setting up the remote data frames we used in Chapter 6.
db <- dbConnect(duckdb::duckdb())
funda <- load_parquet(db, schema = "comp", table = "funda")
company <- load_parquet(db, schema = "comp", table = "company")
funda_mod <-
funda |>
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D")
8.2 Balance sheets
An essential feature of statements of financial position is that they balance, hence the more traditional term “balance sheets”. Let’s check that this holds on Compustat. Total assets is at
and the sum of liabilities and shareholders’ equity is lse
.
So everything balances at this level. Note that if either at
or lse
is NA
, then the filter will not apply. For completeness, the following code examines missingness of at
and lse
. Below we see that if at
is missing, so is lse
; if at
is present, so is lse
.
8.3 Within-statement articulation
Having established that balance sheets always balance on Compustat, the next question we study is whether the various balancing models for funda
hold. Compustat provides balancing models that explain how various items on financial statements in its database relate to each other. WRDS provides these balancing models as Excel files on its website (requires WRDS access).
One example of a balancing model relates to the decomposition of total assets: at
= act
+ ppent
+ ivaeq
+ ivao
+ intan
+ ao
, where the description of each item in the equation is provided in Table 8.1.
Item | Description |
---|---|
act |
Current Assets—Total |
ppent |
Property Plant and Equipment—Total (Net) |
ivaeq |
Investment and Advances—Equity |
ivao |
Investment and Advances—Other |
intan |
Intangible Assets—Total |
ao |
Assets Other—Total |
dc |
Deferred Charges (component of AO) |
aox |
Assets Other (Sundry) (component of AO) |
The first issue we need to think about is the presence of NA
values in the components of at
even when at
itself is not NA
. Let’s look at data for one company (GVKEY: 008902
) where this creates issues.
From the output from the code below, we see that, because ivaeq
is NA
for these years, the sum of the components of at
is also NA
, even though the value of at
is not NA
. What’s going on? The answer is that ivaeq
is NA
because this firm doesn’t report amounts for “Investment and Advances - Equity”, because they’re either zero or not material. Thus it seems reasonable to assume that these should be zero.
funda_mod |>
semi_join(na_sample_firm_years, by = c("gvkey", "datadate")) |>
mutate(at_calc = act + ppent + ivaeq + ivao + intan + ao) |>
select(gvkey, datadate, at, at_calc, act,
ppent, ivaeq, ivao, intan, ao) |>
arrange(datadate)
gvkey | datadate | at | at_calc | act | ppent | ivaeq | ivao | intan | ao |
---|---|---|---|---|---|---|---|---|---|
008902 | 2000-05-31 | 2099.20 | NA | 785.09 | 366.23 | NA | 0 | 915.74 | 32.15 |
008902 | 2001-05-31 | 2078.49 | NA | 819.42 | 362.04 | NA | 0 | 871.65 | 25.39 |
008902 | 2002-05-31 | 2036.40 | NA | 801.31 | 355.80 | NA | 0 | 856.86 | 22.43 |
008902 | 2003-05-31 | 2247.21 | NA | 928.09 | 370.79 | NA | 0 | 914.20 | 34.13 |
008902 | 2004-05-31 | 2353.12 | NA | 994.62 | 381.06 | NA | 0 | 930.62 | 46.83 |
To convert NA
amounts to zero, we can use the coalesce()
function, which takes two arguments and returns the first argument if it is not NA
and returns the second argument otherwise.1 So, coalesce(ivaeq, 0)
equals ivaeq
when ivaeq
is not NA
, and 0
otherwise.
tol <- 1e-3
funda_na_fixed <-
funda_mod |>
select(gvkey, datadate, act, ppent, ivaeq, ivao, intan, ao, at,
dc, aox) |>
mutate(at_calc = coalesce(act, 0) + coalesce(ppent, 0) +
coalesce(ivaeq, 0) + coalesce(ivao, 0) +
coalesce(intan, 0) + coalesce(ao, 0),
at_diff = at_calc - at,
balance = abs(at_diff) < tol)
Let’s see if applying coalesce(x, 0)
helps. Below we see that, in this case, we can tie the value for at
with its components (act
, ppent
, ivaeq
, ivao
, intan
, and ao
).
funda_na_fixed |>
semi_join(na_sample_firm_years,
by = c("gvkey", "datadate")) |>
select(gvkey, datadate, at, at_calc, balance) |>
arrange(datadate)
gvkey | datadate | at | at_calc | balance |
---|---|---|---|---|
008902 | 2000-05-31 | 2099.20 | 2099.20 | True |
008902 | 2001-05-31 | 2078.49 | 2078.49 | True |
008902 | 2002-05-31 | 2036.40 | 2036.40 | True |
008902 | 2003-05-31 | 2247.21 | 2247.21 | True |
008902 | 2004-05-31 | 2353.12 | 2353.12 | True |
Note that we specified a “tolerance” (tol
) equal to 0.001 because testing for equality of floating-point calculations is not always exact.
But, does this approach resolve all issues with articulation of at
with its components? Alas the output from the follow code suggests the answer is “no”. We use the exercises to look more closely at a few of the problem cases.
funda_na_fixed |>
count(balance)
balance | n |
---|---|
False | 79,567 |
NA | 88,568 |
True | 404,579 |
8.3.1 Exercises
What is the value in
funda_na_fixed
ofat
whenbalance
isNA
? Is this surprising?Write code to calculate
decade
, the decade in whichdatadate
for each observation falls. (Hint: The functionsfloor()
andyear()
and the number10
may be helpful.)Are there any cases in
funda_na_fixed
whereat_calc
is greater than zero andat
isNA
? Which decades are these mostly found in? If you were doing research with these data, how might you handle these cases?Consider the firm with
gvkey
equal to016476
. Write code to obtain the company name and CIK from thecompany
table defined above. Using that CIK, find the 10-K filing for the year ending November 2003 on the SEC’s website. (Note: The 10-K will be released some time after2003-11-30
.)Looking at the 10-K you just found, what seems to be going on with the observation below?
funda_na_fixed |>
filter(gvkey == "016476", datadate == "2003-11-30") |>
select(datadate, at, at_calc, at_diff, act, ppent, intan, dc, aox)
datadate | at | at_calc | at_diff | act | ppent | intan | dc | aox |
---|---|---|---|---|---|---|---|---|
2003-11-30 | 2983.76 | 3038.56 | 54.80 | 1675.12 | 486.71 | 244.63 | 54.80 | 577.30 |
- Using the approach above, we can find the 10-K for the observation examined in the following code on the SEC’s EDGAR site. What’s going on with this case? What’s the most significant difference between this case and the one above? (Hint: The following additional Compustat balancing model may help:
act = ppent + intan + ivao + ao
.)
8.4 Across-statement articulation
pf Under IAS 1 Presentation of Financial Statements, a complete set of financial statements will include the following four statements:
- A statement of financial position (i.e., a balance sheet)
- A statement of profit or loss and other comprehensive income (i.e., an income statement)
- A statement of changes in equity
- A statement of cash flows
Of these four statements, the balance sheet represents a statement of stocks, or balances at particular points of time, while the other three represent statements of flows, or changes in balances over time. Of the three flow statements, only the last two explicitly reconcile beginning and ending balance sheet items.2 The statement of changes in equity “provides a reconciliation of the opening and closing amounts of each component of equity for the period” and the statement of cash flows performs an analogous function with respect to the opening and closing amounts of cash and cash equivalents.
Compustat’s balancing models only address the balance sheet, income statement, and statement of cash flows. There is no balancing model for the statement of changes in equity on Compustat. So we only really have the statement of cash flows available to test the across-statement articulation within Compustat and therefore focus on that statement in the discussion below.
Before moving on to the cash flow statement, we examine the balancing model for cash on the balance sheet: ch
+ ivst
= che
, where the description of each item in the equation is given in Table 8.2.
Item | Description |
---|---|
ch |
Cash |
ivst |
Short-Term Investments |
che |
Cash and Short-Term Investments—Total |
The wrinkle here can be seen in the output below: There are a few cases where we are missing the total (che
), but we have values for one of its components (ivst
or ch
).
funda_mod |>
select(gvkey, datadate, che, ch, ivst) |>
count(missing_che = is.na(che),
missing_ch = is.na(ch),
missing_ivst = is.na(ivst)) |>
arrange(missing_che, missing_ch, missing_ivst)
missing_che | missing_ch | missing_ivst | n |
---|---|---|---|
False | False | False | 415,414 |
False | False | True | 1,206 |
False | True | False | 4 |
False | True | True | 62,322 |
True | False | True | 74 |
True | True | False | 61 |
True | True | True | 93,633 |
While more digging might be appropriate to work out how best to handle these cases, below we take the simple expedient of using ch
or ivst
as the value for che
when che
is missing, but ch
is not. The following suggests that this modified balancing model holds in almost every case where at least one of components is not missing (though there is a non-trivial number of cases where che
, ch
and ivst
are all missing).
funda_mod |>
select(gvkey, datadate, che, ch, ivst) |>
mutate(missing_che = is.na(che),
missing_che_comps = is.na(ch) & is.na(ivst),
che = case_when(!is.na(che) ~ che,
!is.na(ivst) ~ ivst,
!is.na(ch) ~ ch),
che_calc = coalesce(ch, 0) + coalesce(ivst, 0),
che_diff = che_calc - che,
che_balance = abs(che_diff) < tol) |>
count(che_balance, missing_che, missing_che_comps) |>
arrange(desc(n))
che_balance | missing_che | missing_che_comps | n |
---|---|---|---|
True | False | False | 416,623 |
NA | True | True | 93,633 |
False | False | True | 62,064 |
True | False | True | 258 |
True | True | False | 135 |
False | False | False | 1 |
Having explored the Compustat balancing model for cash on the balance sheet, we now consider the Compustat balancing model for the annual cash flow statement, which is expressed as follows: oancf
+ ivncf
+ fincf
+ exre
= chech
, where the description of each item in the equation is given in Table 8.3.
Item | Description |
---|---|
oancf |
Operating Activities—Net Cash Flow |
ivncf |
Investing Activities—Net Cash Flow |
fincf |
Financing Activities—Net Cash Flow |
exre |
Exchange Rate Effect |
chech |
Cash and Cash Equivalents—Increase (Decrease) |
Let’s collect some data to check whether the cash flow statement balancing model holds. The first step is to calculate its left-hand side, chech_calc
, converting missing values for each of the four elements to zero. Here we use across()
to modify multiple columns in one step. (The across()
function is covered in some detail in Chapter 26 of R for Data Science.) We then compare chech_calc
and chech
and, if these two values are essentially equal, we set chech_balance
to TRUE
.
funda_mod |>
select(gvkey, datadate, oancf, ivncf, fincf, exre, chech) |>
mutate(across(oancf:exre, \(x) coalesce(x, 0)),
chech_calc = oancf + ivncf + fincf + exre,
chech_diff = chech_calc - chech,
chech_balance = abs(chech_diff) < tol) |>
filter(!is.na(chech), chech_calc != 0) |>
count(chech_balance)
chech_balance | n |
---|---|
False | 488 |
True | 308,899 |
So there are relatively few cases where the first equation doesn’t hold.
Note that chech
is likely collected from the statement of cash flows itself, so this is really a within-statement relationship. This leaves the final question: Do the amounts provided in Compustat items related to the balance sheet tie to amounts provided in corresponding items related to the statement of cash flows? Here things get a little more complicated because we need to compare changes in cash implied by two balance sheets with chech
. In practice, it turns out that, while in many cases ch
(“cash”) equals che
(“cash and cash equivalents”), in cases where ch
does not equal che
, there are firms whose statements of cash flows reconcile with changes in ch
and firms whose statements of cash flows reconcile with changes in ch
. Note that we loosen the tolerance a bit (tol <- 0.1
) in this test to allow for rounding errors in the presentation of multiple financial statements.
Note that we use the window function lag()
in this query. With a window function, the data are organized into partitions and the function operates on each partition independently and possibly uses data from multiple rows of data within that window. Here we construct partitions based on gvkey
using group_by(gvkey)
because we only want to consider values for a given firm. The data within each partition are ordered by datadate
using window_order()
. As its name suggests, the lag()
function for each value simply returns the value for the previous element in the window.3 We will see several window functions in this book, including lead()
(the opposite of lag()
), row_number()
(the position of the row in the window), and cumsum()
(the cumulative sum of the window’s values). A short discussion of window functions is found in Chapter 21 of R for Data Science.
tol <- 0.1
funda_cf_balance <-
funda_mod |>
select(gvkey, datadate, oancf, ivncf, fincf,
exre, chech, che, ch, ivst) |>
mutate(across(oancf:exre, \(x) coalesce(x, 0)),
chech_calc = oancf + ivncf + fincf + exre,
chech_balance = abs(chech_calc - chech) < tol) |>
filter(chech_calc != 0, chech_balance) |>
group_by(gvkey) |>
window_order(datadate) |>
mutate(lag_datadate = lag(datadate),
d_che = che - lag(che),
d_ch = ch - lag(ch)) |>
ungroup() |>
filter(!is.na(d_che) | !is.na(d_ch)) |>
mutate(artic_desc =
case_when(abs(d_ch - d_che) < tol & abs(chech - d_che) < tol
~ "Articulates (CHE == CH)",
abs(chech - d_che) < tol ~ "Articulates using CHE",
abs(chech - d_ch) < tol ~ "Articulates using CH",
abs(chech - d_che) >= tol ~ "Does not articulate",
abs(chech - d_ch) >= tol ~ "Does not articulate",
.default = "Other"))
Below we see that there is a non-trivial number of cases where the cash flow statement appears not to articulate with changes in cash balances on the balance sheet. We explore some cases of this failure to articulate in the exercises.
Description | Number of cases |
---|---|
Articulates (CHE == CH) | 139,825 |
Articulates using CH | 80,814 |
Articulates using CHE | 16,827 |
Does not articulate | 42,340 |
8.4.1 Exercises
In checking cash flow statement articulation, we used the
across()
function (the documentation? dplyr::across
provides more detail). Rewrite the code to createfunda_na_fixed
above to use a similar approach. Check that you get the same results with respect tocount(balance)
as you get using the original code.Consider the case of American Airlines (GVKEY:
001045
) for the year ended 31 December 2020. Look up the relevant American Airlines 10-K onsec.gov
. (Hint: You can get the CIK fromcomp.company
.) What explains the gaps seen in the output below?
8.5 Missing R & D
In this section, we discuss Koh and Reeb (2015), who “investigate whether missing R&D expenditures in financial statements indicates a lack of innovation activity. Patent records reveal that 10.5% of missing R&D firms file and receive patents, which is 14 times greater than zero R&D firms.”
8.5.1 Discussion questions
- What is the following code doing? Koh and Reeb (2015) use a similar filter. What is their rationale for this?
included_firms <-
company |>
mutate(sic = as.integer(sic)) |>
filter(!between(sic, 4900, 4999), !between(sic, 6000, 6999)) |>
select(gvkey)
rd_data <-
funda_mod |>
semi_join(included_firms, by = "gvkey") |>
mutate(missing_rd = is.na(xrd),
zero_rd = xrd == 0,
year = year(datadate)) |>
select(gvkey, datadate, year, xrd, missing_rd, zero_rd, at) |>
collect()
Is there a balancing model that includes R&D spending?
Focusing on years 1980-2019, what have been the trends in the proportion of firms not reporting R&D and the “average” amount of R&D (scaled by assets)?4 (In calculating the “average” does it make sense to use the
mean()
ormedian()
?) Speculate as to what might explain these trends.Koh and Reeb (2015) find that many “studies in The Accounting Review use R&D in their analysis and code the missing values as zero, implicitly assuming that blank R&D is equal to zero R&D, [while many] articles in the Journal of Finance [code] … the blank values as zero and including a dummy variable to indicate blank R&D firms. … In contrast, 42% of the studies in the Strategic Management Journal use R&D and they take a very different approach, often replacing the missing R&D values with either the industry average R&D, or a historical value from prior years.” Why do scholars from different fields make such different choices? In light of Koh and Reeb (2015), do you think that one approach is more correct than the others?
Consider the 10-K filing made by IHS Markit Ltd on 2021-01-22. Based on the information in the filing, do you think that IHS Markit engages in research and development activity? Does IHS Markit generate patents? Are there other forms of intellectual property protection that IHS Markit relies on? Does IHS Markit report an amount for research and development expenditure? Who is IHS Markit’s external auditor? Do you think that IHS Markit is in violation of GAAP? Or is its reporting choice with respect to R&D a within-GAAP use of reporting discretion?
Koh and Reeb (2015) state that “our first set of tests compare patent activity between non-reporting R&D firms and firms that report zero R&D. … We use both full sample and propensity score matched samples. … While full sample tests allow for greater external validity, the matched sample tests potentially improve the local treatment effect. … Based on the propensity score matched sample, our multivariate tests indicate that, on average, non-reporting R&D firms file about 14 times more patent applications than the matched zero R&D firms.” In speaking of “treatment effects” Koh and Reeb (2015) implicitly view disclosure of R&D as a treatment variable and patent applications as an outcome. Does this make sense to you? Do you think that this is really what Koh and Reeb (2015) want to do? If not, what is the inference they are trying to draw?
What exactly is the take-away from the analysis in section 4.4 of Koh and Reeb (2015), which uses “the rapid demise of Arthur Andersen (AA) as a quasi-natural experiment”? For example, what is the treatment? What assumptions are needed to generalize from the specific treatment to a treatment of wider applicability? What additional analyses can you suggest that might provide additional assurance that the results reflect a causal effect of interest?
What are the suggested implications of the simulation analysis of section 4.5 of Koh and Reeb (2015)? What are the inherent limitations in a simulation analysis like this one?
As discussed in its documentation, the
coalesce()
function fromdplyr
is “inspired by the SQLCOALESCE
function which does the same thing forNULL
s”.↩︎While the income statement needs to articulate with retained earnings, it does not provide a complete account of changes in that balance sheet account.↩︎
A careful reader might have noticed that we actually used the window function
fill()
in Chapter 2. However, there we usedarrange()
instead ofwindow_order()
. Thewindow_order()
function is only available for remote data frames because it relies on the SQL back-end to provide functionality not available with local data frames. Readers coming from an SQL background might observe thatdplyr
’sgroup_by()
is “overloaded” in the sense that it does the work of both theGROUP BY
statement and thePARTITION BY
clause in SQL.↩︎At the time of initial writing, 2020 was too incomplete to be meaningful.↩︎