6  Financial statements: A first look

In this chapter, we provide an introduction to data on financial statements, which are the primary focus of financial reporting. While a lot of contemporary accounting research has only a tangential connection to financial reporting, it’s not clear there would be a field of accounting research without financial reporting and most accounting research paper use financial reporting data to some degree (e.g., as controls). In this chapter, we will focus on annual financial statement data of North American firms, as provided by Compustat.

Tip

In this chapter, we will use three R libraries. We have seen dplyr and ggplot2 in earlier chapters, but the DBI library is new. The DBI library provides a standardized interface to relational database systems, including PostgreSQL (the system used by WRDS), MySQL, and SQLite. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.

6.1 Setting up WRDS

Academic researchers generally get Compustat data through Wharton Research Data Services, more commonly referred to as WRDS (pronounced like “words”). In the following, we assume that you have run the code above, that you have a WRDS account, and that you are connected to the internet.

Now let’s connect to the WRDS database. To actually use the code below, you should replace your_WRDS_ID with your actual WRDS ID. You may also need to add a line Sys.setenv(PGPASSWORD = "your_password") as the third line of code.1

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGPORT = 9737L,
           PGUSER = "your_WRDS_ID", 
           PGDATABASE = "wrds")

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

The first line of code above sets up environment variables containing information that R can use as defaults when connecting to a PostgreSQL database. The second line of code creates the actual connection to the database and assigns it to the variable pg.2

An alternative to the above code would be to pass connection information to the dbConnect() function directly, as shown in the code below.3

pg <- dbConnect(RPostgres::Postgres(), 
                host = "wrds-pgdata.wharton.upenn.edu", 
                port = 9737L, 
                # user = "your_WRDS_ID",
                # password = "a_really_good_password",
                dbname ="wrds",
                bigint = "integer")

However, we recommend using the environment variable-based approach above. We also recommend using the approach based on a .pgpass file discussed in detail in the WRDS instructions.4 Putting the connection details outside your code (e.g., in environment variables) makes it possible to write code that works both for you and for others, including your co-authors.

Now that we have established a connection to the WRDS database, we can get some data. Data sources on WRDS are organized into libraries according to their source (in PostgreSQL, these libraries are called schemas, but we will stick with the WRDS terminology for now).

In this chapter we will focus on Compustat data, which is found in the comp library. While there are other sources of financial statement data, in accounting research the pre-eminent data source is Compustat, which is typically obtained via WRDS.

We first look at the table, comp.company, which contains information about the companies in Compustat’s North American database (the equivalent table in Compustat’s “Global” database is comp.g_company). By running the following code, we create R objects that point to the data on the WRDS database and behave a lot like the data frames you have seen in earlier chapters.

company <- tbl(pg, Id(schema = "comp", table = "company"))

What have we done here? In the first line, we have created an object company that “points to” the table company in schema comp in the data source pg (which is the WRDS PostgreSQL database). From the help for the tbl() function, we learn that in tbl(src, ...), src refers to the data source. Here pg is an object representing the connection to the WRDS PostgreSQL database. The ... allows us to pass other arguments to the function, and we pass Id(schema = "comp", table = "company") to indicate the underlying database object we are looking for.

But what exactly is company? One way to get information about company is to use the class() function:

class(company)
[1] "tbl_PqConnection" "tbl_dbi"          "tbl_sql"          "tbl_lazy"        
[5] "tbl"             

The output here is a little complicated for a new user, but the basic idea is that company is a tbl_PqConnection object, which is a kind of tbl_sql, which is a kind of tbl_dbi, which is a kind of tbl_lazy, which is a kind of tbl, which means it a tibble of some kind. The critical idea here is that funda is a tbl_sql object, which means it behaves like a tibble, but is actually effectively a reference to a table or query in a relational database. For want of a better term, company is a remote data frame.

While there are some technical details here that are unimportant for our discussion, the important points are that, as a tbl_sql, funda has three important properties:

  1. It’s not a data frame in memory on our computer. This means that we don’t need to download the all of the data and load it into memory.
  2. It’s not a data frame in memory on the remote server (in this case, the WRDS PostgreSQL server). This means that we didn’t load the full data set into memory even onto the remote computer.
  3. Notwithstanding the previous two points, company behaves in many important ways just like a data frame.

We can use filter() to get only rows meeting certain conditions and we can use select to indicate which columns we want to retrieve:

company |> 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") |>
  select(gvkey, conm, state)
# Source:   SQL [4 x 3]
# Database: postgres  [igow@/tmp:5432/igow]
  gvkey  conm                state
  <chr>  <chr>               <chr>
1 001690 APPLE INC           CA   
2 001820 ASTRONOVA INC       RI   
3 027925 AVID TECHNOLOGY INC MA   
4 324055 ADVANCETC LTD       <NA> 

When we are dealing with a remote data frame, the dplyr code that we write is automatically translated into SQL, a specialized language from manipulating tabular data “used by pretty much every database in existence” (Appendix D has more on SQL and dplyr).

We can inspect the translated SQL using the show_query() function:

company |> 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") |>
  select(gvkey, conm, state) |>
  show_query()
<SQL>
SELECT "gvkey", "conm", "state"
FROM "comp"."company"
WHERE ("gsubind" = '45202030') AND (SUBSTR("conm", 1, 1) = 'A')

Here we see that the filter() verb is translated into equivalent SQL using WHERE.

While remote data frames behave a lot like local data frames, they are not quite the same thing and, for many purposes, we need to use local data frames. For example, the lm() function for fitting regressions assumes that the object provided as its data argument is a data frame and it will not work if we give it a remote data frame instead. Conversion of a table from a remote data frame to a local one is achieved using the collect() verb. Here we use collect() and store the result in df_company.

df_company <-
  company |> 
  filter(gsubind == "45202030", 
         substr(conm, 1, 1) == "A") |>
  select(gvkey, conm, state) |>
  collect()

Looking at class(df_company), we see that it is not only a “tibble” (tbl_df), but also data.frame, meaning that we can pass it to functions (such as lm()) that expect data.frame objects.

class(df_company)
[1] "tbl_df"     "tbl"        "data.frame"
df_company
# A tibble: 4 × 3
  gvkey  conm                state
  <chr>  <chr>               <chr>
1 001690 APPLE INC           CA   
2 001820 ASTRONOVA INC       RI   
3 027925 AVID TECHNOLOGY INC MA   
4 324055 ADVANCETC LTD       <NA> 

6.2 Financial statement data

Two core tables for the North American Compustat data are comp.funda, which contains annual financial statement data, and comp.fundq, which contains quarterly data.

fundq <- tbl(pg, Id(schema = "comp", table = "fundq"))
funda <- tbl(pg, Id(schema = "comp", table = "funda"))

We can examine the first object in R by typing its name in the R console:

funda
# Source:   table<funda> [?? x 948]
# Database: postgres  [igow@/tmp:5432/igow]
   gvkey datadate   fyear indfmt consol popsrc datafmt tic   cusip conm  acctchg
   <chr> <date>     <int> <chr>  <chr>  <chr>  <chr>   <chr> <chr> <chr> <chr>  
 1 0011… 1985-12-31  1985 INDL   C      D      STD     AVCD  0075… ADVA… <NA>   
 2 0011… 1985-12-31  1985 INDL   C      D      SUMM_S… AVCD  0075… ADVA… <NA>   
 3 0011… 1986-12-31  1986 INDL   C      D      STD     AVCD  0075… ADVA… <NA>   
 4 0011… 1986-12-31  1986 INDL   C      D      SUMM_S… AVCD  0075… ADVA… <NA>   
 5 0011… 1987-12-31  1987 INDL   C      D      STD     AVCD  0075… ADVA… <NA>   
 6 0011… 1987-12-31  1987 INDL   C      D      SUMM_S… AVCD  0075… ADVA… <NA>   
 7 0011… 1988-12-31  1988 INDL   C      D      STD     AVCD  0075… ADVA… <NA>   
 8 0011… 1988-12-31  1988 INDL   C      D      SUMM_S… AVCD  0075… ADVA… <NA>   
 9 0011… 1989-12-31  1989 INDL   C      D      STD     AVCD  0075… ADVA… <NA>   
10 0011… 1989-12-31  1989 INDL   C      D      SUMM_S… AVCD  0075… ADVA… <NA>   
# ℹ more rows
# ℹ 937 more variables: acctstd <chr>, acqmeth <chr>, adrr <dbl>, ajex <dbl>,
#   ajp <dbl>, bspr <chr>, compst <chr>, curcd <chr>, curncd <chr>,
#   currtr <dbl>, curuscn <dbl>, final <chr>, fyr <int>, ismod <int>,
#   ltcm <chr>, ogm <chr>, pddur <int>, scf <int>, src <int>, stalt <chr>,
#   udpl <chr>, upd <int>, apdedate <date>, fdate <date>, pdate <date>,
#   acchg <dbl>, acco <dbl>, accrt <dbl>, acdo <dbl>, aco <dbl>, acodo <dbl>, …

From this code we can see that comp.funda is a very wide table, with 948 columns. With that many columns, it’s a moderately large table (at the time of writing, 1212 MB in PostgreSQL), but by focusing on certain columns, the size of the data can be dramatically reduced. Let’s learn more about this table.

First, how many rows does it have?

funda |>
  count() |> 
  pull()
[1] 888903

From this snippet we can see that comp.funda has 888903 rows. When given an ungrouped data frame as the first argument, the count function simply counts the number of rows in the data frame.5 In this case funda |> count() returns a server-side data frame (tbl_df) with a single column (n). The command pull() extracts that single column as a vector in R.6

While not evident from casual inspection of the table, the primary key of comp.funda is (gvkey, datadate, indfmt, consol, popsrc, datafmt). One requirement for a valid primary key is that each value of the primary key is associated with only one row of the data set.7 That this is a valid key can be seen from the following code, which counts the number of rows associated with each set of values for (gvkey, datadate, indfmt, consol, popsrc, datafmt) and stores that number in num_rows and then displays the various values of num_rows.

funda |>
  group_by(gvkey, datadate, indfmt, consol, popsrc, datafmt) |>
  summarize(num_rows = n(), .groups = "drop") |>
  count(num_rows) |> 
  collect()
# A tibble: 1 × 2
  num_rows      n
     <int>  <int>
1        1 888903

Because num_rows is equal to 1 in 888,903 cases and this is the number of rows in the dataset, we have (gvkey, datadate, indfmt, consol, popsrc, datafmt) as a valid key.

Another requirement of a primary key is that none of the columns contains null values (in SQL, NULL; in R, NA). The code below checks this.

funda |>
  filter(is.na(gvkey) | is.na(datadate) | is.na(indfmt) |
         is.na(consol) | is.na(popsrc) | is.na(datafmt)) |>
  count() |>
  pull()
[1] 0

Teasing apart the primary key, we have a firm identifier (gvkey) and financial period identifier (datadate), along with four variables that are more technical in nature: indfmt, consol, popsrc, and datafmt.

funda |>
  count(indfmt, consol, popsrc, datafmt) |>
  arrange(desc(n)) |>
  collect() |>
  kbl()
Table 6.1: Observations by (indfmt, consol, popsrc, datafmt) values
indfmt consol popsrc datafmt n
INDL C D STD 566656
INDL C D SUMM_STD 274036
FS C D STD 46105
INDL P D STD 1169
INDL R D STD 480
INDL P D SUMM_STD 299
FS R D STD 51
INDL R D SUMM_STD 28
INDL D D SUMM_STD 25
INDL D D STD 25
INDL C D PRE_AMENDS 19
INDL C D PRE_AMENDSS 10

As discussed on the CRSP website, the first set of values, which covers 566,656 observations, represents the standard (STD) secondary keyset, as it is what is used by most researchers when using Compustat. We can create a version of funda that limits data to this STD secondary keyset as follows:

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

For funda_mod, (gvkey, datadate) should represent a primary key. Let’s check this using code like that we used above:

funda_mod |>
  group_by(gvkey, datadate) |>
  summarize(num_rows = n(), .groups = "drop") |>
  count(num_rows) |>
  collect()
# A tibble: 1 × 2
  num_rows      n
     <int>  <int>
1        1 566656

Now, let’s look at the table comp.company, which the following code confirms has 39 columns and for which gvkey is a primary key.

company
# Source:   table<company> [?? x 39]
# Database: postgres  [igow@/tmp:5432/igow]
   conm    gvkey add1  add2  add3  add4  addzip busdesc cik   city  conml costat
   <chr>   <chr> <chr> <chr> <chr> <chr> <chr>  <chr>   <chr> <chr> <chr> <chr> 
 1 A & E … 0010… <NA>  <NA>  <NA>  <NA>  <NA>   A & E … <NA>  <NA>  A & … I     
 2 A & M … 0010… 1924… <NA>  <NA>  <NA>  94104  <NA>    0000… Tulsa A & … I     
 3 AAI CO… 0010… 124 … <NA>  <NA>  <NA>  21030… Textro… 0001… Hunt… AAI … I     
 4 A.A. I… 0010… 7700… <NA>  <NA>  <NA>  63125  A.A. I… 0000… St. … A.A.… I     
 5 AAR CO… 0010… One … <NA>  <NA>  <NA>  60191  AAR Co… 0000… Wood… AAR … A     
 6 A.B.A.… 0010… 1026… <NA>  <NA>  <NA>  33782  A.B.A.… <NA>  Pine… A.B.… I     
 7 ABC IN… 0010… 301 … <NA>  <NA>  <NA>  46590  ABC In… <NA>  Wino… ABC … I     
 8 ABKCO … 0010… 1700… <NA>  <NA>  <NA>  10019  ABKCO … 0000… New … ABKC… I     
 9 ABM CO… 0010… 3 Wh… <NA>  <NA>  <NA>  92714  ABM Co… <NA>  Irvi… ABM … I     
10 ABS IN… 0010… Inte… <NA>  <NA>  <NA>  44904  ABS In… 0000… Will… ABS … I     
# ℹ more rows
# ℹ 27 more variables: county <chr>, dlrsn <chr>, ein <chr>, fax <chr>,
#   fic <chr>, fyrc <int>, ggroup <chr>, gind <chr>, gsector <chr>,
#   gsubind <chr>, idbflag <chr>, incorp <chr>, loc <chr>, naics <chr>,
#   phone <chr>, prican <chr>, prirow <chr>, priusa <chr>, sic <chr>,
#   spcindcd <int>, spcseccd <int>, spcsrc <chr>, state <chr>, stko <int>,
#   weburl <chr>, dldte <date>, ipodate <date>

First, each gvkey value is associated with just one row.

company |>
  group_by(gvkey) |>
  summarize(num_rows = n(), .groups = "drop") |>
  count(num_rows) |>
  collect()
# A tibble: 1 × 2
  num_rows     n
     <int> <int>
1        1 53523

Second, there are no missing values of gvkey.

company |>
  filter(is.na(gvkey)) |>
  count() |>
  collect()
# A tibble: 1 × 1
      n
  <int>
1     0

6.2.1 Illustration: Microsoft versus Apple

Suppose that we were interested in comparing the profitability of Apple and Microsoft over time. To measure performance, we will calculate a measure of return on assets, here measured as the value of “Income Before Extraordinary Items” scaled by “Total Assets” for Microsoft (GVKEY: 012141) and Apple (GVKEY: 001690) over time.8

sample <-
  company |>
  filter(gvkey %in% c("001690", "012141")) |>
  select(conm, gvkey)

sample
# Source:   SQL [2 x 2]
# Database: postgres  [igow@/tmp:5432/igow]
  conm           gvkey 
  <chr>          <chr> 
1 APPLE INC      001690
2 MICROSOFT CORP 012141

Here we can see that sample is a server-side data frame with the company name (conm) and identifier (gvkey) for Apple and Microsoft. This might be a useful juncture at which to explain what the server-side data frame represents. To this end, we can use show_query():

sample |>
  show_query()
<SQL>
SELECT "conm", "gvkey"
FROM "comp"."company"
WHERE ("gvkey" IN ('001690', '012141'))

Here filter() is translated into a WHERE condition, %in% is translated to IN, and c("001690", "012141") is translated to ('001690', '012141').

It turns out that “Income Before Extraordinary Items” is represented by ib and “Total Assets” is at (see the CRSP manual for details). We can further restrict funda beyond funda_mod so that it contains just the key variables (gvkey, datadate) and the values for ib and at as follows. (In the following code, we will gradually refine the data set assigned to ib_at.)

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) 

ib_at
# Source:   SQL [?? x 4]
# Database: postgres  [igow@/tmp:5432/igow]
   gvkey  datadate       ib     at
   <chr>  <date>      <dbl>  <dbl>
 1 001141 1985-12-31  0.526   7.84
 2 001141 1986-12-31  0.7     9.95
 3 001141 1987-12-31 -0.065   8.56
 4 001141 1988-12-31 -1.97    7.68
 5 001141 1989-12-31 -1.78    4.34
 6 001738 1981-12-31 11.4   930.  
 7 001141 1990-12-31 -0.183   3.65
 8 001141 1991-12-31  0.006   3.27
 9 001147 1975-12-31  0.171   3.97
10 001141 1992-12-31  0.059   3.31
# ℹ more rows

and then restrict this by joining it with sample as follows:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample)
Joining with `by = join_by(gvkey)`
ib_at
# Source:   SQL [?? x 5]
# Database: postgres  [igow@/tmp:5432/igow]
   gvkey  datadate      ib     at conm     
   <chr>  <date>     <dbl>  <dbl> <chr>    
 1 001690 1980-09-30  11.7   65.4 APPLE INC
 2 001690 1981-09-30  39.4  255.  APPLE INC
 3 001690 1982-09-30  61.3  358.  APPLE INC
 4 001690 1983-09-30  76.7  557.  APPLE INC
 5 001690 1984-09-30  64.1  789.  APPLE INC
 6 001690 1985-09-30  61.2  936.  APPLE INC
 7 001690 1986-09-30 154.  1160.  APPLE INC
 8 001690 1987-09-30 217.  1478.  APPLE INC
 9 001690 1988-09-30 400.  2082.  APPLE INC
10 001690 1989-09-30 454.  2744.  APPLE INC
# ℹ more rows

In the code above, we used a natural join, which joins by variables found in both data sets (in this case, gvkey). In general, we want to be explicit about the join variables, which we can do using the by argument to the join function:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey")

ib_at
# Source:   SQL [?? x 5]
# Database: postgres  [igow@/tmp:5432/igow]
   gvkey  datadate      ib     at conm     
   <chr>  <date>     <dbl>  <dbl> <chr>    
 1 001690 1980-09-30  11.7   65.4 APPLE INC
 2 001690 1981-09-30  39.4  255.  APPLE INC
 3 001690 1982-09-30  61.3  358.  APPLE INC
 4 001690 1983-09-30  76.7  557.  APPLE INC
 5 001690 1984-09-30  64.1  789.  APPLE INC
 6 001690 1985-09-30  61.2  936.  APPLE INC
 7 001690 1986-09-30 154.  1160.  APPLE INC
 8 001690 1987-09-30 217.  1478.  APPLE INC
 9 001690 1988-09-30 400.  2082.  APPLE INC
10 001690 1989-09-30 454.  2744.  APPLE INC
# ℹ more rows

We next calculate a value for return on assets (using income before extraordinary items and the ending balance of total assets for simplicity).

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey") |>
  mutate(roa = ib/at)

The final step is to bring these data into R. At this stage, ib_at is still a server-side data frame (tbl_sql), which we can see using the class function.9

"tbl_sql" %in% class(ib_at)
[1] TRUE

To bring the data into R, we can use the collect() function, as follows:

ib_at <-
  funda_mod |>
  select(gvkey, datadate, ib, at) |>
  inner_join(sample, by = "gvkey") |>
  mutate(roa = ib / at) |>
  collect()

Now we have a local data frame that is no longer an instance of the tbl_sql class:

class(ib_at)
[1] "tbl_df"     "tbl"        "data.frame"

Note that, because we have used select() to get to just five fields and filter() to focus on two firms, the tibble ib_at is just 5.9 kB in size. Note the placement of collect() at the end of this pipeline means that the amount of data we need to retrieve from the database and load into R is quite small. If we had placed collect() immediately after funda_mod, we would have been retrieving data on thousands of observations and 948 variables. If we had placed collect() immediately after the select() statement, we would have been retrieving data on thousands of observations, but just 4 variables. But by placing collect() after the inner_join(), we are only retrieving data for firm-years related to Microsoft and Apple. Retrieving data into our local R instance requires both reading it off disk and, if our PostgreSQL server is remote, transferring it over the internet. Each of these operations is (relatively) slow and thus should be avoided where possible.

Having collected our data, we can make Figure 6.1.10

ib_at |>
  ggplot(aes(x = datadate, y = roa, 
             linetype = conm, color = conm)) +
  geom_line()
Plot of ROA over time for Microsoft and Apple
Figure 6.1: ROA over time for Microsoft and Apple

6.3 Exercises

  1. Suppose we didn’t have access to Compustat (or an equivalent database) for the analysis above, describe a process you would use to get the data required to make the plot above comparing performance of Microsoft and Apple.

  2. In the following code, how do funda_mod and funda_mod_alt differ? (For example, where are the data for each table?) What does the statement collect(n = 100) at the end of this code do?

funda <- tbl(pg, Id(schema = "comp", table = "funda"))

funda_mod <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  filter(fyear >= 1980)

funda_mod_alt <-
  funda_mod |>
  collect(n = 100)
  1. The table comp.company has data on SIC (Standard Industrial Classification) codes in the field sic. In words, what is the case_when() function doing in the following code? Why do we end up with just two rows?
company <- tbl(pg, Id(schema = "comp", table = "company"))

sample <-
  company |>
  select(gvkey, sic) |>
  mutate(co_name = case_when(gvkey == "001690" ~ "Apple",
                             gvkey == "012141" ~ "Microsoft")) |>
  filter(!is.na(co_name))
  1. What does the data frame another_sample represent? What happens if we change the inner_join() statement below to simply inner_join(sample). What happens if we change it to inner_join(sample, by = "sic") (i.e., omit the suffix = c("", "_other") portion)? Why do you think we want filter(gvkey != gvkey_other)?
another_sample <-
  company |>
  select(gvkey, sic) |>
  inner_join(sample, by = "sic", suffix = c("", "_other")) |>
  filter(gvkey != gvkey_other) |>
  mutate(group = paste(co_name, "peer")) |>
  select(gvkey, group)
  1. What is the following code doing?
total_sample <-
  sample |>
  rename(group = co_name) |>
  select(gvkey, group) |>
  union_all(another_sample)
  1. Suppose that we are interested in how firms’ level of R&D activity. One measure of R&D activity is R&D Intensity, which can be defined as “R&D expenses” (Compustat item xrd) scaled by “Total Assets” (Compustat item at). In xrd_at, what’s the difference between rd_intensity and rd_intensity_alt? Does filter(at > 0) seem like a reasonable filter? What happens if we omit it?
xrd_at <-
  funda_mod |>
  select(gvkey, datadate, fyear, conm, xrd, at) |>
  filter(at > 0) |>
  mutate(rd_intensity = xrd / at,
         xrd_alt = coalesce(xrd, 0),
         rd_intensity_alt = xrd_alt / at) |>
  inner_join(total_sample, by = "gvkey") 
  1. Looking at a sample of rows from xrd_at_sum, it appears that the three R&D intensity measures are always identical for Apple and Microsoft, but generally different for their peer groups. What explains these differences? Can you say that one measure is “correct”? Or would you say “it depends”?
xrd_at_sum <-
  xrd_at |>
  group_by(group, fyear) |>
  summarize(total_at = sum(at, na.rm = TRUE),
            total_rd = sum(xrd, na.rm = TRUE),
            rd_intensity1 = mean(xrd / at, na.rm = TRUE),
            rd_intensity2 = mean(xrd_alt / at, na.rm = TRUE),
            .groups = "drop") |>
  mutate(rd_intensity3 = if_else(total_at > 0, total_rd / total_at, 
                                 NA_real_)) 
  
xrd_at_sum |>
  select(-total_at, -total_rd) |>
  arrange(desc(fyear), group) |>
  collect(n = 8)
# A tibble: 8 × 5
  group          fyear rd_intensity1 rd_intensity2 rd_intensity3
  <chr>          <int>         <dbl>         <dbl>         <dbl>
1 Apple           2023        0.0848        0.0848        0.0848
2 Apple peer      2023        0.115         0.115         0.122 
3 Microsoft       2023        0.0660        0.0660        0.0660
4 Microsoft peer  2023        0.219         0.182         0.0898
5 Apple           2022        0.0744        0.0744        0.0744
6 Apple peer      2022        0.133         0.127         0.0955
7 Microsoft       2022        0.0672        0.0672        0.0672
8 Microsoft peer  2022        0.420         0.351         0.0778
  1. Write code to produce Figure 6.2. Also produce plots that use rd_intensity1 and rd_intensity2 as measures of R&D intensity. Do the plots help you think about which of the three measures makes most sense?
Plot of ROA over time for Microsoft and Apple and peers
Figure 6.2: ROA over time for Microsoft and Apple and peers

  1. We put our passwords in a special password file, as described here, so we don’t need this step. It’s obviously not a good idea to put your password in code.↩︎

  2. The bigint = "integer" is included here to avoid some technical issues later on. See the help for RPostgres for more information. You can access this help by typing ? RPostgres::Postgres in the R console.↩︎

  3. Remove the # for the line beginning # password and replace the password and user name if you try this.↩︎

  4. See https://go.unimelb.edu.au/kv68. Note that we do not recommend setting up a connection to WRDS in your .Rprofile, as this makes code that relies on such a less transparent and usable by others. Additionally, it means setting up a connection even when you may not need one.↩︎

  5. A theme of dplyr is that many functions take data frames as (first) arguments and return data frames as values.↩︎

  6. As that vector has a single value it is displayed without any information about the database connection that we used to get the information. This looks cleaner in this setting, as you can see yourself if you run the command yourself without pull() at the end.↩︎

  7. See Chapter 19 of R for Data Science for more discussion of this point.↩︎

  8. Financial analysts might object to using an end-of-period value as the denominator; we do so here because it simplifies the coding and our goal here is just illustrative.↩︎

  9. An alternative to this line would be inherits(ib_at, "tbl_sql"). More on classes can be found in “Advanced R”.↩︎

  10. Note that, even if we hadn’t called collect(), recent versions of ggplot2 are aware of remote data frames and will take care of this for us. But other functions (e.g., lm(), which we saw in Chapter 4) are not aware of remote data frames and thus require us to use collect() before passing data to them.↩︎