21  Panel data

Most data sets used in accounting research are panel data sets, which means repeated observations on multiple units. Typically units are companies and observations are available for several points of time or period. As we saw in Chapter 3, in certain conditions, researchers can exploit panel data sets to gain insights into causal effects.

In this chapter we explore two approaches commonly used with panel data sets. The first approach uses difference in differences to measure causal effects. The second approach uses multi-way fixed effects to account both for what is commonly referred to as “unobserved time-invariant heterogeneity” between units (i.e., unit-specific effects) and period-specific effects. In practice, these approaches are often combined. We discussed difference in differences in Chapter 19 and recommend that you read that chapter before this one.

Tip

The code in this chapter uses the following packages. For instructions on how to set up your computer to use the code found in this book, see Section 1.2. Quarto templates for the exercises below are available at https://github.com/iangow/far_templates.

21.1 Analysis of simulated data

In Chapter 3, we explored the test_scores data set. There we used a number of approaches, including one where we estimated the effect size using a difference-in-difference estimator with grade and individual fixed effects as 15.736.

We can now reveal that the test_scores data set is simulated using the function get_test_scores(), with the default values for each of its arguments: effect_size = 15, n_students = 1000, and n_grades = 4.

The data-generating process embedded in get_test_scores() produces scores using the following equation

\[ y_{ig} = \alpha_i + \beta x_{ig} + \gamma_g + \epsilon_{ig} \] where \(i\) and \(g\) denote individuals and grades, respectively. Denoting \(G\) as the grade after application of treatment (i.e., 7 in the test_scores data), \(x_{ig}\) is the treatment indicator:

\[ x_{ig} = \begin{cases} 1 \text{ if } i \text{ is treated and } g \geq G \\ 0 \text{ otherwise} \end{cases} \]

Thus, as we will see, this is precisely the setting where multi-way fixed effects are appropriate. The individual fixed effect picks up \(\alpha_i\) and the grade fixed effect picks up \(\gamma_g\). While the estimated effect size of 15.736 seems close to the true value \(\beta = 15\), but is it close enough? To better examine this issue, we generate a version of the data where the treatment effect is zero (i.e., \(\beta = 0\)).

set.seed(2021)
test_scores_alt <- get_test_scores(effect_size = 0)
fms <- list()
fms[[1]] <- feols(score ~ I(treat * post) | grade  + id, 
              data = test_scores_alt, vcov = "iid")
fms[[2]] <- feols(score ~ I(treat * post) | grade  + id, 
                  vcov = ~ grade + id, data = test_scores_alt)

Because we set the seed to the same value (2021) as used to generate the original data, we get the same random draws and our estimated treatment effect is exactly as before (0.736), but reduced by \(15\) and the estimated standard error (without clustering) is exactly as we saw earlier (0.319). However, we see evidence that the coefficient is not “close enough” to the true value, as we would reject the true null at the 5% level (\(p\)-value of 0.0211).

modelsummary(fms,
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared", "adj.r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.1: Regression analysis: Simulated data
 (1)   (2)
I(treat * post) 0.736** 0.736
(0.319) (0.995)
Num.Obs. 4000 4000
R2 0.923 0.923
R2 Adj. 0.897 0.897

What has “gone wrong” here? Some possible explanations are:

  1. Our standard errors are too low.
  2. This is simply bad luck (i.e., we would not expect these results for a different seed value).
  3. Our estimator is biased due to small-sample properties of the fixed-effects estimator.
  4. Our estimator is biased due to endogenous selection.

Let’s take these explanations one at a time. For the first, the fact that cluster-robust standard errors are significantly higher seems consistent with this explanation. But recall that the standard error is essentially an estimate of the standard deviation of the relevant coefficient over independent random draws of the data. So a more rigorous way to test this explanation is to simulate the data and examine whether the OLS standard errors tend to underestimate the variability of the coefficients.

extract_stats <- function(fm, model_name) {
  tibble(model = model_name, 
         estimate = coef(fm),
         se = se(fm),
         t = coef(fm) / se(fm), 
         `p-value` = pvalue(fm)) 
}

The following function generates a data set using the get_test_scores() function, then runs two regressions—one using “OLS” standard errors and one using two-way cluster-robust standard errors (“CL-2”)—then returns statistics from the fitted models.

sim_run <- function(i, ...) {
  df <- get_test_scores(effect_size = 0, ...)
  
  fm1 <- feols(score ~ I(treat * post) | grade  + id, 
              data = df, vcov = "iid")
  fm2 <- feols(score ~ I(treat * post) | grade  + id, 
                  vcov = ~ grade + id, data = df)
  
  bind_rows(extract_stats(fm1, model_name = "OLS"),
            extract_stats(fm2, model_name = "CL-2"))
}

The following code runs sim_run() 1,000 times and stores the results in sim_results.

num_sims <- 1000L

sim_results <- bind_rows(map(1:num_sims, sim_run))

Given these data, we want to compare the standard deviation of the estimate coefficient with the mean of its estimated standard error for the two approaches (OLS and CL-2):

sim_results |> 
  group_by(model) |> 
  summarize(se_obs = sd(estimate), 
            se_est = mean(se))
# A tibble: 2 × 3
  model se_obs se_est
  <chr>  <dbl>  <dbl>
1 CL-2   0.335  0.984
2 OLS    0.335  0.315

We can see that the two-way cluster-robust standard errors are actually far too high, while the OLS standard errors are pretty close to the true standard deviation of the coefficients. This seems to rule out the first explanation.

To evaluate the second explanation (“bad luck”), we can use the same data. Let’s consider critical \(p\)-values of 1% and 5% and count how many of the OLS test statistics would lead to rejection of the null hypothesis (\(\beta = 0\)) at each size. (We focus on the OLS \(t\)-statistics, as we have established that these produce the better standard error estimates.)

rejection_stats <-
  sim_results |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.863   0.951

So, we reject a true null hypothesis \(86\%\) of the time at the \(1\%\)-level and \(95\%\) of the time at the \(5\%\)-level, even with what appear to be good standard errors. This would be an extraordinary degree of bad luck, suggesting that there is bias in our coefficient estimates.

The properties of fixed-effect estimators are derived asymptotically, that is as the number of observations and estimated fixed effects approaches infinity. However, in a typical panel data set, we have a fairly small number of time periods. In our analysis above, we have just four grades of data. But, because these are simulated data and not real people, we can easily expand the number of grades of data that we consider. Let’s consider n_grades = 12, the maximum handled by the get_test_scores function.1

sim_results_wide <- list_rbind(map(1:num_sims, sim_run, n_grades = 12))
rejection_stats <-
  sim_results_wide |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.329   0.574

Here we see evidence of a reduction of the bias, but the bias is not eliminated. We still reject a true null hypothesis \(33\%\) of the time at the \(1\%\)-level and \(57\%\) of the time at the \(5\%\)-level.

So let’s consider the final explanation of the bias, which is due to non-random selection. It turns out that assignment to treatment in get_test_scores is a function of test scores in the year prior to the camp, but conditional on these scores, is completely random.2 This is termed selection on observables, which is often suggested as a basis on which causal inference can be justified when using approaches such as propensity-score matching.

So what happens if we use completely random assignment to treatment? We can request such assignment by setting the random_assignment to TRUE in the get_test_scores() function.

sim_results_rand <- list_rbind(map(1:num_sims, sim_run, 
                                  random_assignment = TRUE))
rejection_stats <-
  sim_results_rand |> 
  filter(model == "OLS") |> 
  summarize(prop_01 = mean(`p-value` <= 0.01), 
            prop_05 = mean(`p-value` <= 0.05))

rejection_stats
# A tibble: 1 × 2
  prop_01 prop_05
    <dbl>   <dbl>
1   0.009   0.046

Here we see evidence of elimination of the bias. We reject a true null hypothesis \(0.9\%\) of the time at the \(1\%\)-level and \(4.6\%\) of the time at the \(5\%\)-level.

Thus it appears that random assignment is critical in this setting for achieving valid causal inferences. This is somewhat concerning as, beyond non-random treatment assignment, the basic assumptions underlying causal inference appear to be satisfied, as we have a plausible basis for the parallel trends assumption. First, we have grade effects that are the same for both treated and untreated observations. Second, we have individual effects that, while different between the treated and untreated observations, remain constant over the sample period and therefore do not undermine the parallel-trends assumption.3 What this suggests is that subtle biases can enter difference-in-difference analyses even in settings that are unrealistically simple. It seems reasonable to expect that biases exist—and are plausibly worse—in the more complex settings of actual research.

21.2 Voluntary disclosure

One paper that uses a multi-way fixed-effect structure like that we analysed above is Li et al. (2018), who “seek to provide causal evidence on the proprietary cost hypothesis(2018, p. 266). While “proprietary costs” are commonly assumed to be those caused by use of information by competitors, Verrecchia (1983) uses the term more broadly:

It seems apparent that there would be a proprietary cost associated with releasing information which is unfavorable to a firm (e.g., a bank would be tempted to ask for repayment of its loan). However, the release of a variety of accounting statistics about a firm (e.g., sales, net income, costs of operation, etc.) may be useful to competitors, shareholders, or employees in a way which is harmful to a firm’s prospects even if (or perhaps because) the information is favorable. One recent example of this is the response of the UAW (United Auto Workers) for fewer labor concessions in the face of an announcement by Chrysler Corporation’s chairman that that firm’s fortunes had improved.

While Verrecchia (1983) considers the “reluctance of managers in certain highly competitive industries … to disclose favorable accounting data”, he assumes that the cost of disclosure is constant and independent of the disclosed value.

Verrecchia (1983) can be viewed as providing a theoretical foundation for the proprietary cost hypothesis. The model in Verrecchia (1983) posits a capital market–driven incentive for disclosure of favourable information; absent such an incentive, there would be no disclosure in the Verrecchia (1983) setting. Another feature of the Verrecchia (1983) setting is that firms with favourable news are the ones disclosing, as the capital-market benefit of disclosure is greater for these firms.

Li et al. (2018) exploit the staggered implementation of the inevitable disclosure doctrine, which was adopted by state courts as part of the common law of their respective states at different times (some state courts later rejected the doctrine after adopting it). The inevitable disclosure doctrine (IDD) provides an employer with injunctive relief to prevent a current or former employee from working for another company if doing so will lead to the inevitable disclosure of trade secret information. Li et al. (2018) argue that IDD increases the marginal benefits of non-disclosure, which they interpret as an increase in the cost of disclosure.

Li et al. (2018) focus their analysis of disclosure choice on the disclosure of customer identities in 10-Ks and assume that the state of a firm’s headquarters governs the applicability of IDD to the firm. We will conduct an approximate replication of certain analyses of Li et al. (2018) to understand the empirical approaches of this chapter. This requires combining data on customer disclosures with data on the dates of adoption of IDD by states with data in the states of companies’ headquarters.

21.2.1 Customer disclosures

The first data set is compseg.seg_customer, which contains data derived from companies’ disclosures regarding significant customers.

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
seg_customer <- tbl(pg, Id(schema = "compseg", table = "seg_customer"))
names_seg <- tbl(pg, Id(schema = "compseg", table = "names_seg"))
funda <- tbl(pg, Id(schema = "comp", table = "funda"))
seg_customer |>
  count(ctype) |>
  arrange(desc(n)) |>
  collect()
# A tibble: 7 × 2
  ctype         n
  <chr>     <int>
1 COMPANY  359802
2 GEOREG   145366
3 MARKET   105443
4 GOVDOM    39213
5 GOVFRN     5048
6 GOVSTATE   1428
7 GOVLOC      572
seg_customer |>
  filter(ctype == "COMPANY") |>
  count(cnms) |>
  arrange(desc(n)) |>
  collect()
# A tibble: 43,856 × 2
   cnms             n
   <chr>        <int>
 1 Not Reported 67098
 2 NOT REPORTED 31162
 3 2 Customers   5865
 4 3 Customers   5156
 5 5 Customers   4591
 6 9 Customers   4104
 7 4 Customers   4058
 8 10 Customers  3784
 9 Not reported  2925
10 2 CUSTOMERS   2053
# ℹ 43,846 more rows

To better understand this setting, let’s examine one example. One observation chosen somewhat at random from the last year of the sample period is Advanced Micro Devices, Inc. (AMD), a global semiconductor company. In its 10-K for the year ended 31 December 2010, AMD disclosed the following:

In 2010, Hewlett-Packard Company accounted for more than 10% of our consolidated net revenues. Sales to Hewlett-Packard consisted primarily of products from our Computing Solutions segment. Five customers, including Hewlett-Packard, accounted for approximately 55% of the net revenue attributable to our Computing Solutions segment. In addition, five customers accounted for approximately 46% of the net revenue attributable to our Graphics segment. A loss of any of these customers could have a material adverse effect on our business.

Elsewhere in its 10-K, AMD discloses:

In 2010, the Company had one customer that accounted for more than 10% of the Company’s consolidated net revenues. Net sales to this customer were approximately $1.4 billion, or 22% of consolidated net revenues, and were primarily attributable to the Computing Solutions segment.

Note that the Graphics and Computing Solutions segments had sales of $1,663 million and $4,817 million, respectively, and therefore made up 99.8% of AMD’s total sales of $6,494 million.

Thus 46% of Graphics represents $765 million, while 55% of Computing Solutions is $2,650 million. After subtracting $1,400 million of these sales to Hewlett-Packard (HP), Compustat appears to ascribe the remaining $1,250 to “4 Customers”. The basis for assigning $1,400 to “Not Reported” is unclear (while the sentence above merely says “one customer”, it is quite clear that this is HP and Compustat elsewhere assumes as much). (In fact, it seems almost certain that there is double-counting in this case, making addition of the numbers in salecs problematic.)

seg_customer |> 
  filter(gvkey == "001161", datadate == "2010-12-31") |>
  select(cnms, ctype, salecs, stype) |>
  collect()
# A tibble: 8 × 4
  cnms               ctype   salecs stype 
  <chr>              <chr>    <dbl> <chr> 
1 HEWLETT-PACKARD CO COMPANY  1400  BUSSEG
2 4 Customers        COMPANY  1250. BUSSEG
3 Not Reported       COMPANY  1400  BUSSEG
4 5 Customers        COMPANY   765  BUSSEG
5 Not Reported       COMPANY    NA  BUSSEG
6 Not Reported       COMPANY    NA  BUSSEG
7 Not Reported       COMPANY    NA  BUSSEG
8 International      GEOREG   5715. BUSSEG

Focusing on this case perhaps help us to understand the disclosure decision faced by AMD. Even if AMD had not disclosed the identity of its largest customer, it perhaps would have been easy to infer that “Customer #1” was indeed HP, as HP was the largest personal computer manufacturer at the time.4

The second dataset we will use is undisclosed_names, which is part of the farr package. This dataset was constructed somewhat manually and represents the values of cnms that should be deemed non-disclosures.5

undisclosed_names
# A tibble: 460 × 2
   cnms         disclosed
   <chr>        <lgl>    
 1 Not Reported FALSE    
 2 NOT REPORTED FALSE    
 3 2 CUSTOMERS  FALSE    
 4 6 Customers  FALSE    
 5 4 Customers  FALSE    
 6 3 Customers  FALSE    
 7 8 Customers  FALSE    
 8 9 CUSTOMERS  FALSE    
 9 5 CUSTOMERS  FALSE    
10 3 CUSTOMERS  FALSE    
# ℹ 450 more rows

We can then perform a left_join of the customer-related data in seg_customer to identify disclosures (and non-disclosures) of customer identities.

Following Li et al. (2018), we focus on the period from 1994 to 2010. Here we show some of the result data items, focused on disclosed customers and larger customers.

customers <-
  seg_customer |>
  filter(ctype == "COMPANY") |>
  collect()

disclosure_raw <-
  customers |>
  filter(between(datadate, 
                 as.Date("1994-01-01"), as.Date("2010-12-31"))) |>
  left_join(undisclosed_names, by = "cnms") |>
  mutate(disclosed = coalesce(disclosed, TRUE)) |>
  select(gvkey, datadate, cnms, salecs, disclosed) 

disclosure_raw |>  
  filter(disclosed) |> 
  arrange(desc(salecs))
# A tibble: 86,608 × 5
   gvkey  datadate   cnms                salecs disclosed
   <chr>  <date>     <chr>                <dbl> <lgl>    
 1 002751 2010-06-30 CVS Health Corp     23641. TRUE     
 2 002751 2009-06-30 WALGREEN CO         22888. TRUE     
 3 118122 1999-12-31 General Motors Corp 22302  TRUE     
 4 002751 2010-06-30 WALGREEN CO         21671. TRUE     
 5 002751 2009-06-30 CVS Caremark Corp   20898. TRUE     
 6 118122 2000-12-31 General Motors Corp 20665  TRUE     
 7 002751 2008-06-30 CVS Caremark Corp   20040. TRUE     
 8 002751 2007-06-30 CVS Caremark Corp   18239. TRUE     
 9 012136 2005-06-30 GENERAL ELECTRIC CO 17712  TRUE     
10 002751 2008-06-30 Walgreen Co         17307. TRUE     
# ℹ 86,598 more rows

As a shortcut way of imposing the sample requirements used in Li et al. (2018) (e.g., not financial services), we restrict our analysis to firms (GVKEYs) in the sample of Li et al. (2018). The associated gvkeys are found in the llz_2018 data set from the farr package.6

llz_2018
# A tibble: 5,830 × 1
   gvkey 
   <chr> 
 1 001004
 2 001013
 3 001021
 4 001037
 5 001048
 6 001050
 7 001056
 8 001073
 9 001082
10 001086
# ℹ 5,820 more rows

To get a feel for the data, we can look at the proportion of sales attributed to the largest customer. Note that there are some cases where the largest customer appears to account for more than 100% of sales; many of these appear to be data-entry errors. But there are hundreds of firm-years with revenue from just one customer.

There is also an interesting jump at 10% of sales. While this is likely driven by the 10% cut-off driving mandatory disclosure, it is interesting to note that many firms disclose significant customer information even when the largest customer is below 10% of sales, as seen in Figure 21.1.

sales <-
  funda |>
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D") |>
  select(gvkey, datadate, sale) |>
  collect()

biggest_customers <- 
  disclosure_raw |> 
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |> 
  mutate(max_customer = max(salecs)/sale)

biggest_customers |> 
  filter(between(max_customer, 0, 1)) |>
  ggplot(aes(max_customer)) +
  geom_histogram(binwidth = 0.02) + 
  scale_x_continuous(breaks = seq(0, 1, by = 0.1))
Figure 21.1: Histogram of max_customer

Interestingly, Li et al. (2018) only include firm-year observations where at least one customer is disclosed (whether identified or not) as having at least 10% of total sales. We construct a data set (prin_cust_df) to identify these firms.

prin_cust_df <-
  disclosure_raw |> 
  inner_join(sales, by = c("gvkey", "datadate")) |>
  group_by(gvkey, datadate) |> 
  filter(!is.na(salecs), sale > 0) |>
  summarize(prin_cust = max(salecs/sale, na.rm = TRUE), 
            .groups = 'drop') |>
  mutate(has_prin_cust = prin_cust >= 0.1)

We next restrict the data to the llz_2018 sample and calculate the two measures of disclosure choice used in Li et al. (2018), one measuring the proportion of significant customers whose identities are not disclosed and another weighting that proportion by sales.

disclosure <-
    disclosure_raw |> 
    inner_join(sales, by = c("gvkey", "datadate")) |>
    semi_join(llz_2018, by = "gvkey") |>
    group_by(gvkey, datadate) |>
    summarize(ratio = mean(!disclosed),
              ratio_sale = sum((!disclosed) * salecs)/sum(salecs),
              .groups = "drop") |>
    mutate(year = year(datadate)) 

disclosure |> 
  summarize(across(c(ratio, ratio_sale), 
                   \(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 2
  ratio ratio_sale
  <dbl>      <dbl>
1 0.459      0.461

21.2.2 Data on adoption of IDD

The next data set we use is idd_periods which is returned by the farr function get_idd_periods. This table is based on idd_dates, which is derived from data reported in Klasa et al. (2018) and reproduced in Li et al. (2018).

idd_dates
# A tibble: 24 × 3
   state idd_date   idd_type
   <chr> <date>     <chr>   
 1 AR    1997-03-18 Adopt   
 2 CT    1996-02-28 Adopt   
 3 DE    1964-05-05 Adopt   
 4 FL    1960-07-11 Adopt   
 5 FL    2001-05-21 Reject  
 6 GA    1998-06-29 Adopt   
 7 IL    1989-02-09 Adopt   
 8 IN    1995-07-12 Adopt   
 9 IA    1997-03-18 Adopt   
10 KS    2006-02-02 Adopt   
# ℹ 14 more rows

While idd_dates represents dates of either adoption or rejection of the inevitable disclosure doctrine, the idd_periods table takes a sample period (defined using min_date and max_date) and breaks that sample period into three kinds by state: pre- and post-adoption and post-rejection.

idd_periods <- get_idd_periods(min_date = "1994-01-01", 
                               max_date = "2010-12-31")
idd_periods
# A tibble: 65 × 4
   state period_type   start_date end_date  
   <chr> <chr>         <date>     <date>    
 1 AK    Pre-adoption  1994-01-01 2010-12-31
 2 AL    Pre-adoption  1994-01-01 2010-12-31
 3 AR    Pre-adoption  1994-01-01 1997-03-18
 4 AR    Post-adoption 1997-03-18 2010-12-31
 5 AZ    Pre-adoption  1994-01-01 2010-12-31
 6 CA    Pre-adoption  1994-01-01 2010-12-31
 7 CO    Pre-adoption  1994-01-01 2010-12-31
 8 CT    Pre-adoption  1994-01-01 1996-02-28
 9 CT    Post-adoption 1996-02-28 2010-12-31
10 DC    Pre-adoption  1994-01-01 2010-12-31
# ℹ 55 more rows

21.2.3 Data on state headquarters

The final piece of the puzzle in terms of data is state_hq, which is found in the farr package and derived from data provided by Bill McDonald (see here), which are the data on headquarters location used by Li et al. (2018). The state_hq table reports the range of dates for SEC filings for each cik-ba_state combination.7

state_hq
# A tibble: 53,133 × 4
   cik        ba_state min_date   max_date  
   <chr>      <chr>    <date>     <date>    
 1 0000066382 MI       1994-01-04 2018-10-10
 2 0000070415 NY       1994-01-04 2007-03-14
 3 0000084129 PA       1994-01-05 2018-10-04
 4 0000832922 OH       1994-01-05 2001-01-09
 5 0000909832 CA       1994-01-05 1996-12-20
 6 0000004911 PA       1994-01-06 1996-01-05
 7 0000038777 CA       1994-01-06 2018-11-09
 8 0000726977 IL       1994-01-06 1997-11-14
 9 0000756764 NY       1994-01-06 1999-07-08
10 0000764478 MN       1994-01-06 2018-12-07
# ℹ 53,123 more rows

To use these data, we need to link CIKs with GVKEYs, which we do using the table compseg.names_seg provided alongside Compustat’s segment data.8

ciks <-
  names_seg |>
  filter(!is.na(cik)) |>
  select(gvkey, cik) |>
  collect()

state_hq_linked <- 
  state_hq |>
  inner_join(ciks, by = "cik") |>
  inner_join(disclosure, by = "gvkey", relationship = "many-to-many") |>
  filter(datadate >= min_date, datadate <= max_date) |>
  select(gvkey, datadate, ba_state) |>
  rename(state = ba_state)

21.2.4 Regression analysis

Finally, we pull all these pieces together. Like Li et al. (2018), we delete “post-rejection” observations and log-transform the dependent variable.

reg_data <-
  disclosure |>
  inner_join(prin_cust_df, by = c("gvkey", "datadate")) |>
  filter(has_prin_cust) |>
  inner_join(state_hq_linked, by = c("gvkey", "datadate")) |>
  inner_join(idd_periods, 
             join_by(state,
                     datadate >= start_date, datadate <= end_date)) |>
  filter(period_type != "Post-rejection") |>
  mutate(post = period_type == "Post-adoption",
         ln_ratio = log(1 + ratio),
         ln_ratio_sale = log(1 + ratio_sale)) |>
  select(-start_date, -end_date)

reg_data |> count(period_type)
# A tibble: 2 × 2
  period_type       n
  <chr>         <int>
1 Post-adoption 15836
2 Pre-adoption  12931
fms <- list()
fms[[1]] <- feols(ln_ratio ~ post | gvkey + year, 
                  data = reg_data, vcov = "iid")
fms[[2]] <- feols(ln_ratio_sale ~ post | gvkey + year, 
                  data = reg_data, vcov = "iid")

To keep the analysis simple, we do not include controls (apart from firm and year fixed effects). In a typical regression analysis with firm and year fixed effects, omission of controls will often not have a material impact on coefficient estimates. We run the analysis again, but focused on a subset of firms.

switchers <- 
  reg_data |> 
  distinct(gvkey, post) |> 
  group_by(gvkey) |> 
  filter(n() > 1) |>
  select(gvkey) |>
  distinct() |>
  ungroup()

reg_data_switchers <-
  reg_data |>
  semi_join(switchers, by = "gvkey") 

fms[[3]] <- feols(ln_ratio ~ post | gvkey + year, 
                  data = reg_data_switchers, vcov = "iid")
fms[[4]] <- feols(ln_ratio_sale ~ post | gvkey + year, 
                  data = reg_data_switchers, vcov = "iid")
modelsummary(fms, 
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.2: Effect of IDD adoption
 (1)   (2)   (3)   (4)
postTRUE 0.016** 0.024*** 0.014 0.019**
(0.008) (0.008) (0.009) (0.010)
Num.Obs. 28767 24585 3617 3164
R2 0.672 0.711 0.624 0.661

The analysis in Table 21.2 is similar to that in Table 2 of Li et al. (2018). Below we conduct analysis similar to that reported in Table 3 of Li et al. (2018), which examines the effect of IDD adoption on disclosure by year related to adoption.

We first make a little function that takes a vector of numbers and returns them as a factor that has nicely formatted strings order by the original numbers.9 This will be useful for presenting “nice” regression results.

factor_t <- function(t) {
  t <- relevel(as.factor(t), ref = "-1")
  levels(t) <- gsub("^([0-9]+)", "+\\1", levels(t))
  levels(t) <- gsub("([0-9]+)", " \\1", levels(t))
  levels(t) <- gsub("^", " ", levels(t))
  t
}

We next make a function that turns a vector of year-difference values into a factor—with values outside \((-5, +5)\) collapsed into \(-5\) and \(+5\) as appropriate—and labels the output in a regression output-friendly way. Note that if a firm is never-treated, it’s value of \(t\) will be missing and we set the value for these years to -Inf (R’s way of saying \(-\infty\)).

year_diff <- function(t) {
  t <- case_when(t < -5 ~ -5,
                 t > 5 ~ 5,
                 is.na(t) ~ -Inf,
                 .default = t) 
  factor_t(t)
}
switch_years <-
  reg_data |>
  group_by(gvkey) |> 
  arrange(datadate) |> 
  filter(period_type == "Post-adoption" & 
           lag(period_type) == "Pre-adoption") |>
  group_by(gvkey) |>
  summarize(adoption_year = min(year), .groups = "drop")

reg_data_t <-
  reg_data |>
  left_join(switch_years, by = "gvkey") |>
  mutate(t = year_diff(year - adoption_year))
fms <- list()
fms[[1]] <- feols(ln_ratio ~ t | gvkey + year, 
                  data = reg_data_t, vcov = "iid")
fms[[2]] <- feols(ln_ratio_sale ~ t | gvkey + year,
                  data = reg_data_t, vcov = "iid")
fms[[3]] <- feols(ln_ratio ~ t | gvkey + year,
                  data = subset(reg_data_t, !grepl("Inf", t)), vcov = "iid")
fms[[4]] <- feols(ln_ratio_sale ~ t | gvkey + year, 
                  data = subset(reg_data_t, !grepl("Inf", t)), vcov = "iid")
modelsummary(fms, 
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared", "adj.r.squared"),
             stars = c('*' = .1, '**' = 0.05, '***' = .01))
Table 21.3: Effect of IDD adoption by year
 (1)   (2)   (3)   (4)
t - 5 −0.005 0.001 −0.017 −0.009
(0.019) (0.021) (0.027) (0.029)
t - 4 −0.017 −0.030 −0.026 −0.038
(0.022) (0.023) (0.025) (0.026)
t - 3 −0.030 −0.029 −0.031 −0.031
(0.020) (0.020) (0.021) (0.022)
t - 2 −0.008 −0.003 −0.014 −0.010
(0.018) (0.018) (0.019) (0.019)
t + 0 −0.001 0.016 0.007 0.020
(0.015) (0.015) (0.016) (0.017)
t + 1 0.009 0.025 0.021 0.031
(0.016) (0.017) (0.018) (0.020)
t + 2 −0.007 0.010 0.009 0.017
(0.017) (0.018) (0.021) (0.023)
t + 3 0.023 0.021 0.038 0.026
(0.019) (0.020) (0.025) (0.027)
t + 4 0.028 0.039* 0.041 0.042
(0.019) (0.021) (0.027) (0.030)
t + 5 0.044*** 0.048*** 0.046 0.039
(0.015) (0.016) (0.032) (0.036)
Num.Obs. 28767 24585 3110 2740
R2 0.673 0.712 0.625 0.668
R2 Adj. 0.599 0.637 0.568 0.611

Next we make a small function to plot the coefficients over time. The function first arranges the coefficients for each period in event-time as a data frame and turns the coefficient labels (e.g., \(t + 1\)) into values (i.e., \(1\)). In making the plot, we add in the reference period (\(t-1\)), which has a coefficient value of \(0\) by definition, and add 95% confidence intervals for the coefficients.

plot_coefs <- function(fm) {
  
  coefs <- 
    tibble(t = names(coef(fm)),
           value = coef(fm),
           se = se(fm)) |>
    mutate(t = as.integer(gsub("[t ]+", "", t))) |>
    filter(t != -Inf)

  ci <- 0.95
  mult <- qnorm(1 - (1 - ci)/2)

  plot <-
    tibble(t = -1, value = 0, se = 0) |>
    bind_rows(coefs) |>
    mutate(mean = value,
           top = value + mult * se,
           bot = value - mult * se) |>
    ggplot(aes(x = t, y = mean)) + 
    geom_errorbar(aes(ymin = bot, ymax = top), width = .1) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = seq(from = min(coefs$t), 
                                    to = max(coefs$t)))
  
  plot
}

And we use this function to plot the coefficients for the first model.

plot_coefs(fms[[1]])
Figure 21.2: Plot of coefficients by year

21.2.5 Discussion questions

  1. The proprietary cost hypothesis tested by Li et al. (2018) posits that increased cost of disclosure will lead to reduced supply of disclosure? What rival theories exist that would make alternative predictions? Are there other elements of disclosure theory that might be tested? Why do you think Li et al. (2018) focused on this specific element of the proprietary cost hypothesis?
  2. In the analysis above, we do not include the control variables considered by Li et al. (2018). For example, Li et al. (2018) “include R&D expenditures to sales, advertisement [sic] expenditure to sales, and intangible assets scaled by total assets to control for a firm’s proprietary costs of disclosure.” Using the approach outlined in Chapter 4, in what circumstances would it be necessary to control for “a firm’s proprietary costs of disclosure” in this way? Do these circumstances seem applicable in this setting, where the treatment is a (presumably exogenous) shock to disclosure costs and the outcome is disclosure choices?
  3. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of Table 21.2, respectively? What does this tell you about what drives the regression results in this setting? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about the role of the “non-switchers” (i.e., firms not in the switchers data frame) in the regression?
  4. Would you expect the inclusion of controls (see the question above) to have a significant impact on the regression results? Why or why not?
  5. What differs between the regressions reported in columns (1) and (3), and (2) and (4) of Table 21.3, respectively? What happens if you omit the year fixed effects from both sets of regressions? What does this tell you about what drives the regression results in this setting?
  6. What patterns do you observe in the coefficients reported in the table titled “Effect of IDD adoption by year”? Do these conform to what you would expect from Li et al. (2018)? (It may be easiest to focus on these in groups, e.g., those in \(t-5\) to \(t-2\), those in \(t + 0\) to \(t+3\) and those for \(t+4\) and \(t+5\).)
  7. How do the variables in Table 21.3 differ from those used in Table 3 of Li et al. (2018)? Modify the code above (e.g., the year_diff function) to produce analysis closer to that reported in Table 3 of Li et al. (2018).
  8. How helpful do you find Figure 21.2 (plot of the coefficients by year)?
  9. The year_diff function collapses years after \(t + 5\) and before \(t - 5\) into years \(t + 5\) and \(t - 5\), respectively. When would this approach make sense? What would be one alternative approach to handling these years? Does your suggested approach make a difference in this setting?
  10. Describe the data set created from the following code. What proportion of the firms in the data set have same_state equal to TRUE? For the purposes of empirical analysis, do the firms with same_state equal to FALSE enhance, or detract from, our ability to draw causal inferences about the effect of adoption of IDD?
    switch_years <-
      reg_data_switchers |>
      group_by(gvkey) |> 
      arrange(datadate) |> 
      mutate(same_state = state == lag(state), 
             adoption_year = period_type == "Post-adoption" & 
               lag(period_type) == "Pre-adoption") |> 
      filter(adoption_year) |> 
      ungroup() 
  1. What issues might be implied by the following data? How might you address these?
    reg_data_t |> count(t)
    # A tibble: 12 × 2
       t           n
       <fct>   <int>
     1 " - 1"    322
     2 " -Inf" 25657
     3 " - 5"    232
     4 " - 4"    117
     5 " - 3"    154
     6 " - 2"    219
     7 " + 0"    385
     8 " + 1"    306
     9 " + 2"    233
    10 " + 3"    192
    11 " + 4"    168
    12 " + 5"    782

  1. This maximum exists, not only because it seems inhumane to require even simulated students to attend school for longer, but also because the grade effects are hard-coded and only for 12 years.↩︎

  2. This can be seen from examination of the source code for the function, but there is no need to do so.↩︎

  3. The parallel trends assumption is likely subtly violated because of regression-to-the mean effects in test scores, but not in a way likely to be detected using the usual “parallel trends plots” used in the literature. A researcher seeing the underlying data-generating process is likely to find it very easy the imagine the “can-opener” of parallel trends we discussed in Chapter 19. To paraphrase Macbeth: “Is this a [can-opener] which I see before me, The handle toward my hand? Come, let me clutch thee. I have thee not, and yet I see thee still. Art thou not, fatal vision, sensible To feeling as to sight? … Thou marshal’st me the way that I was going [i.e., to use DiD], And such an instrument I was to use.”↩︎

  4. See Gartner (2011).↩︎

  5. See here for details on the construction of this data set. This likely mirrors a process used for Li et al. (2018) itself.↩︎

  6. See here for the code used in creating this data set.↩︎

  7. Any gaps are filled in by extending the min_date back to the date after the preceding max_date.↩︎

  8. Note that names_seg provides just one CIK for each GVKEY, though firms can change CIKs over time, which means that we will lose observations related to firms that previously used a different CIK from that found on names_seg.↩︎

  9. Examine the output of as.character(factor_t(c(-2, -1, 0, 1, 2))) to see what it is doing.↩︎