class: center, middle, inverse, title-slide .title[ # Data Science for Economists ] .subtitle[ ## Difference-in-differences ] .author[ ### Kyle Coombs ] .date[ ### Bates College |
DCS/ECON 368
]

---
name: toc

<style type="text/css">
@media print {
  .has-continuation {
    display: block !important;
  }
}
</style>

# Table of contents

- [Prologue](#prologue)
- [Difference what with who?](#what-it-did)
- [Example: Earned Income Tax Credit](#EITC)

---
name: prologue
class: inverse, center, middle

# Prologue

---
# Attribution

- These slides are adapted from work by Nick Huntington-Klein, Ed Rubin, and Scott Cunningham
- They're all superb econometric instructors and I highly recommend their work

---
# Check-in on causality

- We've been talking about causality for a bit now
- Popcorn style: what do we need for causality?

--

- We need to account for everything that is correlated with treatment
- There are two extreme strategies:
  1. Control variables for everything!
    - Did we control for it all? How do we know?
  2. Get a treatment that is uncorrelated with everything else
    - How do we know we got rid of all the correlation?

And then there's a middle ground...

---
class: inverse, center, middle
name: causal-inference-review

# Causal Inference Review

---
# An old causality problem: Cholera

- DiD is an old method, with its first documented use by John Snow in 1854 to show that cholera was transmitted through water<sup>1</sup>
- Cholera was a big problem in London in the 1800s
- Many were convinced that _miasma_ was the cause of cholera
- John Snow was a doctor in London in the 1800s
- He thought cholera might be due to contaminated water. But how to prove it?
- An experiment was out of the question, ethically and practically
- So he needed some other variation in water quality

.footnote[<sup>1</sup> Unlike Jon Snow, who knows nothing, John Snow knew quite a lot.]

---
# A natural experiment

- Two water companies supplied much of south London: Lambeth, and Southwark and Vauxhall (SV)
- In 1852, Parliament passed a bill requiring all water companies to move their intakes further up the Thames
- Natural experiment: the Lambeth water company moved its intake between 1849 and 1854; the Southwark and Vauxhall (SV) water company delayed
- John Snow went door-to-door collecting information on where people got their water in 1849, and he counted cholera deaths in 1849 and 1854

|supplier         |  1849|  1854|  diff|
|:----------------|-----:|-----:|-----:|
|Non-Lambeth Only | 134.9| 146.6|  11.7|
|Lambeth + Others | 130.1|  84.9| -45.2|

- Let's think about how John Snow could have estimated the effect of the intake move on death rates

---
# Let's consider the counterfactual

Let's look at .purple[observed] and .blue[counterfactual] death rates per 10K people (a subscript of 1 indicates the water supply was moved, 0 that it was not)

.pull-left[.center[
.purple[Death.sub[1,Lambeth]] = .purple[85]
<br>.blue[Death.sub[0,Lambeth]] = .blue[145]
]]

.pull-right[.center[
.blue[Death.sub[1,SV]] = .blue[75]
<br>.purple[Death.sub[0,SV]] = .purple[135]
]]

--

Each company's treatment effect is the difference between its two potential outcomes

--

<br> `\(\quad\quad\tau\)`.sub[Lambeth] = .purple[Death.sub[1,Lambeth]] - .blue[Death.sub[0,Lambeth]] = -60

--

<br> `\(\quad\quad\tau\)`.sub[SV] = .blue[Death.sub[1,SV]] - .purple[Death.sub[0,SV]] = -60

--

What happens if we just compare what we can observe, Lambeth to SV?

- Well, Lambeth served the upper-income areas (western London), so its customers might be healthier anyway

Okay, but why not compare Lambeth before vs. Lambeth after?
- London was actively fighting cholera in other ways, so attributing any drop in deaths to the intake move is difficult

Both of these are examples of selection bias

---
# Selection bias

- We have defined selection bias as the difference between the unobserved control outcomes for the treated group and the observed outcomes for the control group

$$
`\begin{align}
&\color{#e64173}{\mathop{E}\left( y_i\mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{E}\left( y_i\mid D_i =0 \right)} \\
&\quad\quad = \tau + \underbrace{\color{#e64173}{\mathop{E}\left(\color{#6A5ACD}{y_{0,i}} \mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{E}\left( y_{0,i}\mid D_i =0 \right)}}_{\color{#FFA500}{\text{Selection bias}}}
\end{align}`
$$

.hi-slate[Practical problem:] Selection bias is also difficult to observe

$$
`\begin{align}
\underbrace{\color{#e64173}{\mathop{E}\left(\color{#6A5ACD}{y_{0,i}} \mid D_i = 1 \right)}}_{\color{#e64173}{\text{Unobservable}}} - \color{#6A5ACD}{\mathop{E}\left( y_{0,i}\mid D_i =0 \right)}
\end{align}`
$$

(back to the *fundamental problem of causal inference*)

--

.hi-slate[Bigger problem:] If selection bias is present, our estimate for `\(\tau\)` is biased, preventing us from understanding the causal effect of treatment.

--

Sounds a bit like omitted-variable bias, right? That's because they're all forms of .hi[endogeneity!]

--

Our .purple[treatment] variable is correlated with something that makes the two groups different.

---
# John Snow got pre-period data

.pull-left[.center[
.purple[Death.sub[1,Lambeth].super[Post]] = .purple[85]
<br>.red[Death.sub[0,Lambeth].super[Pre]] = .red[130]
]]

.pull-right[.center[
.blue[Death.sub[0,SV].super[Post]] = .blue[147]
<br>.red[Death.sub[0,SV].super[Pre]] = .red[135]
]]

--

- We can take each group's average over time and subtract it from each of that group's observations
  - The fixed effect for Lambeth, applied to its post-treatment observation: .purple[Death.sub[1,Lambeth].super[Post]] - Avg Death.sub[Lambeth] = .purple[85] - 107.5 = .purple[-22.5]
- This is a method called difference-in-differences
  1. Difference in the groups' means before treatment
  2. Difference in the groups' means after treatment
  3. Difference in these differences

---
# Bridge from fixed effects

- Note that we used group averages to demean the data of *between* variation, leaving us just *within* variation
  - That's a fixed effect!
- The year fixed effect removes the variation both groups experience over time
- The group fixed effect removes each group's average level - the *between*-group differences
- The untreated group, or control group, is our .hi-slate[counterfactual]
- Then, we compare the within-variation for the treated group vs. the within-variation for the untreated group
- Voilà: as long as the *within* variation left over is as good as randomly assigned, we have causality
  - Put another way, as long as nothing else affects the outcome for the treated group between the pre- and post-periods, we'll have causality
  - Note: other things can change over time, as long as they hit both groups equally - the treatment must be the only thing that changes for the treated group alone

---
# Difference-in-Differences

What *changes* are included in each value?

- .red[Untreated Before]: Untreated Group Mean
- .blue[Untreated After]: Untreated Group Mean + Time Effect
- .red[**Treated Before**]: Treated Group Mean
- .blue[**Treated After**]: Treated Group Mean + Time Effect + .hi-purple[Treatment Effect]

- .blue[Untreated After] - .red[Untreated Before] = Time Effect
- .blue[**Treated After**] - .red[**Treated Before**] = Time Effect + .hi-purple[Treatment Effect]

DiD = (.blue[**TA**] - .red[**TB**]) - (.blue[UA] - .red[UB]) = .hi-purple[Treatment Effect]

(Abbreviations for Untreated and Treated Before/After to save space - let's check this arithmetic on the next slide)
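---
# Checking the arithmetic in R

A minimal sketch of that formula, plugging in the four means from John Snow's table (nothing here beyond the table itself):

``` r
# Death rates per 10,000, from Snow's table
UB <- 134.9; UA <- 146.6  # Untreated (Non-Lambeth Only) Before / After
TB <- 130.1; TA <- 84.9   # Treated (Lambeth + Others) Before / After

# DiD = (TA - TB) - (UA - UB) = -45.2 - 11.7 = -56.9
(TA - TB) - (UA - UB)
```

The intake move is estimated to have reduced cholera deaths by about 57 per 10,000.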
---
# Visualizing Lambeth vs SV

|supplier         |  1849|  1854|  diff|
|:----------------|-----:|-----:|-----:|
|Non-Lambeth Only | 134.9| 146.6|  11.7|
|Lambeth + Others | 130.1|  84.9| -45.2|

<img src="11-diff-in-diff_files/figure-html/snow-data-3-1.png" style="display: block; margin: auto;" />

---
# Visualizing Lambeth vs SV

|supplier         |  1849|  1854|  diff|
|:----------------|-----:|-----:|-----:|
|Non-Lambeth Only | 134.9| 146.6|  11.7|
|Lambeth + Others | 130.1|  84.9| -45.2|

<img src="11-diff-in-diff_files/figure-html/snow-diff-1.png" style="display: block; margin: auto;" />

---
# Difference-in-Differences

- Because the requirements to use it are so low, DiD is used a *lot*
- At its simplest, we need a treatment that *goes into effect* at a particular time, and we need a group that is *treated* and a group that is *not*
  - Any time a policy is enacted but isn't enacted everywhere at once? DiD!
- Plus, the logic is pretty straightforward
- The question DiD tries to answer is "what was the effect of (some policy) on the people who were affected by it?"
- We have some data on the people who were affected, both before the policy went into effect and after
- However, we can't just compare before and after, because things usually change over time for other reasons
- So we compare to people who weren't affected by the policy

---
# Difference-in-Differences

- What if there are more than four data points?
- Usually these four points would be four means from lots of observations, not just two groups in two time periods
- How can we do this and get things like standard errors, and perhaps include controls?
- Why, use OLS regression of course - just use binary variables and an interaction term to get a DiD

`$$Y_{it} = \beta_0 + \beta_1After_t + \beta_2Treated_i + \beta_3After_t \times Treated_i + \varepsilon_{it}$$`

where `\(After_t\)` is a binary variable for being in the post-treatment period, and `\(Treated_i\)` is a binary variable for being in the treated group

```
## # A tibble: 4 × 6
##    year supplier         treatment           deathrate After Treated
##   <dbl> <chr>            <chr>                   <dbl> <lgl> <lgl>
## 1  1849 Non-Lambeth Only Dirty                   135.  FALSE FALSE
## 2  1849 Lambeth + Others Mix Dirty and Clean     130.  FALSE TRUE
## 3  1854 Non-Lambeth Only Dirty                   147.  TRUE  FALSE
## 4  1854 Lambeth + Others Mix Dirty and Clean      84.9 TRUE  TRUE
```

(The next slide runs this regression on these four rows)
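---
# That regression, sketched

A sketch of that regression using the four cell means above (with four observations and four coefficients the model fits perfectly, so ignore the standard errors; `feols()` would give the same point estimates):

``` r
library(dplyr)

# The four cell means from Snow's table as a tiny dataset
snow <- tibble::tibble(
  year      = c(1849, 1849, 1854, 1854),
  supplier  = rep(c("Non-Lambeth Only", "Lambeth + Others"), 2),
  deathrate = c(134.9, 130.1, 146.6, 84.9)
) %>%
  mutate(After = year == 1854, Treated = supplier == "Lambeth + Others")

# The coefficient on AfterTRUE:TreatedTRUE is the DiD estimate: -56.9
lm(deathrate ~ After * Treated, data = snow)
```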
---
# Difference-in-Differences

- How can we interpret this using what we know?

`$$Y_{it} = \beta_0 + \beta_1After_t + \beta_2Treated_i + \beta_3After_t \times Treated_i + \varepsilon_{it}$$`

- `\(\beta_0\)` is the prediction when `\(Treated_i = 0\)` and `\(After_t = 0\)` `\(\rightarrow\)` the .red[Untreated Before] mean!
- `\(\beta_1\)` is the *time difference* for `\(Treated_i = 0\)` `\(\rightarrow\)` .blue[UA] - .red[UB]
- `\(\beta_2\)` is the *treatment difference* for `\(After_t = 0\)` `\(\rightarrow\)` .red[**TB**] - .red[UB]
- `\(\beta_3\)` is *how much bigger the Before-After difference* is for `\(Treated_i = 1\)` than for `\(Treated_i = 0\)` `\(\rightarrow\)` (.blue[**TA**] - .red[**TB**]) - (.blue[UA] - .red[UB]) = DiD!

---
# Graphically

<img src="11-diff-in-diff_files/figure-html/gif-1.gif" style="display: block; margin: auto;" />

---
# Design vs. Regression

- There is a distinction between the *regression model* and the *research design*
  - This is also true for fixed effects
- We have a model with an interaction term
  - Not all models with interaction terms are DiD!
- It's DiD because it's an interaction between treated/control and before/after
- If you don't have a before/after, or you don't have a control group, that same setup may tell you something interesting, but it won't be DiD!

---
# Parallel Trends

- This assumption (that nothing else changes at the same time) is the poorly named "parallel trends" assumption
- Again, this assumes that, *if the treatment hadn't happened to anyone*, the gap between the two groups would have stayed the same
- You can check whether *prior trends* are the same: if we have multiple pre-treatment periods, was the gap changing a lot during that period?
- There are methods to "adjust for prior trends" to fix parallel trends violations, or related methods like Synthetic Control
  - These are beyond the scope of this class and also often snake oil

---
# Prior Trends

- Let's look at an example involving an expansion of the EITC in 1993
- Looks like the gap between them is pretty constant before 1994! They move up and down, but the *gap* stays the same. That's good.

<img src="11-diff-in-diff_files/figure-html/prior-trends-1.png" style="display: block; margin: auto;" />

---
# Prior Trends

- Formally, prior trends being the same tells us nothing about parallel trends
- But differing prior trends can be a warning sign - what if the gap was closing *anyway*?
- For example, what if you compared cholera rates between Lambeth customers and Denmark, which was relatively cholera-free due to a quarantine?

<img src="11-diff-in-diff_files/figure-html/prior-trends-df-1.png" style="display: block; margin: auto;" />

---
# Parallel Trends

- Just because *prior trends* are equal doesn't mean that *parallel trends* holds.
- *Parallel trends* is about what the before-after change *would have been* - we can't see that!

--

- For example, let's say we want to see the effect of online teaching on student test scores, using COVID school shutdowns to get a Before/After
- As of March/April 2020, some schools had gone online (Treated) and others hadn't (Untreated)
- Test score trends were probably pretty similar in the Before periods (Jan/Feb 2020), so prior trends are likely the same
- But LOTS of stuff changed between Jan/Feb and Mar/Apr - like, uh, coronavirus, lockdowns, etc. - not just online teaching! So *parallel trends* likely wouldn't hold

---
# What if there are more groups?

- You can recognize `\((\text{Year} \geq {1994})_t\)` and `\(\text{Has Kids}_i\)` from the EITC example as *fixed effects*
- `\(\text{Has Kids}_i\)` (our `\(Treated_i\)`) is a fixed effect for *group* - we only need one coefficient for it since there are only two groups
- And `\((\text{Year} \geq {1994})_t\)` is a fixed effect for *time* - one coefficient for two time periods
- You can have more than one set of fixed effects like this! Our interpretation is now within-group *and* within-time
  - (i.e. comparing the within-group variation across groups; see the sketch on the next slide)
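---
# Two sets of fixed effects in R

A sketch of that regression with `fixest`. The dataset `eitc` and its columns (`work`, `haskids`, `year`) are hypothetical stand-ins for the EITC example, not data we've loaded:

``` r
library(fixest)
library(dplyr)

# Hypothetical person-year data: work (outcome), haskids (group), year
eitc <- eitc %>%
  mutate(treated = haskids & (year >= 1994))

# The group and time fixed effects absorb the main effects;
# the coefficient on treated is the DiD estimate
feols(work ~ treated | haskids + year, data = eitc)
```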
---
name: manygroups
class: inverse, center, middle

# Example: Multiple Groups

---
# Multiple Treated and Control Groups

- We can extend DiD to having more than two groups, some of which get treated and some of which don't
- And more than two time periods! Multiple before and/or multiple after
- We don't need a full set of interaction terms; we still only need the one, which we can now call `\(CurrentlyTreated_{it}\)`
- **If you have more than two groups and/or more than two time periods, then this is what you should be doing**

---
# Multiple treatment groups

- Let's make some quick example data to show this off, with the first treated period being period 6, the treated groups being 1 and 9, and a true effect of 3

``` r
set.seed(123)
did_data <- tibble(group = sort(rep(1:3000, 10)),
                   time = rep(1:10, 3000)) %>%
  mutate(treat = group %in% c(1, 9),
         post = time >= 6,
         y = group + time + 3 * treat * post + rnorm(30000))
did_data
```

```
## # A tibble: 30,000 × 5
##    group  time treat post      y
##    <int> <int> <lgl> <lgl> <dbl>
##  1     1     1 TRUE  FALSE  1.44
##  2     1     2 TRUE  FALSE  2.77
##  3     1     3 TRUE  FALSE  5.56
##  4     1     4 TRUE  FALSE  5.07
##  5     1     5 TRUE  FALSE  6.13
##  6     1     6 TRUE  TRUE  11.7
##  7     1     7 TRUE  TRUE  11.5
##  8     1     8 TRUE  TRUE  10.7
##  9     1     9 TRUE  TRUE  12.3
## 10     1    10 TRUE  TRUE  13.6
## # ℹ 29,990 more rows
```

---
# Multiple treatment groups

``` r
# Put group first so the clustering is on group (feols clusters
# standard errors by the first fixed effect by default)
etable(feols(y ~ post*treat | group + time, data = did_data))
```

```
##                      feols(y ~ post ..
## Dependent Var.:                      y
##
## postTRUE x treatTRUE 3.114*** (0.2430)
## Fixed-Effects:       -----------------
## group                              Yes
## time                               Yes
## ____________________ _________________
## S.E.: Clustered              by: group
## Observations                    30,000
## R2                              1.0000
## Within R2                      0.00179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Event study plot

``` r
# time = 5 (the last pre-treatment period) is the reference period
est_did = feols(y ~ i(time, treat, 5) | group + time, data = did_data)
iplot(est_did, xlab = "Time", ylab = "Treatment Effect",
      main = "Event Study Plot")
```

<img src="11-diff-in-diff_files/figure-html/event-study-iplot-1.png" style="display: block; margin: auto;" />

---
# Staggered Treatment

- Let's say we have a treatment that starts at different times for different groups

``` r
# Create example data: the treatment effect is 3 and treatment starts
# at different times, with 1000 units in each cohort
set.seed(123)
staggered_data <- tibble(
  id = rep(1:3000, each = 10),
  time = rep(1:10, times = 3000),
  cohort = case_when(id <= 1000 ~ 5, id <= 2000 ~ 8, TRUE ~ 1000),
  treated = time >= cohort,
  time_to_treat = ifelse(cohort < 1000, time - cohort, -1000),
  y = id + time + 3 * treated + rnorm(30000, 0, 5)
)
etable(feols(y ~ treated | id + time, data = staggered_data))
```

```
##                 feols(y ~ treat..
## Dependent Var.:                 y
##
## treatedTRUE     3.015*** (0.1021)
## Fixed-Effects:  -----------------
## id                            Yes
## time                          Yes
## _______________ _________________
## S.E.: Clustered            by: id
## Observations               30,000
## R2                        0.99997
## Within R2                 0.02913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Two-way fixed effects model

- We just ran a two-way fixed effects (TWFE) model
- We have a fixed effect for group and a fixed effect for time
- This is often done when we have a panel dataset with individuals over time
- The generic formula is:

$$
`\begin{align}
y_{it} = \underbrace{\alpha_i}_{\text{Individual FE}} + \underbrace{\alpha_t}_{\text{Time FE}} + \beta CurrentlyTreated_{it} + \varepsilon_{it}
\end{align}`
$$

- Where did the treatment and pre-post indicator go?

--

- They're in the fixed effects! We don't need to include them separately
  - (the next slide verifies the equivalence)
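---
# Seeing the equivalence

A quick sketch showing that the fixed effects really do absorb those indicators: the absorbed-FE version and an explicit dummy-variable version give the identical point estimate. (Run on a subset spanning all three cohorts, so `lm()` doesn't need thousands of dummies.)

``` r
# A small subset of staggered_data covering every cohort
small <- subset(staggered_data, id %in% c(1:50, 1001:1050, 2001:2050))

# TWFE with absorbed fixed effects
coef(feols(y ~ treated | id + time, data = small))["treatedTRUE"]

# The same regression with explicit dummy variables
coef(lm(y ~ treated + factor(id) + factor(time), data = small))["treatedTRUE"]
```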
---
# Cool event study plot

``` r
# ref = -1 is the usual reference period; -1000 marks the never-treated
est_did = feols(y ~ i(time_to_treat, ref = c(-1, -1000)) | id + time,
                data = staggered_data)
iplot(est_did, xlab = "Relative Time", ylab = "Treatment Effect",
      main = "Event Study Plot")
```

<img src="11-diff-in-diff_files/figure-html/event-study-1.png" style="display: block; margin: auto;" />

---
# Downers and Assumptions

- So... does this all work?
- That example got pretty close to the truth of 3, but who knows in other cases!<sup>1</sup>
- What needs to be true for this to work?

--

- DiD and TWFE give a causal effect as long as *the only reason the gap changed* was the treatment
- For TWFE to give a causal effect with panel data, we assume no endogenous variation across time within unit
  - It gets even messier if we have staggered rollout of treatment
- In DiD, we need to assume that there's no endogenous variation *across this particular before/after time change*
  - An easier assumption to justify, but still an assumption!

.footnote[<sup>1</sup> We do know. It fails a lot.]

---
# Before we finish, a warning!

- DiD is so nice and simple that it feels like you can get real flexible with it
- But the stuff we're covering in this class - up through TWFE - *relies very strongly* on the assumptions we made
- If you break them, the *research design* may hold up, *but the estimator really doesn't*, and you may need a different estimator

---
# Context

- TWFE *quickly falls apart* if you have different groups getting treated at different times, called "staggered treatment"

1. **Forbidden comparison**: your early treated group becomes a control for your later treated group
2. If the effect increases/decreases in relative time, your early treated group gives a "bad comparison"
3. See slides by [Andrew Baker](https://andrewcbaker.netlify.app/did#1) for an example explanation of the problem

---
# TWFE with Het. Treatment Timing

Let's create a simple example with three cohorts and different treatment times:

``` r
# Create example data
n_periods <- 10; n_groups <- 10; n_obs <- n_periods * n_groups
set.seed(123)
twfe_data <- data.frame(id = rep(1:n_groups, each = n_periods),
                        time = rep(1:n_periods, times = n_groups)) %>%
  mutate(
    # Cohort 1 never treated, Cohort 2 treated at t=5, Cohort 3 treated at t=8
    cohort = case_when(id < 3 ~ 1, id < 5 ~ 2, id < 8 ~ 3, TRUE ~ 1),
    treated = (cohort == 2 & time >= 5) | (cohort == 3 & time >= 8),
    # True treatment effects: Cohort 2: effect = 4, Cohort 3: effect = 2
    true_effect = case_when(
      cohort == 2 & treated ~ 4,
      cohort == 3 & treated ~ 2,
      TRUE ~ 0
    ),
    # Outcome = unit fixed effects (id) + time fixed effects (time)
    #           + treatment effects + noise
    y = id + time + true_effect + rnorm(n_obs)
  )
```

---
# TWFE with Het. Treatment Timing

Treatment effects were:

1. Cohort 2: 4 after t=5
2. Cohort 3: 2 after t=8

``` r
# Run TWFE regression
twfe_model <- feols(y ~ treated | id + time, data = twfe_data)
etable(twfe_model) %>% kable(format="markdown")
```

| |twfe_model |
|:---------------|:-------------|
|Dependent Var.: |y |
| | |
|treatedTRUE |3.628 (1.554) |
|Fixed-Effects: |------------- |
|id |Yes |
|time |Yes |
|_______________ |_____________ |
|S.E.: Clustered |by: id |
|Observations |30 |
|R2 |0.93918 |
|Within R2 |0.49396 |

What's going on?
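---
# Poking at it by hand

One way to see what's going on: compute the individual 2x2 DiDs that TWFE is implicitly averaging. This is a hand-rolled sketch (`did_2x2` is a hypothetical helper, not from any package); the Bacon decomposition a few slides ahead does this properly. Compare the results to the true effects of 4 and 2.

``` r
# A 2x2 DiD computed from cell means of twfe_data
did_2x2 <- function(d, treat_cohort, control_cohort, pre, post) {
  m <- function(g, t) mean(d$y[d$cohort == g & d$time %in% t])
  (m(treat_cohort, post) - m(treat_cohort, pre)) -
    (m(control_cohort, post) - m(control_cohort, pre))
}

# Clean comparison: Cohort 2 vs. never-treated, around t = 5
did_2x2(twfe_data, 2, 1, 1:4, 5:10)

# Forbidden comparison: Cohort 3 vs. already-treated Cohort 2, around t = 8
did_2x2(twfe_data, 3, 2, 5:7, 8:10)
```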
---
# TWFE with Het. Treatment Timing

<img src="11-diff-in-diff_files/figure-html/twfe-wrong-plot-1.png" style="display: block; margin: auto;" />

The TWFE estimate is biased because:

1. Early treated units (Cohort 2) become controls for later treated units (Cohort 3)
2. The treatment effects differ across cohorts (4 vs 2)
3. TWFE weights these comparisons incorrectly

This is known as the "forbidden comparison" problem in staggered DiD.

---
# Different comparisons

<img src="11-diff-in-diff_files/figure-html/comparisons-1.png" style="display: block; margin: auto;" />

---
# Bacon-Decomposition

``` r
library(bacondecomp)
# Decompose the TWFE estimate into its underlying 2x2 comparisons
bgd <- bacon(y ~ treated, data = twfe_data, id_var = "id", time_var = "time")
```

```
##                       type  weight avg_est
## 1 Earlier vs Later Treated 0.18182 1.78361
## 2 Later vs Earlier Treated 0.13636 5.73666
## 3     Treated vs Untreated 0.68182 3.69853
```

```
## [1] 3.628289
```

The weighted average of these 2x2 comparisons recovers the TWFE estimate of 3.628.

<img src="11-diff-in-diff_files/figure-html/bacon-decomp-plot-1.png" style="display: block; margin: auto;" />

---
# Fixes

- There are many fixes for this problem - one is sketched on the next slide
- Many go beyond the scope of the class
  - But you have the starting tools to understand them
- Also, many are written up to implement using R, Stata, Python, etc.
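---
# One fix, sketched

As one example, `fixest` ships the Sun & Abraham (2021) interaction-weighted estimator, which avoids forbidden comparisons by estimating cohort-specific effects and then aggregating them. A sketch using `staggered_data` from earlier (its never-treated units have `cohort = 1000`, a value `time` never reaches, which is how `sunab()` identifies them):

``` r
library(fixest)

# Sun & Abraham (2021): cohort-by-relative-period effects,
# aggregated with interaction weights
sa <- feols(y ~ sunab(cohort, time) | id + time, data = staggered_data)

# Event-study coefficients by period relative to treatment
iplot(sa, xlab = "Relative Time", main = "Sun & Abraham Event Study")

# Aggregate to a single average treatment effect on the treated
summary(sa, agg = "att")
```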