Data Science for Economists

class: center, middle, inverse, title-slide

.title[
# Data Science for Economists
]
.subtitle[
## Neighborhoods and Upward Mobility
]
.author[
### Kyle Coombs
]
.date[
### Bates College | <a href="https://github.com/big-data-and-economics">EC/DCS 368</a>
]

---

name: toc

``` r
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
  fig.align="center", fig.width=6, fig.height=4, 
  # out.width="748px", #out.length="520.75px",
  dpi=300, #fig.path='Figs/',
  cache=TRUE, warning=FALSE, message=FALSE#, echo=FALSE 
  )

library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

``` r
library(gganimate)
library(broom)
library(fixest)

outcomes <- read_csv('data/national_estimates_by_cohort_primary_outcomes.csv', 
  col_select=c('cohort','kfr_pooled_pooled_p25','kfr_pooled_pooled_p75')) %>%
  rename(kfr_p25=kfr_pooled_pooled_p25, kfr_p75=kfr_pooled_pooled_p75)
```

```
## Rows: 15 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): cohort, kfr_pooled_pooled_p25, kfr_pooled_pooled_p75
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

``` r
outcomes <- mutate(outcomes, 
  kfr_p25_zcore=(kfr_p25-mean(kfr_p25))/sd(kfr_p25), 
  kfr_p75_zcore=(kfr_p75-mean(kfr_p75))/sd(kfr_p75))

results <- feols(kfr_p75_zcore ~ kfr_p25_zcore, data=outcomes)
```

# Table of contents

1. [Prologue](#prologue)

2. [Geographical Variation in Upward Mobility](#geo-vary)

3. [Characteristics of High-Mobility Areas](#characteristics)

4. [Spatial Correlation and Decay](#spatial)

---
class: inverse, center, middle
name: prologue

# Prologue

---
# Prologue

- Today's lecture is a little different than the last few

- We're talking about an application of big data to a big question: why do some people move up the income ladder and others don't?

- This is a big question in economics and public policy

- Chetty and coauthors tackled it using big data and spatial analysis
  - By big I mean: essentially all tax returns in the USA from 1989-2015

- They released summaries of the data publicly in 2018 as the Opportunity Atlas

- These show tons of descriptive measures of income mobility at various levels of geography: state, county, and Census Tract

---
class: white-slide

<img src="pics/fading_dream.png" width="90%" style="display: block; margin: auto;" />
Source: [Chetty et al. (2014)](https://opportunityinsights.org/paper/recentintergenerationalmobility/)

---
# Why is the "American Dream" Fading?

- Why are children's chances of climbing the income ladder falling in the USA?
  
  - What can be done to reverse this trend?

- Need to go beyond macroeconomic data to answer this question. Why?
  
  - Too many changes happening over time and across space to separate out the causal factors.

- Also: only a handful of data points (classic macro problem)

---
# Enter the Opportunity Atlas

- Created in 2018, the Opportunity Atlas offers one measure of how income mobility differs by location in the USA
  
  - If some areas have more mobility than others, can we learn why and apply those lessons elsewhere?

- Data sources: 
  - Anonymized Census data (2000, 2010 ACS) covering U.S. population
  - Federal income tax returns from 1989-2015.

- Method: Link parents based on dependent claiming on tax returns

- Target sample: Children born between 1978-1983 (U.S. citizens and authorized immigrants who arrived as children)

There's bound to be some mess with this much data, so they create an analysis sample

- **Analysis sample**: 20.5 million children, 96% coverage of target sample

---
# Toolkit to use these data

- Data cleaning and wrangling

- Data visualization

- Spatial analysis

- Regression analysis

---
# Incomes in Tax Data

- Parent household incomes: average income reported on Form 1040 tax return from 1994-2000

- Children's incomes measured from tax returns in 2014-15 (ages 31-37)

- But income levels differ over time! How do we compare them?

- Use percentile ranks in the *national* distribution
  - Rank children relative to others born in same year and parents relative to other parents

- [What is a percentile?](https://www.mentimeter.com/app/presentation/blkhnym4ou7ejzod9b1gc6b24id49nr5/xeit9p23asph)

- **Income percentile**: The fraction of the national income distribution that a person's income exceeds

- Take average income percentile of children by parental income percentile
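
A minimal sketch of the ranking step, using simulated incomes rather than the tax data. The data frame `kids`, its columns, and the income-generating process are all made up for illustration, and `percent_rank()` from dplyr is just one reasonable way to approximate a percentile rank.

``` r
library(dplyr)

set.seed(368)
n_kids <- 5000

# Hypothetical data: birth cohort, parent income, and child income
kids <- data.frame(
  cohort        = sample(1978:1983, n_kids, replace = TRUE),
  parent_income = exp(rnorm(n_kids, mean = 10.5, sd = 0.8))
)
kids$child_income <- exp(0.3 * log(kids$parent_income) +
                         rnorm(n_kids, mean = 7.5, sd = 0.7))

ranked <- kids %>%
  # Parents are ranked against all other parents
  mutate(parent_rank = 100 * percent_rank(parent_income)) %>%
  # Children are ranked against children born in the same year
  group_by(cohort) %>%
  mutate(child_rank = 100 * percent_rank(child_income)) %>%
  ungroup()

# Average child rank within each parent percentile bin (1 to 100)
by_parent <- ranked %>%
  mutate(parent_pctile = pmin(floor(parent_rank) + 1, 100)) %>%
  group_by(parent_pctile) %>%
  summarise(mean_child_rank = mean(child_rank))

head(by_parent)
```

Plotting `mean_child_rank` against `parent_pctile` gives the same kind of picture as the figure on the next slide, only with fake data.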

---
# Average Child Income Percentile by Parent Income Percentile

<img src="09-oppatlas_files/figure-html/oppatlas-1.svg" style="display: block; margin: auto;" />
Source: [The Opportunity Atlas](https://opportunityinsights.org/data/)

---
class: inverse, center, middle
name: geo-vary
# Geographic Variation in Upward Mobility

---
# What is mobility for a given area?

- Regress child income rank on parent income rank separately for each Census tract, county, or commuting zone in the USA<sup>1</sup>
  - Census tracts are small geographic areas that contain 1,200-8,000 people

- For simplicity, Chetty et al. (2018) report the average income percentile of children whose parents were at the 25th percentile of the national income distribution

- This is a single measure of upward mobility that is easy to understand and compare across areas
  - It is not the only measure, but it is a good one

- **Big data tip**: Sensible summary statistics make big data more useful
  - The right statistic depends on the question you're asking

- [Where do you think has the lowest upward mobility? The highest?](https://www.mentimeter.com/app/presentation/blkhnym4ou7ejzod9b1gc6b24id49nr5/9zp5t7zs6ot5)

.footnote[<sup>1</sup> Technical detail: Weight each child by fraction of childhood (up to 23) in a given area to account for movement across areas during childhood]
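
To make the area-level measure concrete, here is a rough sketch on simulated data. The three areas, their intercepts and slopes, and the column names are hypothetical. Base R's `lm()` stands in for whatever estimation the Opportunity Atlas team actually uses, and the sketch ignores the childhood-exposure weights mentioned in the footnote.

``` r
set.seed(368)

# Hypothetical areas with different rank-rank relationships
areas <- data.frame(
  area      = c("A", "B", "C"),
  intercept = c(30, 38, 45),
  slope     = c(0.40, 0.30, 0.20)
)

# Simulate 2,000 children per area (child ranks are not clipped to 0-100, for simplicity)
sim <- do.call(rbind, lapply(seq_len(nrow(areas)), function(i) {
  parent_rank <- runif(2000, min = 0, max = 100)
  child_rank  <- areas$intercept[i] + areas$slope[i] * parent_rank +
    rnorm(2000, sd = 15)
  data.frame(area = areas$area[i], parent_rank = parent_rank, child_rank = child_rank)
}))

# Regress child rank on parent rank within each area,
# then predict the child rank for parents at the 25th percentile
mobility <- sapply(split(sim, sim$area), function(d) {
  fit <- lm(child_rank ~ parent_rank, data = d)
  unname(predict(fit, newdata = data.frame(parent_rank = 25)))
})

round(mobility, 1)
```

Each entry of `mobility` is one number per area, which is what makes the measure easy to map and compare.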

---
<img src="pics/op_atlas_map.png" width="90%" style="display: block; margin: auto;" />

*Note: Blue = More Upward Mobility, Red = Less Upward Mobility*
Source: [The Opportunity Atlas](https://opportunityinsights.org/data/)

---
# All that data and still limitations?

- They worked with the near universe of tax returns in the USA from 1989-2015

- Yet, they still have limitations

- What are a few?

- Underscores a key point: data limitations are a fact of life no matter how much data you have

- You are always simplifying the world to make it fit into data

- We use models to make sense of what those limitations are

- Even if you do not think you are using a model, you are

---
# What model? I didn't write one

- A model is a simplification of the world

- It outlines the variables that you assume are systematically related to each other

- e.g. Chetty et al. use tax data to measure income mobility
  - Unreported income is not included 
  - Do you think unreported income is systematically underreported for some groups? In some areas? 
  - To some extent this can be tested

- Can anyone think of examples of places where hidden models are used to interpret data?

- Economic wellbeing summarized by income percentile

- GDP per capita as an indicator of economic development

- More lead paint in old buildings `$\Rightarrow$` Pre-1950s housing proxies for lead exposure

- New construction is slow, so pre-1950s housing measured today likely holds for the past

---
# Inferences about today

Chetty et al. extrapolate from cohorts born in the 80s to make inferences about today
- Assumption: mobility is not changing systematically over time, although the estimates may become less precise
- They test this with the "correlation" between mobility measures for cohorts born earlier

---
# Sometimes more data is the answer

- After a few years, the Opp Atlas team (Chetty, Dobbie, Goldman, Porter, and Yang 2024) updated with more data

- They repeat the same analysis for cohorts born between 1978 and 1992<sup>2</sup>
  - Measure adulthood income from 2005 to 2019

- Able to look at how mobility has changed over time by cohort, location, other demographics

- Indications that mobility has increased over time for some, but not all groups

.footnote[<sup>2</sup> I'm in the data!]

---
# Mobility trends shift over time

---
# Racial differences

---
class: inverse, center, middle
name: characteristics
# Characteristics of High-Mobility Areas

---
# Why does upward mobility differ?

Armed with a summary measure of upward mobility, we can ask:

- Why do some areas have more upward mobility than others?

- Spatial and correlational analysis is a good place to start

- What are potential characteristics of high mobility areas?
  - Better jobs?
  - Better schools? 
  - Institutional differences?
  - Culture?

---
# Upward Mobility vs. Job Growth

---
# Actual correlates

1. Segregation: Greater racial and income segregation associated with lower levels of mobility

2. Income Inequality: Places with smaller middle class have less mobility

3. School Quality: Higher expenditure, smaller classes, higher test scores correlated with more mobility

4. Family Structure: 
  - Areas with more single parents have lower mobility
  - Strong correlation even for kids whose *own* parents are married
  - This result is puzzling and the focus of much recent research, some of it reported on controversially

5. Social Capital
  - It takes a village to raise a child
  - Chetty et al. (2023) leveraged Facebook Data to create the Social Capital Atlas

---
# Why do we care about correlation?

- We all know correlation is not causation

- We'll discuss this in-depth after break if you don't believe me

- So why are we talking about correlation at all?

- One of the first steps in understanding a complex system is to understand how variables are related

- This is especially true when we have a lot of data

- Plus, almost every causal relationship is just a correlation with a story
  - Story might be: I ran an experiment and found a correlation with a randomly assigned treatment
  - But the story might be: I assume some natural variation in the data is like a random assignment and I found a correlation

---
class: inverse, center, middle
name: spatial

# Spatial Correlation and Decay

---
# Big question: why don't people move?

- If some areas have more mobility than others, why don't people move to those areas?

- Is it rent?

---
# The Price of Opportunity in Seattle
Upward Mobility vs Median Rent by Neighborhood

---
# Big question: why don't people move?

- Initial experiments indicate benefits exist from moving (we'll see later)

- If some areas have more mobility than others, why don't people move to those areas?

- Is it rent?

- Other costs of moving?

- Maybe they do not want to move that far?

- Overall, relying on people to move is not a highly effective approach

---
# Well what if we invest locally?

- What if we invest in the areas that have low mobility? (place-based approach)

- Would there be spillovers between locations? 
  - It is tough to improve one neighborhood (e.g. a tract), let alone many at once
  - Do we have to improve them all at once to help people?

- The answer to this question changes the policy approach

---
# Spatial decay suggests localized effects

---

# Overall Takeaways

- Correlational evidence is suggestive, but not causal

- Causality requires a more focused approach

- We will build this toolkit in the next few lectures

---
class: inverse, center, middle
name: appendix

# Appendix

---
# How to calculate a correlation

- Quick review: what is a correlation?

- Mathematically:

$$ \text{Correlation} = \frac{\text{Cov}(X,Y)}{\text{SD}(X)\text{SD}(Y)} $$

- Intuitively, what is it?

- It is a measure of how two variables move together, normalized to be between -1 and 1

- What are some ways to calculate a correlation in R?

``` r
corr <- cor(outcomes$kfr_p25, outcomes$kfr_p75)
print(paste("This correlation between 25th and 75th percentile mobility is:", corr))
```

```
## [1] "This correlation between 25th and 75th percentile mobility is: -0.684480995406698"
```
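
As a sanity check on the formula, the correlation can also be built by hand. The vectors `x` and `y` below are made-up numbers, not the Opportunity Atlas outcomes.

``` r
# Hypothetical data for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(9, 7, 8, 3, 1)

# Sample covariance; sd() also divides by n - 1, so the choice of
# n versus n - 1 cancels out in the ratio
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
manual_corr <- cov_xy / (sd(x) * sd(y))

# Compare with the built-in function
all.equal(manual_corr, cor(x, y))
```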

---
# Z-score to get correlation with regression

- One handy way to calculate a correlation is to use regression, exploiting the formula for the coefficient

In a regression, the coefficient on `$X$`:

$$ \beta = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} $$

$$ \text{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$

$$\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 $$

- If we subtract the mean and divide by the standard deviation (that is, "z-score" both variables), then `$\text{Var}(X)=1$` and the regression coefficient equals the correlation

- Why do this? 
  - It is easy to interpret a coefficient
  - It is easy to get a confidence interval
  - It is easy to control for other variables
  - Great way to normalize wildly different variables
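
A small sketch of these payoffs on simulated data. The variables `x`, `w`, and `y` are hypothetical, and base R's `lm()` is used here only to keep the example free of extra packages.

``` r
set.seed(368)
n_obs <- 200

# Hypothetical variables on very different scales
x <- rnorm(n_obs, mean = 50000, sd = 12000)  # an income-like variable
w <- rnorm(n_obs)                            # a hypothetical control
y <- 0.00002 * x + 0.5 * w + rnorm(n_obs)

# z-score: subtract the mean, divide by the standard deviation
z <- function(v) (v - mean(v)) / sd(v)
d <- data.frame(zx = z(x), zy = z(y), zw = z(w))

fit <- lm(zy ~ zx, data = d)
coef(fit)[["zx"]]      # slope on the z-scored variable
cor(x, y)              # the same number
confint(fit)["zx", ]   # 95% confidence interval for the slope

# Adding a control is one more term in the formula
fit_ctrl <- lm(zy ~ zx + zw, data = d)
coef(fit_ctrl)
```

Once a control is added, the coefficient on `zx` is no longer the simple correlation. It is the association in standard-deviation units, holding `zw` fixed.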

---
# Let's test it out

``` r
# library(fixest); library(broom) # already loaded
# z-score both variables: subtract the mean, then divide by the standard deviation
# I renamed kfr_pooled_pooled_p25 to kfr_p25 and kfr_pooled_pooled_p75 to kfr_p75
outcomes <- mutate(outcomes, 
  kfr_p25_zcore=(kfr_p25-mean(kfr_p25))/sd(kfr_p25), 
  kfr_p75_zcore=(kfr_p75-mean(kfr_p75))/sd(kfr_p75))

results <- feols(kfr_p75_zcore ~ kfr_p25_zcore, data=outcomes)
# Gap between the regression slope and `corr` from cor() on the earlier slide
diff <- round(results$coefficients[['kfr_p25_zcore']] - corr,16)
etable(results) %>% # table it
  kable() %>% # make it prettier
  print() # print it
print(paste("Correlation function and z-scored regression approach are within",diff,"of each other"))
```

---
# Correlation vs. Regression

```
## 
## 
## |                |results            |
## |:---------------|:------------------|
## |Dependent Var.: |kfr_p75_zcore      |
## |                |                   |
## |Constant        |3.08e-14 (0.1953)  |
## |kfr_p25_zcore   |-0.6845** (0.2022) |
## |_______________ |__________________ |
## |S.E. type       |IID                |
## |Observations    |15                 |
## |R2              |0.46851            |
## |Adj. R2         |0.42763            |
```

```
## [1] "Correlation function and z-scored regression approach are within 1e-16 of each other"
```