Data Science for Economists

class: center, middle, inverse, title-slide

.title[
# Data Science for Economists
]
.subtitle[
## Lecture 1: Introduction
]
.author[
### Kyle Coombs (he/him/his)
]
.date[
### Bates College | <a href="https://github.com/big-data-and-economics">EC/DCS 368</a>
]

---

# Table of contents

1. [Prologue](#prologue)

2. [What is data science?](#what_is_data_science)

3. [Syllabus highlights](#syllabus)

4. [Getting started](#started)

5. [R for data science](#r4ds)

6. [Data visualization with ggplot2](#ggplot2)

---
class: inverse, center, middle
name: prologue

# Prologue

---
# Introductions

### Course

You'll soon receive access to this GitHub organization, where we submit assignments, upload presentations, etc.

### Me

---
# Why this course?

Fill in the gaps left by traditional econometrics and methods classes.

- Practical skills and tools that will benefit your thesis and future career.
- Neglected skills like how to actually find datasets in the wild and clean them.
- Apply skills to analyze empirical questions on economic and social problems.

Data science skills are largely distinct from (and complementary to) the core 'metrics familiar to economists.

- Acquiring data; scraping; maintaining databases; etc. 
- Data viz, cleaning and wrangling; programming; cloud computation; relational databases; machine learning; etc.

> *"In short, we will cover things that I wish someone had taught me when I was starting out in college."*

---

# Caveat

- This course will be **_hard._** You will need to:
  - Teach yourself new skills I cannot cover in 12 weeks
  - Be entrepreneurial: If you find a better way to do something, do (and share) it!
  - Be patient: You will encounter bugs and errors, and you will need to learn how to fix them

- This course will also be **_rewarding_**
  - You can avoid the mistakes you make here on your thesis and in your career
  - You will learn skills that employers, pre-doc programs, and grad schools want
  - You will learn how to be a better researcher and citizen
    - Seriously, a little data science goes a long way in helping you see through BS

---
class: inverse, center, middle
name: what_is_data_science

# What is Data Science?

---
# What is Data Science?

- .hi[Data science (DS):] The scientific discipline that deals with transforming data into useful information ("insights") using a variety of stats/ML techniques

- Facebook: Collects data on search history, friendship links, site clicks, occupation, etc.

- Chetty et al. (2022) used FB data to estimate SES and social network ([Social Capital Atlas](https://www.socialcapital.org/))

- The rise of data science has come because of the so-called "Big" Data revolution

- The rise of the internet in the late-1990s and 2000s `$\Rightarrow \,\uparrow$` opportunities for companies and governments to collect data on consumers & citizens

- Spread of mobile devices & social media from late 2000s until now generated even more data
    
#### Pillars of data science

- Programming (automation of data collection, manipulation, cleaning, visualization, and modeling)
- Visualization & exploration
- Causal inference (to be able to make a policy prescription)
- Machine learning (to select models, compress data, predict outcomes)

...Assuming one has the appropriate foundation of basic calculus and statistics

---
# Data are weird!

---
# The data science workflow

.center[
<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="85%" style="display: block; margin: auto;" />

Source: [R for Data Science](http://r4ds.had.co.nz/introduction.html)
]

---
# "Big" Data
.center[
<img src="pics/frisch.jpg" width="90%" style="display: block; margin: auto;" />
]

Source: Frisch, Ragnar. 1933. "Editor's Note" _Econometrica_ 1(1): 1-4

---
class: inverse, center, middle
name: syllabus

# Syllabus highlights

(Read the full document [here](https://github.com/big-data-and-economics/syllabus/blob/main/syllabus.pdf).)

---
# Course organization

This course is run out of a GitHub organization, [Data Science for Economists](https://github.com/big-data-and-economics/).

Relevant repositories:

- [Course materials](https://github.com/big-data-and-economics/big-data-class-materials): has syllabus, lectures, grade breakdown, etc.
- [Presentations](https://github.com/big-data-and-economics/presentations-winter2025): shared repository where you will submit your presentations
- [Exercises](https://github.com/big-data-and-economics/exercises): has in-class exercises -- you'll "fork" this and work in your own version
- [Discussion](https://github.com/orgs/big-data-and-economics/discussions): for asking and answering questions
- Problem set repositories: specific repositories for problem sets

---
# Jargon

- This class uses jargon that won't make sense at first. I'll try to flag it as it comes up

- Here are a few terms:

.hi[Local machine:] Your personal (or any) computer that isn't a server accessed via the internet

.hi[Version Control:] Keep track of different iterations of a project/code

.hi[Repository:] The location on GitHub of all project files and (commented) file revision history

.hi[GUI:] A Graphical User Interface -- where you point-and-click to do things, like RStudio

.hi[Command line:] Removes the "graphical" from GUI. Instead, you type all commands to navigate a computer and execute programs

- R operates via the command line; RStudio is a GUI
  - On Mac, the command line app is called Terminal, which uses Unix shell commands (zsh by default on recent macOS; Bash on older versions)
  - If you installed Git for Windows, you got *Git Bash*, which uses Bash (Linux) commands
  - You can also install Windows Subsystem for Linux to run Linux on a Windows machine

.hi[Script:] A file that contains code that can be executed by a computer, e.g. `.R`, `.Rmd`, `.do`, `.py`, etc.

---
# Like any coding language, R is hard

<div class="figure" style="text-align: center">
<img src="pics/r-challenges-lawlor-et-al-2022.jpg" alt="Taken from Lawlor et al. (2022)" width="55%" />
<p class="caption">Taken from Lawlor et al. (2022)</p>
</div>

---
# Tips for success in this class<sup>1</sup>

- **Start early:** This is a fast-paced class; you will need to start early to get help when you need it

- **Ask and answer questions (respectfully) on GitHub Discussions:** 
  - I will direct all emails about non-personal matters to GitHub Discussions
  - Perhaps a classmate can answer (ideal) OR I can answer once for everyone (also valuable)

- **Try to understand the task before you code:** 
  - This is not a class where you can just copy code from the internet and paste it in. 
  - This is a class where you only know whether you have the right answer _if_ you understand the task

- **Try stuff and see if it worked:** Check if the code does what you want before you submit

- **Comment your code:**
  - Helps you remember what you were doing
  - Helps me understand what you were trying (like showing your work in a math class)

- **Let me know about bugs/errors early**: Technical demos break. Flagging them helps me, your peers, and future cohorts.

- **Listen and try things as I demonstrate them**: If you need me to slow down, ask

---
# Stuck while coding?

You're gonna write a lot of code for this class, which means you're gonna troubleshoot a lot of bugs.

- Some of these will be bugs of your own making, some will be bugs of open source tools

- A major learning objective in this class is to learn when to ask for help and how to ask for help

1. Try to describe in plain words/simple pictures what you want code to do before you write it

2. Break this description into smaller steps (1: Read in data, 2: Drop rows with NA County, etc.)

3. Write code "modularly" (in pieces) to do each step, then you can troubleshoot piece-by-piece

4. The `help` documentation for R functions is the best place to start for troubleshooting

5. [Google is your friend.](http://xkcd.com/627/) Google is your friend. Google is your friend.<sup>2</sup>

6. Search [stackoverflow](http://stackoverflow.com/) and include the `[r]` tag or your package, e.g. `[ggplot2]`, `[data.table]`

7. Ask for help from classmates, our CAT, and me on GitHub Discussions/Issues or OH

- If you ask for help, I will ask if you tried the first six steps. If not, I will ask you to try them first.
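
As a rough sketch of steps 2 and 3, assuming a made-up toy data frame (the `counties` object and its values are hypothetical, not course data), modular code might look like this:

```r
# Toy data standing in for a real file (values are made up for illustration)
counties <- data.frame(
  county = c("A", NA, "B"),
  pop    = c(100, 200, 300)
)

# Step 1: inspect the data before changing anything
str(counties)

# Step 2: drop rows with NA county, saving the result under a new name
counties_clean <- counties[!is.na(counties$county), ]

# Step 3: check that the step did what you wanted before moving on
stopifnot(!anyNA(counties_clean$county))
nrow(counties_clean)
```

Because each step is its own line and the original `counties` is left untouched, you can rerun and inspect any single piece when something breaks.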

.footnote[<sup>1</sup> More in the appendix!

<sup>2</sup>Generative AI ([ChatGPT](https://chat.openai.com/) and [GitHub Copilot](https://github.com/features/copilot)) is a bit of a frenemy.]

---
# Avoid: "Unhelpfully seeking help"

> "If someone has the wit and knowledge to answer your question, they probably have other things they would like to do. Making your message clear, concise and user-friendly gives you the best hope of at least one of those strangers diverting their attention away from their life towards your problem."

> - [The R Inferno Circle 9](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf)

In short, write a minimal reproducible example (MRE) of your bug. See this [stackoverflow thread](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) for a primer.

- We'll come back to MREs soon!
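
A rough sketch of what an MRE might look like, assuming a made-up bug (a misspelled column name) and R's built-in `mtcars` data. The column name `mpg_typo` is deliberately wrong and purely hypothetical:

```r
# 1. Use a small built-in dataset so anyone can run the example
df <- head(mtcars[, c("mpg", "cyl")], 3)

# 2. dput() prints code that recreates df exactly; paste that into your post
dput(df)

# 3. The smallest piece of code that triggers the problem
#    (mpg_typo is not a column, so df$mpg_typo is NULL)
result <- mean(df$mpg_typo)  # warns and returns NA instead of a number
result

# 4. Report your R version and loaded packages
sessionInfo()
```

The point is that a stranger can copy, paste, and see the same problem without needing your files.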

Lots of helpful (and a little rude) guides in the [University of British Columbia Stat 545 course](https://stat545.stat.ubc.ca/notes/notes-a01/)

---
# Class outline

### Three units:

#### 1. Data science basics
.pull-left[
- Version control with Git and GitHub
- R language basics
- Data cleaning and wrangling
]
.pull-right[
- Data acquisition and loading
- Data visualization
- Spatial analysis
]

#### 2. Causal Inference
.pull-left[
- Regression analysis
- Regression discontinuity design
]
.pull-right[
- Panel data and fixed effects
- Difference-in-differences design
]

#### 3. Scaling up: Big data, ML, and cloud computation
.pull-left[
- Functions and iteration
- Parallel programming
]
.pull-right[
- Machine Learning techniques
- Text analysis
]

---
class: inverse, center, middle
name: started

# Getting started

---
# Software installation and registration

1. Download [R](https://www.r-project.org/).

2. Download [RStudio](https://www.rstudio.com/products/rstudio/download/preview/).

3. Download [Git](https://git-scm.com/downloads).

4. Create an account on [GitHub](https://github.com/) and register for a student/educator [discount](https://education.github.com/discount_requests/new).
  - We will use GitHub to distribute and submit assignments, give and receive feedback, etc.

5. Make a folder on your computer for this class. Any and all repositories for this class should be cloned into this folder.

If you had trouble completing any of these steps, please raise your hand.
- My go-to place for installation guidance and troubleshooting is Jenny Bryan's http://happygitwithr.com.

---
# Some OS-specific extras

I'll detail further software requirements as and when the need arises. However, to help smooth some software installation issues further down the road, please also do the following (depending on your OS):

- **Windows:** I recommend that you install [Rtools](https://cran.r-project.org/bin/windows/Rtools/), [Chocolatey](https://chocolatey.org/) and [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install).
- **Mac:** Install [Homebrew](https://brew.sh/). I also recommend that you configure/open your C++ toolchain (see [here](https://github.com/rmacoslib/r-macos-rtools#installer-package-for-macos-r-toolchain-).)
- **Linux:** None (you should be good to go).

---
# Checklist

☑ Do you have the most recent version of R?
  
  ``` r
  version$version.string
  ```
  
  ```
  ## [1] "R version 4.4.2 (2024-10-31 ucrt)"
  ```
☑ Do you have the most recent version of RStudio?
  
  ``` r
  RStudio.Version()$version
  ## Requires an interactive session but should return something like "[1] ‘2024.12.0.46’"
  ```

☑ Have you updated all of your R packages? 
  
  ```r
  update.packages(ask = FALSE, checkBuilt = TRUE)
  ```

---
# Checklist (cont.)

Navigate to the Terminal tab (next to the Console tab) in RStudio.

☑ Which version of Git have you installed?
  
  ``` bash
  git --version
  ```
  
  ```
  ## git version 2.34.1
  ```

☑ Did you introduce yourself to Git? (Substitute in your details.)
  
  ```bash
  git config --global user.name 'kgcsport'
  git config --global user.email 'kcoombs@bates.edu'
  git config --global --list
  ```

☑ Did you register an account in GitHub?

---
# Checklist (cont.)

Open up your computers and navigate to the in-class exercise repository: https://github.com/big-data-and-economics/exercises

1. Fork the repository to your own GitHub account in the upper-right corner of the page.
2. Navigate to `Settings > Collaborators` (shown as `Collaborators and teams` in organization-owned repositories)
3. Click `Add people` to add my username (@kgcsport) to your forked repository
4. Click on `01-intro-to-r` and download the `intro-to-r.R` script to a folder for this class on your computer.
5. This is the in-class exercise for today. Submission instructions are in the folder

We will make sure that everything is working properly with your R and GitHub setup next lecture.

For the rest of today's lecture, I want to go over some very basic R concepts.

Open `intro-to-r.R` and fill in the code under each comment (marked `#`) as we go.

PS — Just so you know where we're headed: We'll return to these R concepts (and delve much deeper) after a brief, but important detour to the lands of coding best practices and Git(Hub).

---
class: inverse, center, middle
name: r4ds

# R for data science
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
name: whyr
# Why R and RStudio?

### Data science positivism

- Alongside Python, R has become the *de facto* language for data science.
  - See: [*The Impressive Growth of R*](https://stackoverflow.blog/2017/10/10/impressive-growth-r/), [*The Popularity of Data Science Software*](http://r4stats.com/articles/popularity/)
- Open-source (free!) with a global user-base spanning academia and industry.
  - "Do you want to be a profit source or a cost center?"

### Bridge to applied economics and other tools

- Already has all of the statistics and econometrics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs.
- The RStudio IDE and ecosystem allow for further, seamless integration.

### Path dependency

- It's also the language that I know best.
- (Learning multiple languages is a good idea, though.)

---
# Do `$\uparrow$` in GDP cause life expectancy to `$\uparrow$`?

- Let's use R to try to look at a key question in economics:
  - Does increasing the economic pie (GDP) lead to longer lives (life expectancy)?

- We can use the gapminder dataset to explore this question
- The [gapminder](https://github.com/jennybc/gapminder) dataset contains panel data on life expectancy, population size, and GDP per capita for 142 countries since the 1950s
- Any predictions about what we'll learn?

---
class: inverse, center, middle
name: ggplot2

# ggplot2
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Install and load

For the remainder of this lecture, we're going to play around with [ggplot2](https://ggplot2.tidyverse.org/) to explore our question about GDP and life expectancy.

If you don't have them already, install the `ggplot2` and `gapminder` packages via either:
- **Console:** Enter `install.packages(c("ggplot2", "gapminder"), dependencies = TRUE)`.
- **RStudio:** Click the "Packages" tab in the bottom-right window pane. Then click "Install" and search for these two packages.
  
![](pics/install.gif)

---
# Install and load (cont.)

Once the packages are installed, load them into your R session with the `library()` function.

``` r
library(ggplot2)
library(gapminder) ## We're just using this package for the gapminder data
```

Notice too that you don't need quotes around the package names any more. Reason: `library()` reads an unquoted name as the name of the package to load (quotes also work).

PS: A convenient way to combine the package installation and loading steps is with the [pacman package's](https://github.com/trinker/pacman) `p_load()` function. If you run `pacman::p_load(ggplot2, gapminder)` it will first look to see whether it needs to install either package before loading them. Clever.
- We'll get to this later, but if you want to run a function from an (installed) package without loading it, you can use the `PACKAGE::package_function()` syntax.
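
A small sketch of that `::` syntax, using the `stats` and `utils` packages that ship with R so it runs without installing anything (the vector `x` is made up for illustration):

```r
x <- c(1, 2, 10)

# Call median() from the stats package explicitly
stats::median(x)

# Call head() from the utils package explicitly
utils::head(letters, 3)
```

Spelling out the package this way is most useful when two loaded packages define functions with the same name.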

---
# Exploratory analysis

- What initial exploratory questions do we have? Shout them out!

- Here's what I want to know:

1. How are the data organized?
2. What are the unique countries in the dataset?
3. What does the relationship between GDP per capita and life expectancy look like?

- We'll tackle the third today since it is the most visually interesting
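
For reference, a rough sketch of how the first two questions might be checked, assuming the `gapminder` package is installed:

```r
library(gapminder)

# 1. How are the data organized? One row per country-year
str(gapminder)
dim(gapminder)

# 2. What are the unique countries in the dataset?
length(unique(gapminder$country))
head(levels(gapminder$country))
```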

---
# ChatGPT example

I need a volunteer. Please type a ChatGPT prompt into my computer that:

>  Writes R code that loads the gapminder data, displays some key checks, then makes a scatterplot of GDP per capita and life expectancy

**Potential prompt:**

Write an R script that does the following:

(a) Reads in the gapminder dataset using the library gapminder.

(b) Prints out the first 10 rows of the gapminder dataset.

(c) Lists the unique countries in the gapminder dataset.

(d) Plots a scatterplot of the gdpPercap by lifeExp using the gapminder dataset using ggplot.

---
# ChatGPT example (continued)

**Assistant:**

Certainly! Here's an R script that accomplishes these tasks using the gapminder dataset:

.scroll-box-8[

```r
# Load necessary libraries
library(gapminder)
library(ggplot2)

# Step (a) - Read in the gapminder dataset
data("gapminder")
gapminder_df <- gapminder

# Step (b) - Print first 10 rows
print(head(gapminder_df, 10))

# Step (c) - List unique countries
unique_countries <- unique(gapminder_df$country)
print(unique_countries)

# Step (d) - Plot scatterplot of gdpPercap by lifeExp
ggplot(gapminder_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(x = "GDP per Capita", y = "Life Expectancy") +
  ggtitle("Relationship between GDP per Capita and Life Expectancy")
```
]

This script assumes that the gapminder package is installed and loaded. It reads in the gapminder dataset, prints the first 10 rows, lists unique countries present in the dataset, and then creates a scatterplot of gdpPercap (GDP per Capita) against lifeExp (Life Expectancy) using ggplot2. Adjust the plot aesthetics as needed to suit your preferences.
***

Exported on 1/8/2024.

Tips on using generative AI for coding:

- [Turing Institute](https://www.turing.ac.uk/blog/using-generative-ai-write-code-guide-researchers)
- [GitHub Copilot in RStudio](https://github.blog/developer-skills/github/how-to-use-github-copilot-in-your-ide-tips-tricks-and-best-practices/)
- [General Copilot tips](https://docs.github.com/en/copilot/using-github-copilot/best-practices-for-using-github-copilot)

---
# How's the code run?

```
## # A tibble: 10 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
```

```
##   [1] Afghanistan              Albania                  Algeria                 
##   [4] Angola                   Argentina                Australia               
##   [7] Austria                  Bahrain                  Bangladesh              
##  [10] Belgium                  Benin                    Bolivia                 
##  [13] Bosnia and Herzegovina   Botswana                 Brazil                  
##  [16] Bulgaria                 Burkina Faso             Burundi                 
##  [19] Cambodia                 Cameroon                 Canada                  
##  [22] Central African Republic Chad                     Chile                   
##  [25] China                    Colombia                 Comoros                 
##  [28] Congo, Dem. Rep.         Congo, Rep.              Costa Rica              
##  [31] Cote d'Ivoire            Croatia                  Cuba                    
##  [34] Czech Republic           Denmark                  Djibouti                
##  [37] Dominican Republic       Ecuador                  Egypt                   
##  [40] El Salvador              Equatorial Guinea        Eritrea                 
##  [43] Ethiopia                 Finland                  France                  
##  [46] Gabon                    Gambia                   Germany                 
##  [49] Ghana                    Greece                   Guatemala               
##  [52] Guinea                   Guinea-Bissau            Haiti                   
##  [55] Honduras                 Hong Kong, China         Hungary                 
##  [58] Iceland                  India                    Indonesia               
##  [61] Iran                     Iraq                     Ireland                 
##  [64] Israel                   Italy                    Jamaica                 
##  [67] Japan                    Jordan                   Kenya                   
##  [70] Korea, Dem. Rep.         Korea, Rep.              Kuwait                  
##  [73] Lebanon                  Lesotho                  Liberia                 
##  [76] Libya                    Madagascar               Malawi                  
##  [79] Malaysia                 Mali                     Mauritania              
##  [82] Mauritius                Mexico                   Mongolia                
##  [85] Montenegro               Morocco                  Mozambique              
##  [88] Myanmar                  Namibia                  Nepal                   
##  [91] Netherlands              New Zealand              Nicaragua               
##  [94] Niger                    Nigeria                  Norway                  
##  [97] Oman                     Pakistan                 Panama                  
## [100] Paraguay                 Peru                     Philippines             
## [103] Poland                   Portugal                 Puerto Rico             
## [106] Reunion                  Romania                  Rwanda                  
## [109] Sao Tome and Principe    Saudi Arabia             Senegal                 
## [112] Serbia                   Sierra Leone             Singapore               
## [115] Slovak Republic          Slovenia                 Somalia                 
## [118] South Africa             Spain                    Sri Lanka               
## [121] Sudan                    Swaziland                Sweden                  
## [124] Switzerland              Syria                    Taiwan                  
## [127] Tanzania                 Thailand                 Togo                    
## [130] Trinidad and Tobago      Tunisia                  Turkey                  
## [133] Uganda                   United Kingdom           United States           
## [136] Uruguay                  Venezuela                Vietnam                 
## [139] West Bank and Gaza       Yemen, Rep.              Zambia                  
## [142] Zimbabwe                
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
```

---
# How's the code run? (cont.)
<img src="01-Intro_files/figure-html/gapminder-chatgpt-eval2-1.png" style="display: block; margin: auto;" />

---
# Elements of ggplot2

[Hadley Wickham's](http://hadley.nz/) ggplot2 is one of the most popular packages in the entire R canon. 
- It also happens to be built upon some deep visualization theory: i.e. Leland Wilkinson's [*The Grammar of Graphics*](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448).

There's a lot to say about ggplot2's implementation of this "grammar of graphics" approach, but the three key elements are:

1. Your plot ("the visualization") is linked to your variables ("the data") through various **aesthetic mappings**.

2. Once the aesthetic mappings are defined, you can represent your data in different ways by choosing different **geoms** (i.e. "geometric objects" like points, lines or bars).

3. You build your plot in **layers**.

<br>

That's kind of abstract. Let's break down the elements of ggplot2 in turn with some actual plots.

- As a shortcut, we'll use AI to write the basic code for us then we'll fill in the blanks.

---
# 1. Aesthetic mappings

```r
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()
```

<img src="01-Intro_files/figure-html/aesthetics0-1.png" style="display: block; margin: auto;" />
---
# 1. Aesthetic mappings (cont.)

```r
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()
```

Focus on the top line, which contains the initialising `ggplot()` function call. This function accepts various arguments, including:
- Where the data come from (i.e. `data = gapminder`).
- What the aesthetic mappings are (i.e. `mapping = aes(x = gdpPercap, y = lifeExp)`).

The aesthetic mappings here are pretty simple: They just define an x-axis (GDP per capita) and a y-axis (life expectancy).
- To get a sense of the power and flexibility that comes with this approach, however, consider what happens if we add more aesthetics to the plot call...