
class: center, middle, inverse, title-slide

.title[
# Data Science for Economists
]
.subtitle[
## Lecture 2: Clean Code
]
.author[
### Kyle Coombs (adapted from Tyler Ransom + Scott Cunningham)
]
.date[
### Bates College | <a href="https://github.com/big-data-and-economics">EC/DCS 368</a>
]

---

# Table of contents

1. [Prologue](#prologue)

2. [Clean Code](#clean_code)
  - [Automation](#automation)
  - [Version Control](https://raw.githack.com/big-data-and-economics/big-data-class-materials/main/lectures/02-git/02-Git.html#1)
  - [Organization of data and software files](#organization)
  - [Abstraction](#abstraction)
  - [Documentation](#documentation)
  - [Time / task management](#time-task)
  - [Test-driven development (unit testing, profiling, refactoring)](#test-driven)
  - [Pair programming](#pair-programming)

3. [Appendix](#appendix)

---
class: inverse, center, middle
name: prologue

# Prologue

<html><div style='float:right'></div><hr color='#EB811B' size=1px width=796px></html>
<div align="center">
<img src="pics/code_quality.png">
</div>
Source: [xkcd](http://xkcd.com/1513/)

---

# Attribution

- Today's material comes from these sources:

1. [Clean Code](https://raw.githack.com/OU-PhD-Econometrics/fall-2022/master/LectureNotes/01a-CleanCode/01aslides.html) by Tyler Ransom

2. [*Code and Data for the Social Sciences: A Practitioner's Guide*](https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf), by Gentzkow and Shapiro

3. [Causal Inference and Research Design](https://github.com/scunning1975/mixtape) by Scott Cunningham

4. [Jenny Bryan's UseR 2018 keynote address](https://www.youtube.com/watch?v=7oyiPBjLAWY)

Also a small contribution from [here](https://garywoodfine.com/what-is-clean-code/) and other sundry internet pages

---
# Reducing empirical chaos

## Sad story

- Once upon a time there was a boy who was writing a job market paper on unemployment insurance during the pandemic

- This boy presented the findings a half dozen times, spoke to the media some, and generally thought he had cool results

- Several people suggested he look at a handful of other outcome series and try changing his analysis unit frequency from monthly to weekly

- He also knew that he needed to restrict his sample to reduce noise

---
# The horror!

- But then after making these changes and re-running his code, which took two days, his new sample dropped by 50 percent!
- He was, understandably, terrified.
- The young boy spent a week looking for the fix, weeding through six different versions of the .do, .R, .dta, .csv, .sh, and .py files with suffixes like *_v1*, *_test*, *_test2*, *_final_I_swear*, and *_okay_i_lied*
- Finally he discovered the phrase:

```r
df %>% filter(insample_new==0)
```
**instead of**

```r
df %>% filter(insample_new==1)
```
- The boy was very frustrated and decided to work on these slides while re-running his code.

- Today and next class are about minimizing these struggles through Clean Code and a reproducible workflow

---
class: inverse, center, middle
name: clean_code

# Clean Code

---
# What is Clean Code?

.hi[Clean Code:] Code that is easy to understand, easy to modify, and hence easy to debug

#### Clean code advances scientific progress

- Good science uses careful observations to iteratively test hypotheses/make predictions

- Scientific progress is impeded if

    - mistaken previous results are erroneously given authority
    - previous hypothesis tests are not reproducible
    - previous methods and results are not transparent

- Thus, for science that involves computer code, clean code is a must

- Reduces "the influence of hidden researcher decisions" (Huntington-Klein et al. 2021)

#### Clean code increases personal/team sanity

- You will always make a mistake while coding -- what makes good programmers great is their ability to identify and correct mistakes

- Clean code makes it easier to identify and correct mistakes

- Saves you stress in the long run and makes your collaborative relationships more pleasant

---
# Why clean code is under-produced

- If clean code is so beneficial and important, why isn't there more of it?

1. .hi[Competitive pressure] to produce research/products as quickly as possible

2. .hi[End user] (journal editor, reviewer, reader, dean) .hi[doesn't care what the code looks like], just that the product works

3. In the moment, clean code .hi[takes longer to produce] while seemingly conferring no benefit

---
# How does one produce clean code?

1. Organization of data and files

2. Version Control<sup>1</sup>

3. Automation

4. Abstraction

5. Documentation

6. Time / task management

7. Test-driven development (unit testing, profiling, refactoring)

8. Pair programming

.footnote[<sup>1</sup> Handled in Git lecture notes.]

---
name: organization
# 1a. File organization

1. Separate directories by function

2. Separate files into inputs and outputs

3. Make directories portable

- To see how professionals do this, check out the source code for R's [dplyr](https://github.com/tidyverse/dplyr) package

- There are separate directories for source code (`/R` and `/src`), documentation (`/man`), code tests (`/tests`), data (`/data`), examples (`/vignettes`), and more

- When you use version control, it forces you to make directories portable (otherwise a collaborator will not be able to run your code)

- Use __relative__ file paths, not absolute file paths

---
# Don't be like this

<html><div style='float:right'></div><hr color='#EB811B' size=1px width=796px></html>
<div align="center">
<img src="pics/documents_folder.png">
</div>
Source: [xkcd](http://xkcd.com/1459/)

---
# What is a working directory?

- All the files on your computer are organized in directories or folders

- When you are running a script, you are working from a particular directory
  - This is *not necessarily* the directory where the script is located
  - Your computer looks for `my_data.csv` in this directory when you execute `read.csv('my_data.csv')`
  - If that file is not in that directory, you will get a `cannot open file 'my_data.csv': No such file or directory` warning followed by a `cannot open the connection` error
  - In **R**, you can see what directory you are in using the `getwd()` function
    - It is also above the console in RStudio
    - You can double click the `.Rproj` file to set the working directory to the root of the project
    - You can also change your working directory using the `setwd()` function (avoid this within scripts)

``` r
getwd()
```

```
## [1] "C:/Users/kgcsp/OneDrive/Documents/Education/Big Data/big-data-class-materials/lectures/02-empirical-workflow"
```

``` r
setwd('lectures/02-empirical-workflow')
```

---
# What is a directory path?

A path defines the location of a file or directory in a file system tree. If I navigate to this file on my computer, its path is:

`C:\Users\kgcsp\OneDrive\Documents\Education\ECON368-DSE\big-data-class-materials\lectures\02-empirical-workflow\02-empirical-workflow.Rmd`<sup>1</sup>

The path lists the folders that chart the route from the .hi[root] to the file
  - .hi[root]: the start of the file system tree (here that is `C:\`)
  - Each folder along the tree is separated by a `\` or `/`

This is called an .hi[absolute path]:
  - It is long, hard to remember, and not portable across computers

.hi[Relative paths] solve a lot of this:
  - The path to a file or directory starting from the current working directory
  - If my working directory is `/big-data-class-materials`, then I can write `lectures/02-empirical-workflow/02-empirical-workflow.Rmd`
  - **This is portable**: if you have a copy of the `big-data-class-materials` repository, this script will work
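
A minimal sketch of the difference, using the path from this slide. The object names are illustrative only, and `file.exists()` simply reports whether R can find the file from wherever the working directory currently is:

```r
# Relative path: resolved against the current working directory
relative_path <- file.path("lectures", "02-empirical-workflow",
                           "02-empirical-workflow.Rmd")

# Absolute path: the working directory plus the relative path
absolute_path <- file.path(getwd(), relative_path)

# TRUE only when the working directory is the root of the repository
found_from_here <- file.exists(relative_path)

print(relative_path)
print(found_from_here)
```

The relative path is the same string on every computer; only `getwd()` differs, which is why the relative version travels well between collaborators.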

.footnote[<sup>1</sup> This is a Windows path; Mac and Linux paths use `/` instead of `\`. See appendix for slides on how to move between them using `..`]

---
# How I organize research projects

- Entire projects should *ideally* live within the same directory
- I have a folder called `my_project`
  - Within that folder I have subfolders:
      1. `data` for all data files
          a. `raw` for raw data files
          b. `clean` or `work` for cleaned data files
          c. `temp` for temporary data files
      2. `code` for all code files, and sometimes:
          a. `code/build` for code files that build/clean data
          b. `code/analysis` for code files that do analysis
      3. `output` for all output files
          a. `output/figures` for figures
          b. `output/tables` for tables
      4. `literature` or `articles` for all relevant literature
      5. `writing` for all writing files
          a. `writing/notes` for notes
          b. `writing/drafts` for drafts
          c. `writing/edits` for edits
      6. `presentations` for all presentations
          a. `presentations/slides` for slides
          b. `presentations/notes` for notes
- I'll subdivide further, more or less, as needed
- See [the `my_project` folder](https://github.com/big-data-and-economics/big-data-class-materials/tree/main/lectures/02-empirical-workflow/my_project) on GitHub (in the same folder as this lecture) as an example

---
# What is the value of directories?

- All of the files in a directory are related to each other
- Can reference a file within the `data/raw` folder, from the `code/build` folder without writing out the full path

- If you use `file.path()` or the **here** package, you can automate the slashes in your file paths
  - `file.path('data', 'raw', 'file.csv')` will work on Windows, Mac, and Linux
  - `here::here('data', 'raw', 'file.csv')` will do the same thing (see appendix for more on it)

- Then you do not need to worry about shifting around directories
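
A small sketch of why `file.path()` is safer than gluing strings together by hand. The object names (`raw_dir`, `good_path`, `bad_path`) are made up for illustration:

```r
# Store the directory once, then build every path from it
raw_dir <- file.path("data", "raw")

# file.path() inserts the separator for you
good_path <- file.path(raw_dir, "file.csv")

# paste0() does not, so a forgotten slash silently yields a wrong path
bad_path <- paste0(raw_dir, "file.csv")

print(good_path)  # "data/raw/file.csv"
print(bad_path)   # "data/rawfile.csv"
```

If the `raw` folder is ever renamed, only the `raw_dir` line needs to change.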

---

# 1b. Data organization

- The key idea is to practice .hi[relational database management]

- A relational database consists of many smaller data sets

- Each data set is tabular and has a unique, non-missing key

- Data sets "relate" to each other based on these keys

- You can implement these practices in any modern statistical analysis software (R, Stata, SAS, Python, Julia, SQL, ...)

- Gentzkow & Shapiro recommend not merging data sets until as far into your code pipeline as possible

---
# What problems would this create?

```
##    county state cnty_pop state_pop region_state region_county
## 1   36037    NY  3817735  43320903            1             1
## 2   36038    NY   422999  43320903            1             1
## 3   36039    NY   324920        NA            1             1
## 4   36040  <NA>   143432  43320903            1             1
## 5      NA    NY       NA  43320903            1             1
## 6   37001    VA  3228290   7173000            3             3
## 7   37002    VA   449499   7173000            3             3
## 8   37003    VA   383888   7173000            3             4
## 9   37004    VA   483829   7173000            3             3
## 10     NA    VA       NA   7173000            3             3
```

### Why is relational database management (RDBM) better?

.pull-left[

```
##   county state cnty_pop
## 1  36037    NY  3817735
## 2  36038    NY   422999
## 3  36039    NY   324920
## 4  36040    NY   143432
## 5  37001    VA  3228290
## 6  37002    VA   449499
## 7  37003    VA   383888
## 8  37004    VA   483829
```
]
.pull-right[

```
##   state state_pop region
## 1    NY  43320903      1
## 2    VA   7173000      3
```
]

Source: [Example from Code and Data for the Social Sciences](https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf) (p. 19)
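
As a sketch of how the two tidy tables relate, the code below rebuilds them in base R, checks that each key is unique and non-missing, and merges only at the point where both levels are needed. The object names (`counties`, `states`, `county_state`) are assumptions for illustration:

```r
# County-level table: one row per county, keyed by `county`
counties <- data.frame(
    county   = c(36037, 36038, 36039, 36040, 37001, 37002, 37003, 37004),
    state    = c("NY", "NY", "NY", "NY", "VA", "VA", "VA", "VA"),
    cnty_pop = c(3817735, 422999, 324920, 143432, 3228290, 449499, 383888, 483829)
)

# State-level table: one row per state, keyed by `state`
states <- data.frame(
    state     = c("NY", "VA"),
    state_pop = c(43320903, 7173000),
    region    = c(1, 3)
)

# Check that each key is unique and non-missing before merging
stopifnot(!anyNA(counties$county), anyDuplicated(counties$county) == 0)
stopifnot(!anyNA(states$state), anyDuplicated(states$state) == 0)

# Merge only when an analysis needs both levels
county_state <- merge(counties, states, by = "state", all.x = TRUE)
print(county_state)
```

Because each state-level value is stored exactly once, it cannot disagree with itself across county rows, which is the kind of inconsistency the messy table invites.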

---
name: automation
# 3. Automation

- Gentzkow & Shapiro's two rules for automation:

1. Automate everything that can be automated

2. Write a single script that executes all code from beginning to end

- There are two reasons automation is so important

    - Reproducibility (helps with debugging and revisions)

    - Efficiency (having a code base saves you time in the future)

---
# How to write scripts

#### Keep them modular
- Each script should do one thing and one thing only
- e.g. it takes an input and returns an output
    - Taking in a raw file and returning a cleaned version
    - Taking in two files and merging them
    - Taking in a cleaned file and returning a figure

- This is somewhat aligned with the structure of an essay
    - intro paragraph =/= body paragraph 1 =/= ... =/= conclusion

- Much like essay revisions, modular code makes it easier to debug and revise

#### Have a main script that runs all scripts in order

- A single script that shows the sequence of steps, i.e. "shows your work"
  - This script will run modular scripts in sequence to exactly reproduce your analysis

- You will rarely run it all at once, but it will be a nice way to organize your thoughts

- A benefit of a well-organized directory: easily see what scripts you need to run in what order

- In R, use `source('rscript.R')` to run one script from another

- A main script could be a `.Rmd`, a `.R`, a `.sh`, a `.py`, a `.do` etc.

---
name: main-script
class: white-slide
# Main script

.scroll-output[

``` r
#File: main.R
#By: Kyle Coombs
#What: Runs the project from start to finish in R
#Date: 2024-09-0
# Instructions:
# Run this code from the root directory of your project

#Install packages with housekeeping. Also put together paths.
#housekeeping.R is assumed to define the path strings `build` and `analysis`
source('housekeeping.R')
#User-written functions can be sourced -- or you could write a package, your call
source(file.path(build,'clean_functions.R'))
source(file.path(analysis,'analysis_functions.R'))

#Import files
source(file.path(build,'import_census.R'))
source(file.path(build,'import_admin_data.R'))

#Clean files
source(file.path(build,'clean_census.R'))
source(file.path(build,'clean_admin_data.R'))

#Merge files 1 to 2
source(file.path(build,'merge_census_admin.R'))

#Analysis
source(file.path(analysis,'summary_stats.R'))
source(file.path(analysis,'basic_regression.R'))

#Figures and tables will likely be made with a host of R packages
source(file.path(analysis,'make_sum_figures.R'))
source(file.path(analysis,'make_reg_figures.R'))
source(file.path(analysis,'make_sum_tables.R'))
source(file.path(analysis,'make_reg_tables.R'))
```
]

[Main script with functions](#main-with-functions)

---
# Main script as .Rmd

- In this class, your problem sets will be `.Rmd` files that you knit

- The `.Rmd` file will serve as your main script

- You can `source()` modular code files in code chunks

- It improves chances your code runs from start to finish instead of only when working interactively
  - Means I can run (and grade) your code more easily!

---
# What's a housekeeping file?

A housekeeping file .hi[automates] several setup tasks and is run at the start of every script in your project

1. Load packages
2. Save directory paths as strings, built with the `file.path()` function, to reference elsewhere<sup>1</sup>
  - If a folder name changes, you only need to change it in one place in your code
  - Use these strings to reference files in other scripts `read.csv(file.path(data_raw,'my_data.csv'))`
3. Create directories if they don't exist

.footnote[<sup>1</sup> Alternatives to `file.path()` include `paste` and `here()`. Check [appendix example](#here) for more information.]

---
class: white-slide

``` r
# Housekeeping.R
# By: Your Name
# Date: YYYY-MM-DD
# What: This script loads the packages and sets up the directories needed for the analysis.

## Package installation -- uncomment if running for the first time
#install.packages(c('tidyverse'))
library(tidyverse)
library(haven) # installed by tidyverse

## Directory objects
data_dir <- file.path('data')
raw_dir <- file.path(data_dir,'raw')
clean_dir <- file.path(data_dir,'clean')
output_dir <- file.path('output')
code_dir <- file.path('code')
processing_dir <- file.path(code_dir,'processing')
analysis_dir <- file.path(code_dir,'analysis')
documentation_dir <- file.path('documentation')

# Create directories (no warning if they already exist)
for (dir in c(data_dir, raw_dir, clean_dir, documentation_dir,
              code_dir, processing_dir, analysis_dir, output_dir)) {
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
}
```

---
name: abstraction
# 4. Abstraction

- .hi[Abstraction:] "reducing the complexity of something by hiding unnecessary details from the user"

- e.g. A dishwasher. I mainly need to know how to load it, put in soap, and press start. I don't need to understand the electrical wiring or plumbing.

- In programming, abstraction is usually handled with functions

- Abstraction is usually a good thing

- But you can go too far: overly abstract code can be "impenetrable" and difficult to modify or debug

#### Gentzkow & Shapiro give three rules for abstraction:

1. Abstract to eliminate redundancy

2. Abstract to improve clarity

3. Otherwise, don't abstract

- In the context of R, abstraction means:
  - Write functions
  - Name your objects sensibly

---

# Abstract to eliminate redundancy

- Sometimes you might find yourself repeating lines of code to accomplish a task

```r
# Downloading a sequence of files from 2004 to 2020 gets tedious
download.file('https://data.nber.org/tax-stats/zipcode/2020/zipcode2020.zip',destfile=file.path(data_dir,'zipcode2020.zip'))
download.file('https://data.nber.org/tax-stats/zipcode/2019/zipcode2019.zip',destfile=file.path(data_dir,'zipcode2019.zip'))
download.file('https://data.nber.org/tax-stats/zipcode/2019/zipcode2019.zip',destfile=file.path(data_dir,'zipcode2018.zip'))
# etc.
```

Notice any problems?

``` r
# Downloading a sequence of files from 2004 to 2020 with a loop
lapply(2004:2020,function(y) {
    download.file(paste0('https://data.nber.org/tax-stats/zipcode/',y,'/zipcode',y,'.zip'),destfile=file.path(data_dir,paste0('zipcode',y,'.zip')))
})
```

- We'll learn more about iteration/for loops/apply statements later

- There are many forms of redundancy that can be eliminated with abstraction beyond iteration
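
One such form is pulling the repeated string-building into a named function. The sketch below is an assumption-laden illustration: `build_zipcode_download()` and its arguments are hypothetical names, and it only constructs the URL and destination so that it runs offline:

```r
# Hypothetical helper: the function name and arguments are illustrative,
# not from the lecture
build_zipcode_download <- function(year, dest_dir = "data") {
    file_name <- paste0("zipcode", year, ".zip")
    list(
        url = paste0("https://data.nber.org/tax-stats/zipcode/", year, "/", file_name),
        destfile = file.path(dest_dir, file_name)
    )
}

# Build (but do not download) the target for every year
targets <- lapply(2004:2020, build_zipcode_download)

# To actually download, pass each target to download.file(); commented out
# so that this sketch runs without a network connection
# for (target in targets) download.file(target$url, destfile = target$destfile)
```

Because the year appears once as an argument, the URL and the destination file cannot drift apart the way they can in copy-pasted lines.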

---
# Abstract to improve clarity

Say you want to round a number to the nearest multiple of several different integers:

1. Divide the number by the base integer
2. Round the result to the nearest whole number
3. Multiply by the base integer

I start coding and copy and paste the code for each integer:

``` r
rounded_157_nearest_5 <- round(157/5)*5
rounded_157_nearest_7 <- round(157/5)*7
```

Notice a problem?

Why not abstract with a function?

``` r
round_to_nearest <- function(x,base=5) {
    return(round(x/base)*base)
}

rounded_157_nearest_5 <- round_to_nearest(157,base=5)
rounded_157_nearest_7 <- round_to_nearest(157,base=7)
```

The second approach is easier to read, and it is clearer what the code is doing!

---
# Otherwise, don't abstract

1. Write/use functions for tasks that are repeated
2. Write thoughtful variable names (e.g. `household_income`, `household_size` rather than `x100`, `x101`)

- If we're only doing it once in our script, then it may not make sense to use the function version

- This discussion points out that it can be difficult to know if one has reached the optimal level of abstraction

- As you're starting out programming, I would advise doing almost everything inside of a function (i.e. err on the side of over-abstraction when starting out)
  - And look for opportunities to loop (or use apply functions)

---
name: documentation
# 5. Documentation

Documentation gives sufficient information to replicate work, but not so much that it is overwhelming<sup>1</sup>

### Rules for documentation

1. Don't write documentation you will not maintain

2. Code should be self-documenting

- Generally speaking, commented code is helpful

- However, sometimes it can be harmful if, e.g. code comments contain dynamic information

- It may not be helpful to have to rewrite comments every time you change the code

- Code can be "self-documenting" by naming functions and variables thoughtfully
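
A tiny sketch of the contrast, with made-up numbers and variable names. Both halves compute the same thing, but only the first needs comments to be understood:

```r
# Opaque names: the comments carry all the meaning and must be kept up to date
x100 <- c(52000, 61000, 48000)  # household income
x101 <- c(2, 4, 3)              # household size
y <- x100 / x101

# Self-documenting names: the code explains itself
household_income <- c(52000, 61000, 48000)
household_size <- c(2, 4, 3)
income_per_person <- household_income / household_size
```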

.footnote[<sup>1</sup> Anyone who has ever built IKEA furniture knows this all too well]

#### Documentation in R

- .hi[R Help System:] access using `?function_name`

- .hi[Package vignettes:] access using `vignette("vignette_name")`

- .hi[Cheatsheets:] access at [Posit Cheatsheets](https://posit.co/resources/cheatsheets/)

---
# A README is documentation

- A README gives high-level information about the repository or data file:
  - This repository contains code that does X task
  - Simple use case: use this repository to replicate paper X in journal Y

- Onboarding instructions:
  - Add your name to this file in repository folder `the/folder/file.md`
  - Fork the repository and pull request changes
  - Configure your computer settings in this way to run this project
  - Guidelines/rules for contributing to the project

- Licensing information:
  - You can just take this code!
  - This is proprietary and we will sue you if you haven't paid us

- Dependencies:
  - To use this code, package, or data, download packages `X`, `Y`, `Z`

- Changelog (short narrative commit history):
  - 9/23/2023 - KGC - added function `X` to do `Y`

---
# Documentation and problem sets

Documentation inevitably creates a host of issues on assignments.

It is challenging to give narrative technical instructions:

On a blank problem set:

- The reader (you) still needs to engage thoughtfully with the task
- The writer (me) needs to account for many misinterpretations!

On a completed problem set:

- The reader (me) is trying to guess what you were thinking
- The writer (you) may have made a mistake and not realized it

#### This challenge is a feature, not a bug.

My assignments are a learning experience of the [robustness principle/Postel's law](https://en.wikipedia.org/wiki/Robustness_principle)<sup>1</sup> (for people):

> "Be conservative in what you send, be liberal in what you accept." - Jon Postel

He was talking about internet protocols, but I see it as a general principle for communication:

- Conservative: Make instructions as clear as possible
- Liberal: Give the benefit of the doubt and try to engage thoughtfully with documentation

Basically, be empathetic and hold your own technical work to a higher standard.

P.S. I also give out bonus points for fixing typos on my assignments because it is so important!

.footnote[<sup>1</sup> This quote was originally in reference to how to design programs that send and receive data.]

---
name: time-task
# 6. Time and task management

### Time management

- Time management is key to writing clean code<sup>1</sup>

- It is foolish to think that one can write clean code in a strained mental state

- Code written when you are groggy, overly anxious, or distracted will come back to bite you

- Schedule long blocks of time (1.5 hours - 3 hours) to work on coding where you eliminate distractions (email, social media, etc.)

- Stop coding when you feel that your focus or energy is dissipating

.footnote[<sup>1</sup> Your professor needs this lecture too]

### Task management

- When collaborating on code, avoid email or Slack threads to discuss coding tasks

- Rather, use a task management system that has dedicated messages for a particular point of discussion (bug in the code, feature to develop, etc.)

- I use GitHub issues and milestones for all of my coding projects including [developing this class](https://github.com/orgs/big-data-and-economics/projects/2)

---
name: test-driven
# 7. Test-driven development

- The only way to know that your code works is to test it!

- Test-driven development (TDD) consists of a suite of tools for writing code that can be automatically tested

- Simplest test is to check if the code gives you the output you expected
  - Whenever you make a change, check it against the output you expect
  - Ideally, check against a small example so it runs fast and is easy to confirm

- What if the code takes too long to check completely? Meet .hi[unit tests]

- .hi[Unit testing] is nearly universally used in professional software development

---
# Unit testing

- Unit tests are scripts that check that a piece of code does everything it is supposed to do

- When professionals write code, they also write unit tests for that code at the same time

- If code doesn't pass tests, then bugs are caught immediately

- R's [dplyr package](https://github.com/tidyverse/dplyr) shows that all unit tests are passing and that tests cover 91% of the code base

- The [testthat](https://testthat.r-lib.org/) package makes this straightforward in R, and its documentation is a nice step-by-step guide (I use it to autograde exercises)

### Assertions

- Assert statements are extremely useful for basic unit tests
- They exist in nearly every language
- In R, the function is `stopifnot()`

```r
x <- TRUE
stopifnot(x)

y <- FALSE
stopifnot(y)
```

```
## Error: y is not TRUE
```
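
As a sketch of assertions used as unit tests, the checks below exercise the `round_to_nearest()` function from the abstraction slides on inputs whose answers are easy to verify by hand. The named form of `stopifnot()` assumes R 4.0 or later; a package such as testthat offers a richer version of the same idea.

```r
round_to_nearest <- function(x, base = 5) {
    return(round(x / base) * base)
}

# Each check is a small unit test; the name is the message shown on failure
stopifnot(
    "157 to the nearest 5 should be 155" = round_to_nearest(157, base = 5) == 155,
    "157 to the nearest 7 should be 154" = round_to_nearest(157, base = 7) == 154,
    "should work on vectors" = all(round_to_nearest(c(12, 13), base = 5) == c(10, 15)),
    "result should be a multiple of base" = round_to_nearest(157, base = 7) %% 7 == 0
)
```

If a later edit breaks the function, the script stops at the first failing check instead of silently producing wrong numbers.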
---
name: pair-programming

# 8. Pair programming - work with a buddy

- An essential part of clean code is reviewing code

- An excellent way to review code is to do so at the time of writing

- .hi[Pair programming] involves sitting two programmers at one computer

- One programmer does the writing while the other reviews

- This is a great way to spot silly typos and other issues that would extend development time

- It's also a great way to quickly refactor code at the start

- .hi[I strongly encourage you to do pair programming on problem sets in this course!]
    - (Sometimes I will require it)

---
name: mre
# Minimal reproducible example<sup>1</sup>

- Related to unit testing are minimal reproducible examples (aka MRE, reprex, minreps,...)

- The best way to isolate bugs is a minimal reproducible example

- If code throws an error, there are likely superfluous lines of code that are irrelevant to the error
  - The superfluous stuff makes it harder to read and replicate the error

- [Minimal reproducible examples](https://stackoverflow.com/help/minimal-reproducible-example) (reprex) are a great way to isolate the error
    - **Minimal** Use as little code as possible that still produces the same problem
    - **Complete** Provide all parts needed to reproduce your problem in the question itself
    - **Reproducible** Test the code you'll provide to make sure it reproduces the problem

- That means you should be able to copy and paste the code and run it yourself
    - Name all packages and data needed to reproduce error
    - Cut out packages, steps, and data that are not relevant to the error

- Sometimes writing one will help you find the bug, sometimes it'll help a stranger find the bug in your code faster, and sometimes it'll identify a very real bug in the package itself

.footnote[<sup>1</sup> Postel's Law in action.]

---

# Min Reprex from [RStudio community](https://community.rstudio.com/t/faq-how-to-do-a-minimal-reproducible-example-reprex-for-beginners/23061)

- If someone does not have `hrbrthemes` installed, they will not be able to run the code below
  - You can remove this package from your code and still reproduce the error.

```r
library(ggplot2) #For ggplot
library(dplyr) #For mutate and %>%
library(datasets) #To load iris
library(hrbrthemes) #For the theme
data(iris)
df <- iris %>%
    mutate(Sepal.Length = Sepal.Length * 1000,
           Sepal.Width = Sepal.Width * 1000)

ggplot(data = df,x = Sepal.Length, y = Sepal.Width) +
    theme_modern_rc() +
    geom_point() +
    scale_x_log10() +
    labs(title = "Iris Sepal Width vs. Sepal Length",
         subtitle = "Log10 Scaled X Axis")
```

```
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y
```

---
# How to write MREs

Cut out the unnecessary steps

``` r
library(ggplot2)
dat <- iris[1:4,]
ggplot(data = dat, x = Sepal.Length, y = Sepal.Width) +
    geom_point()
```
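
For completeness, the bug this MRE isolates is that `x` and `y` are passed directly to `ggplot()` instead of being mapped inside `aes()`. A minimal sketch of the corrected call, assuming ggplot2 is installed (`fixed_plot` is an illustrative name):

```r
library(ggplot2)

dat <- iris[1:4, ]

# x and y are mapped inside aes(), so geom_point() can find them
fixed_plot <- ggplot(data = dat, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point()
```

Printing `fixed_plot` draws the scatterplot. With only four rows and one layer, there is very little left that could hide the mistake.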

- You can use [reprex](https://reprex.tidyverse.org/) to make sure that your code is reproducible by others and [dput](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dput) to make sure that your data is reproducible by others.

``` r
dput(iris[1:4,]) # copy/paste output of dput into your MRE
```

```
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
## 3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
## 0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), levels = c("setosa", 
## "versicolor", "virginica"), class = "factor")), row.names = c(NA, 
## 4L), class = "data.frame")
```

---
# A complete MRE<sup>1</sup>
.scroll-output[
#### Summary 
When I try to make a scatterplot with ggplot, I am told that `geom_point()` requires missing aesthetics `x` and `y`. But I specified `x` and `y` in the `ggplot()` function. Can you help resolve?

#### Expected behavior

I expected the code to produce a scatterplot of `Sepal.Length` and `Sepal.Width` from the iris dataset.

#### Data

I used a subset of the iris dataset.

``` r
dat <- iris[1:4,]
```

#### Code and error message

``` r
ggplot(data = dat, x = Sepal.Length, y = Sepal.Width) +
    geom_point()
```

#### Session Info (use `sessionInfo()` to generate this)

```
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] skimr_2.1.5       gapminder_1.0.0   causaldata_0.1.3  roxygen2_7.3.1   
##  [5] here_1.0.1        aqp_2.0.3         fontawesome_0.5.2 hrbrthemes_0.8.7 
##  [9] RefManageR_1.4.0  lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1    
## [13] dplyr_1.1.4       purrr_1.0.2       readr_2.1.5       tidyr_1.3.1      
## [17] tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0   knitr_1.46       
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5            xfun_0.44               bslib_0.7.0            
##  [4] lattice_0.22-6          tzdb_0.4.0              vctrs_0.6.5            
##  [7] tools_4.3.3             generics_0.1.3          curl_5.2.1             
## [10] fansi_1.0.6             cluster_2.1.6           pkgconfig_2.0.3        
## [13] data.table_1.15.4       lifecycle_1.0.4         farver_2.1.2           
## [16] compiler_4.3.3          munsell_0.5.1           repr_1.1.7             
## [19] codetools_0.2-20        httpuv_1.6.15           fontquiver_0.2.1       
## [22] fontLiberation_0.1.0    htmltools_0.5.8.1       sass_0.4.9             
## [25] yaml_2.3.8              Rttf2pt1_1.3.12         pillar_1.9.0           
## [28] later_1.3.2             crayon_1.5.2            jquerylib_0.1.4        
## [31] extrafontdb_1.0         gfonts_0.2.0            cachem_1.1.0           
## [34] mime_0.12               fontBitstreamVera_0.1.1 tidyselect_1.2.1       
## [37] digest_0.6.35           stringi_1.8.4           xaringan_0.30          
## [40] rprojroot_2.0.4         extrafont_0.19          bibtex_0.5.1           
## [43] fastmap_1.2.0           grid_4.3.3              colorspace_2.1-0       
## [46] cli_3.6.2               magrittr_2.0.3          base64enc_0.1-3        
## [49] crul_1.4.2              utf8_1.2.4              withr_3.0.0            
## [52] gdtools_0.3.7           scales_1.3.0            promises_1.3.0         
## [55] backports_1.5.0         sp_2.1-4                timechange_0.3.0       
## [58] rmarkdown_2.27          httr_1.4.7              hms_1.1.3              
## [61] shiny_1.8.1.1           evaluate_0.23           rlang_1.1.3            
## [64] Rcpp_1.0.12             xtable_1.8-4            glue_1.7.0             
## [67] httpcode_0.3.0          xml2_1.3.6              jsonlite_1.8.8         
## [70] R6_2.5.1                plyr_1.8.9              systemfonts_1.1.0
```
]

.footnote[<sup>1</sup> Every forum has its own approach to MREs. Sometimes session info is not initially needed!]

---
# Try to write an MRE!

- Sync your fork of the exercise repository and open the folder for [mre-exercise](https://github.com/big-data-and-economics/exercises/tree/main/02-mre-exercise)
- The file `mre-opatlas.Rmd` has a bug in it that has led to a host of problems when you look at the knit output, `mre-opatlas.md`
- Try to write an MRE
- I have already raised this as a poorly-written issue on GitHub. You can see the issue [here](https://github.com/big-data-and-economics/exercises/issues/69)
- Tips: https://aosmith16.github.io/spring-r-topics/slides/week09_reprex.html#1

---
class: inverse, center, middle
name: appendix
# Appendix

---
# Shifting directories

### Help! I need to run code from `code`, but need a file in `data/raw/file.csv`!

- You can use relative paths to navigate between directories
- `..` means "go up one directory"
  - `../data/raw` means "go up one directory, then down into `data/raw`"
- `.` means "stay in the current directory"
  - `./code/build` means "stay in the current directory, then down into `code/build`"
- `../..` means "go up two directories"
  - `../../data/raw` means "go up two directories, then down into `data/raw`"

Play around with them yourself!
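
One way to play around safely is in a throwaway folder. The sketch below builds a hypothetical `my_project` layout inside R's temporary directory (the folder and file names are made up for illustration), moves into `code`, and reads a file from `data/raw` using `..`.

```r
# Build a throwaway project in a temporary directory (hypothetical layout)
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
writeLines("x,y\n1,2", file.path(project, "data", "raw", "file.csv"))

# Pretend the script is being run from the code folder
old_wd <- setwd(file.path(project, "code"))

# ".." goes up to my_project, then we go down into data/raw
found <- file.exists("../data/raw/file.csv")
df <- read.csv("../data/raw/file.csv")

# Return to where we started
setwd(old_wd)
```

If `found` is `FALSE` in your own project, check `getwd()` first: relative paths are always resolved from the current working directory.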

---
name: main-with-functions
# Main script with functions

```r
#File: main.Rmd or main.R
#By: Kyle Coombs
#What: Runs the project from start to finish in R
#Date: 2023/09/12

#Packages needed to read each file type
library(readr)  # read_csv()
library(arrow)  # read_parquet()
library(haven)  # read_dta()

#Assumes `raw` (path to the raw data folder, ending in "/") and the
#helper functions used below are already defined, e.g. in sourced scripts

#Import files
df1 <- read_csv(paste0(raw,'file1.csv'))
df2 <- read_parquet(paste0(raw,'file2.parquet'))
df3 <- read_dta(paste0(raw,'file3.dta'))

#Clean files
cleaned_df1 <- clean_df1(df1)
cleaned_df2 <- clean_df2(df2)
cleaned_df3 <- clean_df3(df3)

#Merge files 1 and 2
merged_df1_df2 <- merge(cleaned_df1, cleaned_df2, by=c('merge','vars'))

#Append file 3 to the merged files 1 and 2
append_df1_df2_df3 <- rbind(merged_df1_df2, cleaned_df3)

#Analysis
sum_stats <- summary_stats(append_df1_df2_df3,stats=c('mean','median','max'))
reg_results <- basic_regression(append_df1_df2_df3)

#Figures and tables will likely be made with a host of R packages
make_sum_figures(sum_stats)
make_figures(reg_results)
make_sum_tables(sum_stats)
make_tables(reg_results)
```

[Back to main](#main-script)
---

# Textbooks: Smarter people than me

- Cunningham (2021) [Causal Inference: The Mixtape](https://www.amazon.com/Causal-Inference-Mixtape-Scott-Cunningham/dp/0300251688) (Also, [free version on his website](https://mixtape.scunning.com/))
- Huntington-Klein (2022) [The Effect](https://theeffectbook.net/introduction.html)
- Angrist and Pischke (2009) [Mostly Harmless Econometrics](http://www.amazon.com/Mostly-Harmless-Econometrics-Empiricists-Companion/dp/0691120358/) (MHE)
- Morgan and Winship (2014) [Counterfactuals and Causal Inference](http://www.amazon.com/Counterfactuals-Causal-Inference-Principles-Analytical/dp/1107694167/) (MW)
- Sweigart (2019) [Automate The Boring Stuff With Python](https://automatetheboringstuff.com/)
- Wickham (2023) [Advanced R](http://adv-r.had.co.nz/)
- Wickham and Grolemund (2023) [R for Data Science](https://r4ds.had.co.nz/)
- Peng (2022) [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/)
---

# Non-textbook readings

- The help documentation associated with your language (no really)
- Jesse Shapiro's "How to Present an Applied Micro Paper"
- Gentzkow and Shapiro's coding practices manual
- Ljubica "LJ" Ristovska's language agnostic guide to programming for economists
- Grant McDermott on [version control using GitHub](https://raw.githack.com/uo-ec607/lectures/master/02-git/02-Git.html#1)

---

# Helpful for troubleshooting

- The help documentation associated with your language (no really)
- All languages: [Stack Overflow](https://stackoverflow.com), [Stack Exchange](https://stackexchange.com)
- Stata-specific (all hail Nick Cox): [Statalist](https://www.statalist.org/forums/forum/general-stata-discussion/general)
- Cheatsheets: [Stata](https://www.stata.com/bookstore/statacheatsheets.pdf), [RStudio](https://www.rstudio.com/resources/cheatsheets/), [Python](https://betterprogramming.pub/10-must-have-python-cheatsheets-2b74e8097bc3?gi=cfdb14820caa)
- Me: [Sign up for office hours](https://calendar.google.com/calendar/u/1/selfsched?sstoken=UUF5d0hzbmlvemxVfGRlZmF1bHR8NDRjMWFiMjA5OTNkNzMwNTVkYzBkYWYyYzc2NmQ5Yjc/)

---

# Learn by Immersion

- Just like learning a real language, no amount of talking today will teach you how to use any program.
  - You have to need to use it (immersion) to learn it.
  - Google is your dictionary.
  - Help files are your grammar books.
  - ChatGPT is your phrasebook.
  - A great way to start coding is to see lots of other people's code and copy what you read.

- You must learn how to ask the “right” question:
  - Never: "Importing csv file into R not working."
  - Better: "read_csv R [specific error message]."
  - Better still: "read_csv tidyverse [specific error message]."

---

# Abstract to eliminate redundancy (cont.)

What if you can't find an R function? Write your own!

```r
set.seed(16)
prod1  = rnorm(1, 0, 1)*rnorm(1,4,6)
prod2  = rnorm(2, 0, 1)*rnorm(2,0,1)
prod3  = rnorm(3, 0, 1)*rnorm(3,15,78)
print(prod1)
## [1] 1.547257
print(prod2)
## [1] 1.2582691 0.6764943
print(prod3)
## [1] -60.06036  10.11156  24.32342
```

```r
set.seed(16)
multiply_normals = function(count,mean1=0,sd1=1,mean2=0,sd2=1) {
    prod = rnorm(count,mean1,sd1)*rnorm(count,mean2,sd2)
    return(prod)
}
prod1=multiply_normals(1,mean2=4,sd2=6)
prod2=multiply_normals(2,mean2=0,sd2=1)
prod3=multiply_normals(3,mean2=15,sd2=78)

print(prod1)
## [1] 1.547257
print(prod2)
## [1] 1.2582691 0.6764943
print(prod3)
## [1] -60.06036  10.11156  24.32342
```

---
# Note on seeds

- When randomizing in any language, you aren't really randomizing

- You're drawing pseudo-random numbers from a deterministic sequence

- If you set the seed, you can reproduce the same "random" numbers

- This is useful for debugging and sharing code

- Use `set.seed` in R

```r
set.seed(0)
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 17.26652
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 15.14712
# New seed
set.seed(1)
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 13.72156
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 16.10432
# Reset seed
set.seed(0)
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 17.26652
print(rnorm(1)+rnorm(1,5)+rnorm(1,10))
## [1] 15.14712
```

---

# Make your own documentation

- R has an excellent documentation package called `roxygen2`

- Its special `#'` comments sit directly above a function and make the code more readable

- Here's an example:

```r
library(roxygen2)
#' This is a sample function
#'
#' This function does something amazing.
#'
#' @param x A numeric input.
#' @return The result of the amazing operation.
#' @examples
#' amazing_function(5)
amazing_function <- function(x) {
  # function implementation
}
```

- Use `roxygen2::roxygenise()` to generate documentation for all functions in a package
- Read more [here](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html)
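
As a more concrete sketch, here is how the `multiply_normals()` function from the abstraction slides might be documented. The wording of the descriptions is only illustrative. The `#'` lines are ordinary comments to R, so the function runs whether or not `roxygen2` is installed.

```r
#' Multiply draws from two normal distributions
#'
#' Draws count values from each of two normal distributions and
#' multiplies them element by element.
#'
#' @param count Number of draws from each distribution.
#' @param mean1,sd1 Mean and standard deviation of the first distribution.
#' @param mean2,sd2 Mean and standard deviation of the second distribution.
#' @return A numeric vector of length count.
#' @examples
#' multiply_normals(3, mean2 = 15, sd2 = 78)
multiply_normals <- function(count, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1) {
  rnorm(count, mean1, sd1) * rnorm(count, mean2, sd2)
}

# The function works as usual; the comments only matter to roxygen2
draws <- multiply_normals(3, mean2 = 15, sd2 = 78)
```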

---
name: refactor
# Refactoring

- Refactoring refers to the action of restructuring code without changing its external behavior or functionality. Think of it as "reorganizing"

.scroll-box-8[

``` r
get_some_data <- function(config, outfile) {
  if (config_ok(config)) {
    if (can_write(outfile)) {
      if (can_open_network_connection(config)) {
        data <- parse_something_from_network()
        if(makes_sense(data)) {
          data <- beautify(data)
          write_it(data, outfile)
          return(TRUE)
        } else {
          return(FALSE)
        }
      } else {
        stop("Can't access network")
      }
    } else {
      ## uhm. What was this else for again?
    }
  } else {
    ## maybe, some bad news about ... the config?
  }
}
```
]
After refactoring, this becomes:
.scroll-box-8[

``` r
get_some_data <- function(config, outfile) {
  # Check for each problem up front and exit early
  if (config_bad(config)) {
    stop("Bad config")
  }

  if (!can_write(outfile)) {
    stop("Can't write outfile")
  }

  if (!can_open_network_connection(config)) {
    stop("Can't access network")
  }

  data <- parse_something_from_network()
  if(!makes_sense(data)) {
    return(FALSE)
  }

  # Everything checked out, so do the actual work
  data <- beautify(data)
  write_it(data, outfile)
  TRUE
}
```
]
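- The logic on the successful path is unchanged: the nested `if`/`else` blocks became early exits, and the failure cases that were silently ignored now stop with an explicit error

- The new version is more readable and may run faster. For valid inputs, the output is unchanged.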

- Refactoring could also mean reducing the number of input arguments

- Jenny Bryan gave a [great talk](https://www.youtube.com/watch?v=7oyiPBjLAWY) on refactoring
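
The helper functions in the slide's example are placeholders, so it cannot be run directly. Below is a small runnable sketch of the same early-exit idea, using two hypothetical functions (`mean_log_nested()` and `mean_log_flat()`, invented here for illustration) that compute the mean of logged values. Comparing their output is a simple way to confirm a refactor did not change behavior.

```r
# Nested version: the real work is buried three levels deep
mean_log_nested <- function(x) {
  if (is.numeric(x)) {
    if (length(x) > 0) {
      if (all(x > 0)) {
        return(mean(log(x)))
      } else {
        stop("x must be strictly positive")
      }
    } else {
      stop("x is empty")
    }
  } else {
    stop("x must be numeric")
  }
}

# Refactored version: rule out each problem first, then do the work
mean_log_flat <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  if (length(x) == 0) stop("x is empty")
  if (!all(x > 0)) stop("x must be strictly positive")
  mean(log(x))
}

# Same input, same output: the external behavior is unchanged
x <- c(1, 10, 100)
same_result <- identical(mean_log_nested(x), mean_log_flat(x))
```

A check like `same_result` is the seed of a unit test: keep it around and rerun it every time you restructure the function.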

---
name: profiling
# Profiling

- Profiling refers to checking the resource demands of your code

- How much processing time does your script take? How much memory?

- Clean code should be highly performant: it uses minimal computational resources

- Profiling and refactoring go hand in hand, along with unit testing, to ensure that code is maximally optimized

- Here are two intro guides to profiling in R:
  - Using `system.time` and `Rprof` from [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/profiling-r-code.html)
  - Using `lineprof` from [Advanced R](http://adv-r.had.co.nz/Profiling.html)
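
As a minimal sketch of the simplest of these tools, the example below uses `system.time()` to time two ways of computing the same quantity. The two functions are hypothetical and exist only for illustration; exact timings will differ across machines, so no expected times are shown.

```r
# Hypothetical task: sum of squares of the first n integers

# Loop version: adds one term at a time
slow_sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i^2
  }
  total
}

# Vectorized version: squares everything at once, then sums
fast_sum_squares <- function(n) {
  sum(seq_len(n)^2)
}

n <- 1e6
slow_time <- system.time(slow_result <- slow_sum_squares(n))
fast_time <- system.time(fast_result <- fast_sum_squares(n))

# Check the answers agree before comparing speed
results_match <- isTRUE(all.equal(slow_result, fast_result))

# "elapsed" is wall-clock time in seconds
slow_time["elapsed"]
fast_time["elapsed"]
```

Checking `results_match` before looking at the timings keeps profiling tied to testing: a faster function that returns a different answer is not an improvement.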

[Back to MREs](#mres)

---

# Neat R functions to help reduce redundancy

```r
set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
    list1[[i]] = multiply(i) # Save output in list for each iteration
}
list1
```

```
## [[1]]
## [1] 1.547257
## 
## [[2]]
## [1] 11.934479 -1.717951
## 
## [[3]]
## [1] -7.4831177  0.9587218  4.7882622
```

A better way to eliminate this redundancy is to use the `map` function from the `purrr` package (part of the tidyverse):

```r
set.seed(16)
map(1:3, multiply)
```

```
## [[1]]
## [1] 1.547257
## 
## [[2]]
## [1] 11.934479 -1.717951
## 
## [[3]]
## [1] -7.4831177  0.9587218  4.7882622
```

- More on these later!
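
If `purrr` is not installed, base R's `lapply()` follows the same pattern. The sketch below uses `multiply_normals()` as defined on the abstraction slide (not the `multiply()` function above, whose definition is not shown here) and checks that the loop and `lapply()` give identical results when the seed is the same.

```r
# Function from the abstraction slide
multiply_normals <- function(count, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1) {
  rnorm(count, mean1, sd1) * rnorm(count, mean2, sd2)
}

# Loop version: build the list one element at a time
set.seed(16)
loop_out <- list()
for (i in 1:3) {
  loop_out[[i]] <- multiply_normals(i)
}

# lapply version: apply the function to each element of 1:3
set.seed(16)
apply_out <- lapply(1:3, multiply_normals)

# Same seed, same sequence of calls, same output
same_output <- identical(loop_out, apply_out)
```

Extra arguments are passed along after the function name, e.g. `lapply(1:3, multiply_normals, mean2 = 4, sd2 = 6)`.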

---
name: here
# Alternative to file.path is `here()`

- Better yet is the [here](https://cran.r-project.org/web/packages/here/vignettes/here.html) package
    - `here()` will find the root directory of your project and then you can navigate from there

```r
#install.packages('here')
library(here)
```

```
## here() starts at C:/Users/kgcsp/OneDrive/Documents/Education/Big Data/big-data-class-materials
```

```r
here::i_am('my_project/code/build/.placeholder')
```

```
## here() starts at C:/Users/kgcsp/OneDrive/Documents/Education/Big Data/big-data-class-materials/lectures/02-empirical-workflow
```

```r
here('data/raw','my_data.csv')
```

```
## [1] "C:/Users/kgcsp/OneDrive/Documents/Education/Big Data/big-data-class-materials/lectures/02-empirical-workflow/data/raw/my_data.csv"
```

- Can be less clunky than `paste()` and `sep="/"`

- Get lost in your directories? Call `here()` with no arguments to print your project's root directory
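
For comparison, here is a small base-R sketch of the `paste()` and `file.path()` approaches that `here()` replaces. The folder and file names are hypothetical.

```r
# Hypothetical project folder
root <- "my_project"

# paste() works, but you must remember to set the separator
path_paste <- paste(root, "data", "raw", "my_data.csv", sep = "/")

# file.path() inserts the separators for you
path_file <- file.path(root, "data", "raw", "my_data.csv")

# Both build the same string
same_path <- identical(path_paste, path_file)
```

Both approaches build a path relative to wherever the code is run from, which is the problem `here()` addresses by anchoring paths at the project root.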