class: center, middle, inverse, title-slide .title[ # Data Science for Economists ] .subtitle[ ## Lecture 6a: Web Data in Research ] .author[ ### Kyle Coombs (he/him/his) ] .date[ ### Bates College |
EC/DCS 368
] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } </style> # Table of contents 1. [Prologue](#prologue) 2. [Worldwide Web of Data](#web-data) 3. [Examples of scraping in economics research](#examples-of-scraping-in-economics-research) 4. [Access methods](#access-methods) - [Click and Download](#click-and-download) - [Client-side scraping](#client-side-scraping) - [Server-side scraping](#server-side-scraping) 5. [Ethics of web scraping](#ethics-of-web-scraping) --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Prologue - We've spent the first month of this class on learning: - empirical organization skills ("Clean Code") - basics of R - basics of data wrangling and tidy data - Now we're going to tackle data acquisition via **scraping** - Essentially, we're going to learn how to get data from the web - As context, everything I am showing you today assumes you've: 1. Found data on the web you want 2. Found the relevant way to access it (APIs vs. CSS) 3. Learned the specifics needed to access the data (e.g. the name of a series, have an API key, the rough HTML structure) - These data are usually messy in one way or another, so it'll give you something to tidy - Extended demos for this lecture are available in [Web APIs](https://raw.githack.com/big-data-and-economics/big-data-class-materials/main/lectures/07-web-apis/07-web-apis.html) and [Web Scraping](https://raw.githack.com/big-data-and-economics/big-data-class-materials/main/lectures/06-web-css/06-web-css.html) --- # Plan for today - What is scraping? - Contrast client-side and server-side scraping - Examples of scraping in economics research - Ethical considerations - Learn by doing with APIs (CSS will happen later -- potentially end of semester) --- # Attribution - These slides take inspiration from the following sources: - [Nathan Schiff's web data lecture](https://nathanschiff.com/wp-content/uploads/2017/02/web_data_lecture.pdf) - [Andrew MacDonald's slides](https://stat545.com/supporting-docs/webdata01_slides.html#1) - [Jenny Bryan's textbook](https://stat545.com/web-data-slides.html) - [Grant McDermott's notes on CSS](https://raw.githack.com/uo-ec607/lectures/master/06-web-css/06-web-css.html) and [APIs](https://raw.githack.com/uo-ec607/lectures/master/07-web-apis/07-web-apis.html) - [James Densmore's stance on ethics](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) --- class: inverse, center, middle name: web-data # Worldwide Web of Data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Worldwide Web of Data - Every website you visit is packed with data - Every app on your phone is packed with data and taking data from you - Guess what? -- - These data often measure hard-to-measure things - These data are often public (at some level of aggregation/anonymity) - These data are often not easily accessible and not **tidy** - Samples might be biased (you have to navigate that) - This is legal (usually) and ethical (usually) - Guess what? All this makes these data (and knowing how to access them) valuable - It also makes this a hard skill to pick up --- class: inverse, center, middle name: examples-of-scraping-in-economics-research # Examples of scraping in economics research --- # What cool things can you do with web data?
- Can anyone think of examples of web data being used in economics research? --- # Measuring hard-to-measure things - Imagine you survey a ton of people about their beliefs that a candidate is unfit to be president because of their race - Due to social desirability bias, you get a lot of "I don't know" or "I don't think that" - There are lots of creative survey methods to get at this, but is there some way to measure this without asking people? - Say, why not find out the frequency that people search Google for racial epithets in connection to the candidate? - Guess what? Stephens-Davidowitz (2014) did just that - Finds racial animus cost Barack Obama 4 percentage points in the 2008 election (equivalent of a home-state advantage) - Google search term data yield effects that are 1.5 to 3 times larger than survey estimates of racial animus <!-- `\(y=\beta_0 + \beta_1 \text{Racially Charged Search Rate} + \epsilon\)` --> `\(\text{Racially Charged Search Rate}_j = \left[\frac{\text{Google searches including the word "Word 1(s)"}}{\text{Total Google searches}}\right]_{j, 2004-2007}\)` for `\(j\)` geographical area (state, county, etc.) --- # Racial Animus Map <div align="center"> <img src="pics/racial_animus_map.jpg" height=500> </div> Map of media markets by racially charged search rate from 2004 to 2007. The darker the red, the more racially charged. --- # Election performance <div align="center"> <img src="pics/racial_animus_vote_performance.jpg" height=400> </div> Obama underperformed Kerry in areas with higher racially charged search rates. --- # Other uses - "Billion Prices Project" (Cavallo and Rigobon 2015): collects prices from online retailers to look at macro price changes - Davis and Dingel (2016): use Yelp to explore racial segregation in consumption - Halket and Pignatti (2015): scrape Craigslist to look at housing markets - Wu (2018): as an undergraduate, scraped an online economics job market forum to study toxic language and bias against women in academic economics - Glaeser (2018) uses Yelp data to quantify how neighborhood business activity changes as areas gentrify (**Student presentation**) - Tons of papers leverage eBay, Alibaba, etc. to look at all kinds of commercial activity - Edelman (2012) gives an overview of using internet data for economic research --- class: inverse, center, middle name: access-methods # Access methods --- # Access methods There are three ways to get data off the web: 1. **Click-and-download** a "flat" file posted on the internet, like a CSV or Excel file - What you're used to 2. **Client-side** websites contain an empty template; your browser _requests_ data from a server and then fills in the template with the data - The request is sent to an API (application programming interface) endpoint - Technically you can just source right from the API endpoint (if you can find it) and skip the website altogether - I consider this a form of scraping - **Key concepts**: APIs, API endpoints 3. 
**Server-side** websites send HTML and JavaScript to your browser, which then renders the page - People often call this "scraping" - All the data is there, but not in a tidy format - **Key concepts**: CSS, Xpath, HTML - Key takeaway: if there's a structure to how the data is presented, you can exploit it to get the data --- name: click-and-download # Click and Download - You've all seen this approach before - You go to a website, click a link, and download a file - Sometimes you need to log in first, but if not you can automate this with R's `download.file()` function - The code below will download the Occupational Employment and Wage Statistics (OEWS) data for Massachusetts in 2021 from the BLS ``` r download.file("https://www.bls.gov/oes/special.requests/oesm21ma.zip", "oesm21ma.zip", mode = "wb") ## mode="wb" keeps binary files like this zip intact on Windows ``` --- name: client-side-scraping # Client-side scraping - The website contains an empty template of HTML and CSS. - E.g. it might contain a "skeleton" table without any values. - However, when we actually visit the page URL, our browser sends a request to the host server. - If everything is okay (e.g. our request is valid), then the server sends a response script, which our browser executes and uses to populate the HTML template with the specific information that we want. - **Webscraping challenges:** Finding the "API endpoints" can be tricky, since these are sometimes hidden from view. - **Key concepts:** APIs, API endpoints --- # APIs - APIs are a collection of rules/methods that allow one software application to interact with another - Examples include: - Web servers and web browsers - R libraries and R clients - Databases and R clients - Git and GitHub, and so on --- # Key API concepts - **Server:** A powerful computer that runs an API. - **Client:** A program that exchanges data with a server through an API. - **Protocol:** The "etiquette" underlying how computers talk to each other (e.g. HTTP). - **Methods:** The "verbs" that clients use to talk with a server. The main one that we'll be using is GET (i.e. ask a server to retrieve information), but other common methods are POST, PUT and DELETE. - **Requests:** What the client asks of the server (see Methods above). - **Response:** The server's response. This includes a Status Code (e.g. "404" if not found, or "200" if successful), a Header (i.e. meta-information about the response), and a Body (i.e. the actual content that we're interested in). - Not covered? Explicit directions for each API we cover today - Instead, we're covering the nuts and bolts so you can figure out how to use any API --- # API Endpoints - Web APIs have a URL called an **API Endpoint** that you can use to view the data in your web browser - Except instead of rendering a beautifully-formatted webpage, the server sends back a ton of messy text! - Either a JSON (JavaScript Object Notation) or XML (eXtensible Markup Language) file - It'd be pretty overwhelming to learn how to navigate these new language syntaxes - Guess what? R has packages to help you with that - `jsonlite` for JSON - `xml2` for XML - Today we're going to work through a few of these - That means the hardest parts are: - Finding the API endpoint - Understanding the rules - Identifying the parameters you need to use to get the data you want - To be clear, that's all still tricky!
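- To see how endpoint + parameters become a request, here's a minimal sketch (it mirrors the FRED example on the next slides; `YOUR_API_KEY` is a placeholder):

``` r
library(httr)

## Build the full request URL from base URL + endpoint path + query parameters
## modify_url() handles the "?key=value&key=value" encoding for us
modify_url(
  "https://api.stlouisfed.org/",
  path = "fred/series/observations",
  query = list(series_id = "GNPCA", api_key = "YOUR_API_KEY", file_type = "json")
)
```

```
## [1] "https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=YOUR_API_KEY&file_type=json"
```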
--- # You've likely used FRED before <div style="display: flex; justify-content: center;"> <iframe src="https://fred.stlouisfed.org/graph/graph-landing.php?g=yo2J&width=670&height=475" scrolling="no" frameborder="0" style="overflow:hidden; width:670px; height:525px;" allowTransparency="true" loading="lazy" data-external="1"></iframe> </div> --- # Underneath is an API! - The endpoint is https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=<YOUR_API_KEY>&file_type=json - Just sub in your API key and you're good to go - What's an API Key? It is a unique identifier that is used to authenticate access to the data - It's like a password, but it's not a password - It tracks who is using the API and how much they're using it - Fake example: `asdfjaw523a3523414at43sad` - FRED gives you one for free if you [register an API key](https://research.stlouisfed.org/useraccount/apikey) --- # FRED API JSON .scroll-output[ ```json {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","observation_start":"1600-01-01","observation_end":"9999-12-31","units":"lin","output_type":1,"file_type":"json","order_by":"observation_date","sort_order":"asc","count":94,"offset":0,"limit":100000,"observations": [{"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1929-01-01","value":"1202.659"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1930-01-01","value":"1100.67"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1931-01-01","value":"1029.038"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1932-01-01","value":"895.802"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1933-01-01","value":"883.847"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1934-01-01","value":"978.188"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1935-01-01","value":"1065.716"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1936-01-01","value":"1201.443"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1937-01-01","value":"1264.393"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1938-01-01","value":"1222.966"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1939-01-01","value":"1320.924"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1940-01-01","value":"1435.656"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1941-01-01","value":"1690.844"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1942-01-01","value":"2008.853"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1943-01-01","value":"2349.125"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1944-01-01","value":"2535.744"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1945-01-01","value":"2509.982"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1946-01-01","value":"2221.51"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1947-01-01","value":"2199.313"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1948-01-01","value":"2291.804"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1949-01-01","value":"2277.883"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1950-01-01","value":"2476.097"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1951-01-01","value":"2677.414"}, 
{"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1952-01-01","value":"2786.602"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1953-01-01","value":"2915.598"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1954-01-01","value":"2900.038"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1955-01-01","value":"3107.796"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1956-01-01","value":"3175.622"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1957-01-01","value":"3243.263"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1958-01-01","value":"3215.954"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1959-01-01","value":"3438.007"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1960-01-01","value":"3527.996"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1961-01-01","value":"3620.292"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1962-01-01","value":"3843.844"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1963-01-01","value":"4012.113"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1964-01-01","value":"4243.962"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1965-01-01","value":"4519.102"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1966-01-01","value":"4812.8"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1967-01-01","value":"4944.919"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1968-01-01","value":"5188.802"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1969-01-01","value":"5348.589"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1970-01-01","value":"5358.035"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1971-01-01","value":"5537.202"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1972-01-01","value":"5829.057"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1973-01-01","value":"6170.549"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1974-01-01","value":"6145.506"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1975-01-01","value":"6118.231"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1976-01-01","value":"6454.905"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1977-01-01","value":"6758.055"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1978-01-01","value":"7127.776"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1979-01-01","value":"7375.014"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1980-01-01","value":"7355.39"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1981-01-01","value":"7528.705"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1982-01-01","value":"7397.849"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1983-01-01","value":"7730.794"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1984-01-01","value":"8280.163"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1985-01-01","value":"8598.506"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1986-01-01","value":"8876.436"}, 
{"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1987-01-01","value":"9179.633"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1988-01-01","value":"9569.566"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1989-01-01","value":"9920.542"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1990-01-01","value":"10120.114"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1991-01-01","value":"10100.371"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1992-01-01","value":"10452.604"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1993-01-01","value":"10738.246"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1994-01-01","value":"11155.769"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1995-01-01","value":"11459.835"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1996-01-01","value":"11893.706"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1997-01-01","value":"12408.947"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1998-01-01","value":"12954.457"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"1999-01-01","value":"13583.582"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2000-01-01","value":"14144.962"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2001-01-01","value":"14294.624"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2002-01-01","value":"14529.585"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2003-01-01","value":"14949.293"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2004-01-01","value":"15542.707"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2005-01-01","value":"16075.089"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2006-01-01","value":"16483.539"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2007-01-01","value":"16867.78"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2008-01-01","value":"16940.097"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2009-01-01","value":"16514.062"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2010-01-01","value":"17013.917"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2011-01-01","value":"17306.204"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2012-01-01","value":"17686.281"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2013-01-01","value":"18049.236"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2014-01-01","value":"18499.72"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2015-01-01","value":"19021.225"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2016-01-01","value":"19372.908"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2017-01-01","value":"19905.052"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2018-01-01","value":"20490.925"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2019-01-01","value":"20977.326"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2020-01-01","value":"20451.945"}, {"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2021-01-01","value":"21590.414"}, 
{"realtime_start":"2024-02-03","realtime_end":"2024-02-03","date":"2022-01-01","value":"21992.687"}]} ``` ] --- # What do I need to know? - The base URL: https://api.stlouisfed.org/ - The API endpoint (fred/series/observations/) - The parameters: - series_id="GNPCA" - api_key=YOUR_API_KEY - file_type=json ``` r endpoint = "fred/series/observations" params = list( api_key= "YOUR_FRED_KEY", ## Change to your own key file_type="json", series_id="GNPCA" ) ``` --- # Reading FRED's JSON ``` r fred = httr::GET( url = "https://api.stlouisfed.org/", ## Base URL path = endpoint, ## The API endpoint query = params ## Our parameter list ) %>% httr::content(as="text") %>% jsonlite::fromJSON() ``` - What's in there? .scroll-box-16[ ``` r fred ``` ``` ## $realtime_start ## [1] "2024-09-26" ## ## $realtime_end ## [1] "2024-09-26" ## ## $observation_start ## [1] "1600-01-01" ## ## $observation_end ## [1] "9999-12-31" ## ## $units ## [1] "lin" ## ## $output_type ## [1] 1 ## ## $file_type ## [1] "json" ## ## $order_by ## [1] "observation_date" ## ## $sort_order ## [1] "asc" ## ## $count ## [1] 95 ## ## $offset ## [1] 0 ## ## $limit ## [1] 100000 ## ## $observations ## realtime_start realtime_end date value ## 1 2024-09-26 2024-09-26 1929-01-01 1202.659 ## 2 2024-09-26 2024-09-26 1930-01-01 1100.67 ## 3 2024-09-26 2024-09-26 1931-01-01 1029.038 ## 4 2024-09-26 2024-09-26 1932-01-01 895.802 ## 5 2024-09-26 2024-09-26 1933-01-01 883.847 ## 6 2024-09-26 2024-09-26 1934-01-01 978.188 ## 7 2024-09-26 2024-09-26 1935-01-01 1065.716 ## 8 2024-09-26 2024-09-26 1936-01-01 1201.443 ## 9 2024-09-26 2024-09-26 1937-01-01 1264.393 ## 10 2024-09-26 2024-09-26 1938-01-01 1222.966 ## 11 2024-09-26 2024-09-26 1939-01-01 1320.924 ## 12 2024-09-26 2024-09-26 1940-01-01 1435.656 ## 13 2024-09-26 2024-09-26 1941-01-01 1690.844 ## 14 2024-09-26 2024-09-26 1942-01-01 2008.853 ## 15 2024-09-26 2024-09-26 1943-01-01 2349.125 ## 16 2024-09-26 2024-09-26 1944-01-01 2535.744 ## 17 2024-09-26 2024-09-26 1945-01-01 2509.982 ## 18 2024-09-26 2024-09-26 1946-01-01 2221.51 ## 19 2024-09-26 2024-09-26 1947-01-01 2199.313 ## 20 2024-09-26 2024-09-26 1948-01-01 2291.804 ## 21 2024-09-26 2024-09-26 1949-01-01 2277.883 ## 22 2024-09-26 2024-09-26 1950-01-01 2476.097 ## 23 2024-09-26 2024-09-26 1951-01-01 2677.414 ## 24 2024-09-26 2024-09-26 1952-01-01 2786.602 ## 25 2024-09-26 2024-09-26 1953-01-01 2915.598 ## 26 2024-09-26 2024-09-26 1954-01-01 2900.038 ## 27 2024-09-26 2024-09-26 1955-01-01 3107.796 ## 28 2024-09-26 2024-09-26 1956-01-01 3175.622 ## 29 2024-09-26 2024-09-26 1957-01-01 3243.263 ## 30 2024-09-26 2024-09-26 1958-01-01 3215.954 ## 31 2024-09-26 2024-09-26 1959-01-01 3438.007 ## 32 2024-09-26 2024-09-26 1960-01-01 3527.996 ## 33 2024-09-26 2024-09-26 1961-01-01 3620.292 ## 34 2024-09-26 2024-09-26 1962-01-01 3843.844 ## 35 2024-09-26 2024-09-26 1963-01-01 4012.113 ## 36 2024-09-26 2024-09-26 1964-01-01 4243.962 ## 37 2024-09-26 2024-09-26 1965-01-01 4519.102 ## 38 2024-09-26 2024-09-26 1966-01-01 4812.8 ## 39 2024-09-26 2024-09-26 1967-01-01 4944.919 ## 40 2024-09-26 2024-09-26 1968-01-01 5188.802 ## 41 2024-09-26 2024-09-26 1969-01-01 5348.589 ## 42 2024-09-26 2024-09-26 1970-01-01 5358.035 ## 43 2024-09-26 2024-09-26 1971-01-01 5537.202 ## 44 2024-09-26 2024-09-26 1972-01-01 5829.057 ## 45 2024-09-26 2024-09-26 1973-01-01 6170.549 ## 46 2024-09-26 2024-09-26 1974-01-01 6145.506 ## 47 2024-09-26 2024-09-26 1975-01-01 6118.231 ## 48 2024-09-26 2024-09-26 1976-01-01 6454.905 ## 49 2024-09-26 2024-09-26 1977-01-01 6758.055 ## 50 
2024-09-26 2024-09-26 1978-01-01 7127.776 ## 51 2024-09-26 2024-09-26 1979-01-01 7375.014 ## 52 2024-09-26 2024-09-26 1980-01-01 7355.39 ## 53 2024-09-26 2024-09-26 1981-01-01 7528.705 ## 54 2024-09-26 2024-09-26 1982-01-01 7397.849 ## 55 2024-09-26 2024-09-26 1983-01-01 7730.794 ## 56 2024-09-26 2024-09-26 1984-01-01 8280.163 ## 57 2024-09-26 2024-09-26 1985-01-01 8598.506 ## 58 2024-09-26 2024-09-26 1986-01-01 8876.436 ## 59 2024-09-26 2024-09-26 1987-01-01 9179.633 ## 60 2024-09-26 2024-09-26 1988-01-01 9569.566 ## 61 2024-09-26 2024-09-26 1989-01-01 9920.542 ## 62 2024-09-26 2024-09-26 1990-01-01 10120.114 ## 63 2024-09-26 2024-09-26 1991-01-01 10100.371 ## 64 2024-09-26 2024-09-26 1992-01-01 10452.604 ## 65 2024-09-26 2024-09-26 1993-01-01 10738.246 ## 66 2024-09-26 2024-09-26 1994-01-01 11155.769 ## 67 2024-09-26 2024-09-26 1995-01-01 11459.835 ## 68 2024-09-26 2024-09-26 1996-01-01 11893.706 ## 69 2024-09-26 2024-09-26 1997-01-01 12408.947 ## 70 2024-09-26 2024-09-26 1998-01-01 12954.457 ## 71 2024-09-26 2024-09-26 1999-01-01 13583.582 ## 72 2024-09-26 2024-09-26 2000-01-01 14144.962 ## 73 2024-09-26 2024-09-26 2001-01-01 14294.624 ## 74 2024-09-26 2024-09-26 2002-01-01 14529.585 ## 75 2024-09-26 2024-09-26 2003-01-01 14949.293 ## 76 2024-09-26 2024-09-26 2004-01-01 15542.707 ## 77 2024-09-26 2024-09-26 2005-01-01 16075.089 ## 78 2024-09-26 2024-09-26 2006-01-01 16483.539 ## 79 2024-09-26 2024-09-26 2007-01-01 16867.78 ## 80 2024-09-26 2024-09-26 2008-01-01 16940.097 ## 81 2024-09-26 2024-09-26 2009-01-01 16514.062 ## 82 2024-09-26 2024-09-26 2010-01-01 17013.917 ## 83 2024-09-26 2024-09-26 2011-01-01 17306.204 ## 84 2024-09-26 2024-09-26 2012-01-01 17686.281 ## 85 2024-09-26 2024-09-26 2013-01-01 18049.236 ## 86 2024-09-26 2024-09-26 2014-01-01 18499.72 ## 87 2024-09-26 2024-09-26 2015-01-01 19021.225 ## 88 2024-09-26 2024-09-26 2016-01-01 19372.908 ## 89 2024-09-26 2024-09-26 2017-01-01 19905.052 ## 90 2024-09-26 2024-09-26 2018-01-01 20490.925 ## 91 2024-09-26 2024-09-26 2019-01-01 21000.945 ## 92 2024-09-26 2024-09-26 2020-01-01 20482.341 ## 93 2024-09-26 2024-09-26 2021-01-01 21648.657 ## 94 2024-09-26 2024-09-26 2022-01-01 22176.949 ## 95 2024-09-26 2024-09-26 2023-01-01 22769.38 ``` ] --- # Turn it into data ``` r fred = fred %>% purrr::pluck("observations") %>% ## Extract the "$observations" list element # .$observations %>% ## I could also have used this # magrittr::extract("observations") %>% ## Or this as_tibble() ## Just for nice formatting fred ``` ``` ## # A tibble: 95 × 4 ## realtime_start realtime_end date value ## <chr> <chr> <chr> <chr> ## 1 2024-09-26 2024-09-26 1929-01-01 1202.659 ## 2 2024-09-26 2024-09-26 1930-01-01 1100.67 ## 3 2024-09-26 2024-09-26 1931-01-01 1029.038 ## 4 2024-09-26 2024-09-26 1932-01-01 895.802 ## 5 2024-09-26 2024-09-26 1933-01-01 883.847 ## 6 2024-09-26 2024-09-26 1934-01-01 978.188 ## 7 2024-09-26 2024-09-26 1935-01-01 1065.716 ## 8 2024-09-26 2024-09-26 1936-01-01 1201.443 ## 9 2024-09-26 2024-09-26 1937-01-01 1264.393 ## 10 2024-09-26 2024-09-26 1938-01-01 1222.966 ## # ℹ 85 more rows ``` --- # Clean it up a bit and plot it ``` r # library(lubridate) ## Already loaded above fred = fred %>% mutate(across(realtime_start:date, ymd)) %>% # make all the dates, dates mutate(value = as.numeric(value)) # Make the values numeric head(fred,3) ``` ``` ## # A tibble: 3 × 4 ## realtime_start realtime_end date value ## <date> <date> <date> <dbl> ## 1 2024-09-26 2024-09-26 1929-01-01 1203. ## 2 2024-09-26 2024-09-26 1930-01-01 1101. 
## 3 2024-09-26 2024-09-26 1931-01-01 1029. ``` ### Plot it ``` r ggplot(fred, aes(x=date, y=value)) + # set your ggplot df and aesthetics geom_line() + # what geom? scale_y_continuous(labels = scales::comma) + # Make the scale prettier labs( x="Date", y="2012 USD (Billions)", title="US Real Gross National Product", caption="Source: FRED" ) ``` --- # Plot it <img src="06a-scraping-in-research_files/figure-html/fred6_actual-1.png" style="display: block; margin: auto;" /> --- # Hide your API Key - In general, you don't want to share your API key with anyone - Instead, you can make it an environment variable either for a single session or permanently ``` r Sys.setenv(FRED_API_KEY_TEST="abcdefghijklmnopqrstuvwxyz0123456789") FRED_API_KEY_TEST = Sys.getenv("FRED_API_KEY_TEST") FRED_API_KEY_TEST ``` ``` ## [1] "abcdefghijklmnopqrstuvwxyz0123456789" ``` - You can also permanently add it to your `.Renviron` file by running the `edit_r_environ()` function from the **usethis** package - Then just type in `FRED_API_KEY_TEST=abcdefghijklmnopqrstuvwxyz0123456789`, save, and re-read ``` r usethis::edit_r_environ() # open R environment to edit readRenviron("~/.Renviron") # read the .Renviron file ``` - Any time you need it, use `Sys.getenv("FRED_API_KEY_TEST")` --- # Popular APIs - Many popular APIs are free to use and have a lot of documentation - Sometimes the documentation gets a bit cumbersome though - So kind souls have developed R packages to help you "abstract" these details (**Clean Code**) - For example, the `tidycensus` package is a wrapper for the US Census API - You'll use it on your problem set - Others include: `fredr`, `blsAPI`, `gh`, `googlesheets4`, `googledrive`, `WikipediR`, etc. - Here's a curated list: https://github.com/RomanTsegelskyi/r-api-wrappers --- # Without tidycensus - Sign up for a [Census API key](https://api.census.gov/data/key_signup.html) - Get the [API endpoint you want](https://www.census.gov/data/developers/data-sets.html) - Define other parameters - Series you want, your "get," i.e. B19013_001E is median household income, NAME is the name of the geography, GEOID is a census identifier - Figure out the types of parameters - Name the groups you want, in Census that is the "for" -- e.g. state, county, etc. - Name the area those groups sit in, in Census that is your "in" -- e.g. Maine, Cumberland County, etc. ``` r params_census <- list("key"=Sys.getenv('CENSUS_API_KEY'), ## Our parameter list "get" = "NAME,B19013_001E", "for" = "county:*", "in" = "state:23") ``` ``` r census = httr::GET( url = "https://api.census.gov/", ## Base URL path = "data/2017/acs/acs5", ## The API endpoint query = params_census ) %>% httr::content(as="text") %>% jsonlite::fromJSON() ``` --- # Census API differs from FRED - Hey, wait -- that output has a different structure than the FRED output did - So you need a different process to turn it into a data table!
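- A quick base-R sanity check of what `fromJSON()` gave us back this time (a minimal sketch; `census` is the object from the previous slide):

``` r
class(census) ## a character matrix, not a list with an "observations" data frame
```

```
## [1] "matrix" "array"
```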
``` r print(census) ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] "NAME" "B19013_001E" "state" "county" ## [2,] "Oxford County, Maine" "44582" "23" "017" ## [3,] "Waldo County, Maine" "50162" "23" "027" ## [4,] "Penobscot County, Maine" "47886" "23" "019" ## [5,] "Piscataquis County, Maine" "38797" "23" "021" ## [6,] "Androscoggin County, Maine" "49538" "23" "001" ## [7,] "Aroostook County, Maine" "39021" "23" "003" ## [8,] "Washington County, Maine" "40328" "23" "029" ## [9,] "Cumberland County, Maine" "65702" "23" "005" ## [10,] "Knox County, Maine" "53117" "23" "013" ## [11,] "Sagadahoc County, Maine" "60457" "23" "023" ## [12,] "York County, Maine" "62618" "23" "031" ## [13,] "Kennebec County, Maine" "50116" "23" "011" ## [14,] "Franklin County, Maine" "45541" "23" "007" ## [15,] "Somerset County, Maine" "41549" "23" "025" ## [16,] "Hancock County, Maine" "51438" "23" "009" ## [17,] "Lincoln County, Maine" "54041" "23" "015" ``` --- # For completeness: the janitor package - Oh shoot, I don't have a GEOID (FIPS code) for the counties! ``` r # library(tidyverse) # library(janitor) census %>% as_tibble() %>% row_to_names(row_number=1) ``` ``` ## # A tibble: 16 × 4 ## NAME B19013_001E state county ## <chr> <chr> <chr> <chr> ## 1 Oxford County, Maine 44582 23 017 ## 2 Waldo County, Maine 50162 23 027 ## 3 Penobscot County, Maine 47886 23 019 ## 4 Piscataquis County, Maine 38797 23 021 ## 5 Androscoggin County, Maine 49538 23 001 ## 6 Aroostook County, Maine 39021 23 003 ## 7 Washington County, Maine 40328 23 029 ## 8 Cumberland County, Maine 65702 23 005 ## 9 Knox County, Maine 53117 23 013 ## 10 Sagadahoc County, Maine 60457 23 023 ## 11 York County, Maine 62618 23 031 ## 12 Kennebec County, Maine 50116 23 011 ## 13 Franklin County, Maine 45541 23 007 ## 14 Somerset County, Maine 41549 23 025 ## 15 Hancock County, Maine 51438 23 009 ## 16 Lincoln County, Maine 54041 23 015 ``` --- # Tidycensus - Tidycensus embraces the **abstraction** principle of clean code ``` r #library(tidycensus) # Already loaded census_api_key("YOUR API KEY GOES HERE") # type this once and do not share your key ``` ``` r get_acs(geography = "county", state="ME", variables = "B19013_001E", # Median household income year = 2017, show_call = TRUE, # Show the API call survey='acs5') ``` ``` ## # A tibble: 16 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 23001 Androscoggin County, Maine B19013_001 49538 1293 ## 2 23003 Aroostook County, Maine B19013_001 39021 1177 ## 3 23005 Cumberland County, Maine B19013_001 65702 1115 ## 4 23007 Franklin County, Maine B19013_001 45541 2739 ## 5 23009 Hancock County, Maine B19013_001 51438 1931 ## 6 23011 Kennebec County, Maine B19013_001 50116 1664 ## 7 23013 Knox County, Maine B19013_001 53117 2506 ## 8 23015 Lincoln County, Maine B19013_001 54041 2895 ## 9 23017 Oxford County, Maine B19013_001 44582 1758 ## 10 23019 Penobscot County, Maine B19013_001 47886 1189 ## 11 23021 Piscataquis County, Maine B19013_001 38797 2314 ## 12 23023 Sagadahoc County, Maine B19013_001 60457 2953 ## 13 23025 Somerset County, Maine B19013_001 41549 1522 ## 14 23027 Waldo County, Maine B19013_001 50162 2047 ## 15 23029 Washington County, Maine B19013_001 40328 1454 ## 16 23031 York County, Maine B19013_001 62618 1559 ``` --- # Notes on Tidycensus - You still need to go find a series ID (ask an AI or the Census documentation) - Census API docs organized by year and survey: https://api.census.gov/data/YEAR/SURVEY/SUBSURVEY/variables.html - **tidycensus**' `load_variables(YYYY, "sub-survey")` syntax will
help you find the variables you need - `show_call=TRUE` will show you the API call that was made -- learn by doing - The argument `geometry=TRUE` returns the polygons needed to map these! (Hint hint hint) - `tidycensus` is a great example of how to abstract the details of an API --- # Hidden APIs - Sometimes the API endpoint is hidden from view - But you can find it by using the "Inspect" tool in your browser - It will require some detective work! - But if you pull it off, you can get data that no one else has --- name: server-side-scraping # Server-side scraping - The scripts that "build" the website are not run on our computer, but rather on a host server that sends down all of the HTML code. - E.g. Wikipedia tables are already populated with all of the information — numbers, dates, etc. — that we see in our browser. - In other words, the information that we see in our browser has already been processed by the host server. - You can think of this information being embedded directly in the webpage's HTML. - So if we can get our hands on the HTML, we can get our hands on the data. - We just have to figure out how to strip off the HTML and get the data into a tidy format. - **Webscraping challenges:** Finding the correct CSS (or Xpath) "selectors". Iterating through dynamic webpages (e.g. "Next page" and "Show More" tabs). - **Key concepts:** CSS, Xpath, HTML - **R package**: `rvest` has a suite of functions to help convert HTML to a tidy format --- # Underneath Wikipedia <div align="center"> <img src="pics/wikipedia_webpage.png" height=400> </div> --- # The HTML source - If we can just cut out all the HTML and get the data into a tidy format, we're golden - Better yet, we can use some of the HTML to help us find and ha**rvest** the data we want ```html <caption>List of men's Olympic records in athletics </caption> <tbody><tr> <th scope="col" width="12%">Event </th> <th class="unsortable" width="5%">Record </th> <th scope="col" width="10%">Athlete(s) </th> <th scope="col" width="15%">Nation </th> <th scope="col" width="10%">Games </th> <th scope="col" width="5%">Date </th> <th scope="col" class="unsortable" width="3%">Ref(s) </th></tr> <tr> <th scope="row"><span data-sort-value="00100 !"><a href="/wiki/100_metres" title="100 metres">100 metres</a></span> </th> <td align="right">9.63  </td> <td><span data-sort-value="Bolt, Usain"><span class="vcard"><span class="fn"><a href="/wiki/Usain_Bolt" title="Usain Bolt">Usain Bolt</a></span></span></span> </td> <td><span class="mw-image-border" typeof="mw:File"><span><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Flag_of_Jamaica.svg/22px-Flag_of_Jamaica.svg.png" decoding="async" width="22" height="11" class="mw-file-element" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Flag_of_Jamaica.svg/33px-Flag_of_Jamaica.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Flag_of_Jamaica.svg/44px-Flag_of_Jamaica.svg.png 2x" data-file-width="1200" data-file-height="600" /></span></span> <a href="/wiki/Jamaica_at_the_2012_Summer_Olympics" title="Jamaica at the 2012 Summer Olympics">Jamaica</a> <span style="font-size:90%;">(JAM)</span> </td> <td><span data-sort-value="2012 !"><a href="/wiki/Athletics_at_the_2012_Summer_Olympics_%E2%80%93_Men%27s_100_metres" title="Athletics at the 2012 Summer Olympics – Men's 100 metres">2012 London</a></span> </td> <td><span data-sort-value="000000002012-08-05-0000" style="white-space:nowrap">August 5, 2012</span> </td> <td align="center"><sup id="cite_ref-9" class="reference"><a
href="#cite_note-9">[9]</a></sup> </td></tr> ``` --- # Inspect elements and `rvest` - [SelectorGadget](https://selectorgadget.com/) is a Chrome extension that helps you find the CSS selectors you need - It will highlight the elements you want to scrape and give you the CSS selector - You can then use this selector in the `html_elements()` function to pick out those elements from the HTML - In R, we can use the `rvest` package to read the HTML document into R and then parse the relevant nodes. - A typical workflow is: `read_html(URL) %>% html_elements(CSS_SELECTORS) %>% html_table()`. - You might need other functions depending on the content type (e.g. `html_text()`). --- # Inspect elements gif  --- # Scraping Wikipedia - The hard part is getting the CSS selector. After that the code is pretty simple - You will just need to use some other packages like `janitor`, `dplyr`, and `tidyr` to clean the table up a bit for use ``` r read_html("http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression") %>% html_element("div+ .wikitable :nth-child(1)") %>% ## select table element html_table() %>% ## convert to data frame head(5) ``` ``` ## # A tibble: 5 × 5 ## Time Athlete Nationality `Location of races` Date ## <dbl> <chr> <chr> <chr> <chr> ## 1 10.8 Luther Cary United States Paris, France July 4, 1891 ## 2 10.8 Cecil Lee United Kingdom Brussels, Belgium September 25, 1892 ## 3 10.8 Étienne De Ré Belgium Brussels, Belgium August 4, 1893 ## 4 10.8 L. Atcherley United Kingdom Frankfurt/Main, Germany April 13, 1895 ## 5 10.8 Harry Beaton United Kingdom Rotterdam, Netherlands August 28, 1895 ``` --- # Stability and CSS scraping - Websites change over time - That can break your scraping code - This makes scraping as much of an "art" as it is a science --- # Wayback Machine: Internet Archive - If you go to several federal government websites (e.g. https://www.usaid.gov/), you'll see a blank page or a memo that says the page has been taken down. - That data is no longer visible to the public. But you may need it for research. - The [Wayback Machine](https://archive.org/web/) is a digital archive of the World Wide Web and other information on the Internet - It allows users to go "back in time" and see how websites looked in the past - It has archived over 500 billion web pages - Sometimes scraping it is tricky though, so be patient!
- There are packages like `archiveRetriever` that help leverage its API --- # Wayback Machine with USAID  --- class: inverse, center, middle name: ethics-of-web-scraping # Ethics of web scraping <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Legality of web scraping - All of today is about how to get data off the web - If you can see it in a browser window and work out its structure, you can scrape it - And the legal restrictions are pretty obscure, fuzzy, and ripe for reform - The hiQ Labs vs LinkedIn appeals court ruling defended hiQ's right to scrape public profiles; the Supreme Court then vacated the ruling, and the final decision went against hiQ Labs - Courts have read the Computer Fraud and Abuse Act (CFAA) as permitting the scraping of publicly available data - Legality gets messy around personal data and intellectual property (for good reason, but again reform is needed) --- # Ethics of web scraping - Technically, web scraping just automates what you (or a team of **well**-compensated RAs) could do manually - It's just a lot faster and more efficient (no offense) - Webscraping is an integral tool to modern investigative journalism - Sometimes companies hide things in their HTML that they don't want the public to see - ProPublica has developed a tool called [Upton](https://www.propublica.org/nerds/upton-a-web-scraping-framework) to make it more accessible - So I stand firmly on the pro-scraping side with a few ethical caveats - Just because you can scrape it, doesn't mean you should - It's pretty easy to write up a function or program that can overwhelm a host server or application through the sheer weight of requests - Or, just as likely, the host server has built-in safeguards that will block you in case of a suspected malicious denial-of-service (DoS) attack --- # Be nice - Once you get over the initial hurdles, scraping is fairly easy to do (cleaning can be trickier) - There's plenty of digital ink spilled on the [ethics of web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) - The key takeaway is to be nice - If a public API exists, use it instead of scraping - Only take the data that is necessary - Have good reason to take data that is not intentionally public - Do not repeatedly swarm a server with requests (use `Sys.sleep()` to space out requests) - Scrape to add value to the data, not to take value from the host server - Properly cite any scraped content and respect the terms of service of the website - Document the steps taken to scrape the data --- # **polite** package and `robots.txt` - Sites often have a "robots.txt," which is a file that tells you what you can and cannot scrape - A "web crawler" should be written to start with the robots.txt and then follow the rules - The `polite` package is a tool to help you be nice - It explicitly checks for permissions and goes to the robots.txt of any site you visit - As you get better at scraping and start trying to scrape at scale, you should use this (a minimal sketch is in the appendix at the end of these slides) --- # Conclusion - Web content can be rendered either 1) server-side or 2) client-side. - Client-side content is often rendered using an API endpoint, which is a URL that you can use to access the data directly. - APIs are a set of rules/methods that allow one software application to interact with another; they often require an access token - You can use R packages (**httr**, **xml2**, **jsonlite**) to access these endpoints and tidy the data.
- Popular APIs have packages in R or other software that streamline access - Server-side content is often rendered using HTML and CSS. - Use the **rvest** package to read the HTML document into R and then parse the relevant nodes. - A typical workflow is: `read_html(URL) %>% html_elements(CSS_SELECTORS) %>% html_table()`. - You might need other functions depending on the content type (e.g. `html_text()`). - Just because you can scrape something doesn't mean you should (i.e. ethical and possibly legal considerations). - Webscraping involves as much art as it does science. Be prepared to do a lot of experimenting and data cleaning. --- class: inverse, center, middle # Next: Onto scraping and API activities! <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>
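---

# Appendix: **polite** in practice

- A minimal sketch of the **polite** workflow from the ethics slides (the Wikipedia URL reuses the earlier scraping example; treat the user agent string as a placeholder to fill in yourself):

``` r
library(polite)
library(rvest)

## bow() reads the site's robots.txt, checks we're allowed to scrape, and sets a crawl delay
session = bow(
  "http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression",
  user_agent = "EC368 class example (your_email@bates.edu)" ## identify yourself
)

## scrape() fetches the page politely (rate-limited, permission-checked);
## the rvest parsing steps are the same as before
scrape(session) %>%
  html_element("div+ .wikitable :nth-child(1)") %>%
  html_table() %>%
  head(5)
```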