Today we’ll be using SelectorGadget, which is a Chrome extension that makes it easy to discover CSS selectors. (Install the extension directly here.) Please note that SelectorGadget is only available for Chrome. If you prefer using Firefox, then you can try ScrapeMate. The student presentation will go into more depth.
Recall that rvest was automatically installed with the rest of the tidyverse. These lecture notes assume that you have installed at least rvest 1.0.0, which install.packages("tidyverse") should give you.
Most of these notes are taken from work by Grant McDermott and José Roberto Ayala Solares.
The next two lectures are about getting data, or “content”, off the web and onto our computers. We’re all used to seeing this content in our browsers (Chrome, Firefox, etc.), so we know that it must exist somewhere. However, it’s important to realize that there are actually two ways that web content gets rendered in a browser: server-side and client-side. You can read here for more details (including example scripts), but for our purposes the essential distinction is that server-side content arrives pre-rendered in the HTML document that the host server sends to your browser, whereas client-side content is assembled by your browser itself through separate requests (e.g. to an API).
Over the next two lectures, we’ll go over the main differences between the two approaches and cover the implications for any webscraping activity. I want to forewarn you that webscraping typically involves a fair bit of detective work. You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another (or may even stop working on the same website a few months later). All this is to say that webscraping involves as much art as it does science.
The good news is that both server-side and client-side websites allow for webscraping.1 If you can see it in your browser, you can scrape it.
The previous sentence elides some important ethical/legal considerations: just because you can scrape it, doesn’t mean you should. This paragraph used to contain a warning about the legal restrictions pertaining to webscraping activity. I’ve decided to drop those in the wake of the landmark hiQ Labs vs LinkedIn court ruling. (Short version: it is currently legal to scrape data from the web using automated tools, as long as the data are publicly available.) However, it’s still important to realize that the tools we’ll be using over these next two lectures are very powerful. A computer can process commands much, much faster than we can ever type them up manually. It’s pretty easy to write up a function or program that can overwhelm a host server or application through the sheer weight of requests. Or, just as likely, the host server has built-in safeguards that will block you in case of a suspected malicious attack. We’ll return to the “be nice” mantra at the end of this lecture, as well as in the next lecture.
One way to stay ethical is to use the polite package. I’ll show you how to use this package alongside rvest below when we start scraping the NYT website. Essentially, polite checks a website’s robots.txt file to see whether it is okay to scrape, and if so, it automatically adds a delay between each request that you make. This is a good way to avoid overwhelming a host server. A robots.txt file is hosted on a website and tells you what you can and cannot scrape. You can find it for any website by typing the URL of the website followed by “/robots.txt”. For example, here is the robots.txt file for the New York Times and the robots.txt for Wikipedia. Technically these are not legally binding, but they give guidance on how to proceed. Here is a deeper dive into robots.txt files and how they are used to help regulate traffic from search engines like Google.
If you plan to scrape a site over and over (likely if you are collecting big data), please use the polite package alongside rvest. Nothing we do today will come close to causing problems, but it is good practice to be nice.
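As a quick sketch of how this looks in practice (the user agent string below is just a placeholder; use something that identifies you):

# library(polite)
session = bow(
  "https://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression",
  user_agent = "my-course-scraper"   ## placeholder: identify yourself to the host
  )
session             ## prints the crawl delay and whether robots.txt permits scraping this path
# scrape(session)   ## politely fetch the page, respecting that delay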
Webscraping is extremely tricky, and your code may stop working after just a small change to the underlying HTML. Organizations are constantly rewriting their webpages, which can break webscraping code, whether intentionally or not. This is why it is important to be able to read HTML and CSS: you can often find the information you want by looking at them directly. For example, these lecture notes used to feature a Craigslist example, but that company has made it far harder to scrape since 2021, so I need to find a new scraping example. (NYT below.)
The primary R package that we’ll be using today is rvest (link), a simple webscraping library inspired by Python’s Beautiful Soup (link), but with extra tidyverse functionality. rvest is designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors… Which means that now is probably a good time for us to cover what these are.
When your web browser displays an HTML (Hypertext Markup Language) document (i.e. a webpage), it merges the content (words, pictures, data, videos) with style information to make it look good. The style information is specified using a language called Cascading Style Sheets (CSS), which provides a set of display rules. Here’s how that works: the browser first parses the HTML into a tree of nodes, known as the DOM (Document Object Model) tree. Each CSS rule targets parts of that tree via a selector, which can refer to an element, a class, an ID, etc. The browser then applies the matching rules to the DOM tree to determine how the document should be displayed, producing what is called a render tree.

In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers a set of display rules, which are formed by properties (how the selected content should look, e.g. its colour or font size) and selectors (which parts of the page each rule applies to).
The key point is that if you can identify the CSS selector(s) of the data you want, then you can isolate those data from the rest of the webpage content that you don’t want. This is where SelectorGadget comes in. We’ll work through an extended example (with a twist!) below, but I highly recommend looking over this quick vignette before proceeding.
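To make the element/class/ID distinction concrete, here’s a tiny self-contained sketch using rvest’s minimal_html() helper (the HTML snippet and the class/ID names are invented purely for illustration):

# library(rvest) ## Already loaded
html = minimal_html('
  <p id="intro">Hello</p>
  <p class="record">10.8 seconds</p>
  <p class="record">10.6 seconds</p>
')
html %>% html_elements("p")        ## element selector: all three paragraphs
html %>% html_elements(".record")  ## class selector: the two "record" paragraphs
html %>% html_elements("#intro")   ## ID selector: just the intro paragraph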
Okay, let’s get to an application. Say that we want to scrape the Wikipedia page on the Men’s 100 metres world record progression.
First, open up this page in your browser. Take a look at its structure: What type of objects does it contain? How many tables does it have? Do these tables all share the same columns? What row- and columns-spans? Etc.
Once you’ve familiarised yourself with the structure, read the whole
page into R using the rvest::read_html()
function, which
returns an XML document
that contains all the information about the web page.
# library(rvest) ## Already loaded
m100 = read_html("http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression")
m100
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...
As you can see, this is an XML document2 that contains everything needed to render the Wikipedia page. It’s kind of like viewing someone’s entire LaTeX document (preamble, syntax, etc.) when all we want are the data from some tables in their paper.
Let’s start by scraping the first table on the page, which documents the unofficial progression before the IAAF. The first thing we need to do is identify the table’s unique CSS selector. Here’s a GIF of me using SelectorGadget to do that.
As you can see, working through this iterative process yields
“div+ .wikitable :nth-child(1)”. We can now use this unique CSS
selector to isolate the pre-IAAF table content from the rest of the HTML
document. The core rvest function that we’ll use to
extract the table content is html_element()
, before piping
it on to html_table()
to parse the HTML table into an R
data frame. Notice that I am using the %>%
pipe-operator
from the magrittr package,
which can help to express complex operations as elegant pipelines
composed of simple, easily understood pieces.
pre_iaaf =
m100 %>%
html_element("div+ .wikitable :nth-child(1)") %>% ## select table element
html_table() ## convert to data frame
pre_iaaf
## # A tibble: 21 × 5
## Time Athlete Nationality `Location of races` Date
## <dbl> <chr> <chr> <chr> <chr>
## 1 10.8 Luther Cary United States Paris, France July 4, 1…
## 2 10.8 Cecil Lee United Kingdom Brussels, Belgium September…
## 3 10.8 Étienne De Ré Belgium Brussels, Belgium August 4,…
## 4 10.8 L. Atcherley United Kingdom Frankfurt/Main, Germany April 13,…
## 5 10.8 Harry Beaton United Kingdom Rotterdam, Netherlands August 28…
## 6 10.8 Harald Anderson-Arbin Sweden Helsingborg, Sweden August 9,…
## 7 10.8 Isaac Westergren Sweden Gävle, Sweden September…
## 8 10.8 Isaac Westergren Sweden Gävle, Sweden September…
## 9 10.8 Frank Jarvis United States Paris, France July 14, …
## 10 10.8 Walter Tewksbury United States Paris, France July 14, …
## # ℹ 11 more rows
Great, it worked!
I’ll tidy things up a bit so that the data frame is easier to work
with in R. First, I’ll use the janitor::clean_names()
convenience function to remove spaces and capital letters from the
column names. (Q: How else could we have done this?) Second, I’ll use
the lubridate::mdy()
function to convert the date string to
a format that R actually understands.
# library(janitor) ## Already loaded
# library(lubridate) ## Already loaded
pre_iaaf =
pre_iaaf %>%
clean_names() %>% ## fix the column names
mutate(date = mdy(date)) ## convert string to date format
pre_iaaf
## # A tibble: 21 × 5
## time athlete nationality location_of_races date
## <dbl> <chr> <chr> <chr> <date>
## 1 10.8 Luther Cary United States Paris, France 1891-07-04
## 2 10.8 Cecil Lee United Kingdom Brussels, Belgium 1892-09-25
## 3 10.8 Étienne De Ré Belgium Brussels, Belgium 1893-08-04
## 4 10.8 L. Atcherley United Kingdom Frankfurt/Main, Germany 1895-04-13
## 5 10.8 Harry Beaton United Kingdom Rotterdam, Netherlands 1895-08-28
## 6 10.8 Harald Anderson-Arbin Sweden Helsingborg, Sweden 1896-08-09
## 7 10.8 Isaac Westergren Sweden Gävle, Sweden 1898-09-11
## 8 10.8 Isaac Westergren Sweden Gävle, Sweden 1899-09-10
## 9 10.8 Frank Jarvis United States Paris, France 1900-07-14
## 10 10.8 Walter Tewksbury United States Paris, France 1900-07-14
## # ℹ 11 more rows
Now that we have our cleaned pre-IAAF data frame, we could easily plot it. I’m going to hold off doing that until we’ve scraped the rest of the WR data. But first, an aside on browser inspection tools.
SelectorGadget is a great tool. But it isn’t available on all browsers and can involve more work than I’d like sometimes, with all that iterative clicking.3 I therefore wanted to mention an alternative (and very precise) approach to obtaining CSS selectors: Use the “inspect web element” feature of your browser.
Here’s a quick example using Google Chrome. First, I open up the inspect console (Ctrl+Shift+I, or right-click and choose “Inspect”). I then proceed to scroll over the source elements, until Chrome highlights the table of interest on the actual page. Once the table (or other element of interest) is highlighted, I can grab its CSS by right-clicking and selecting Copy -> Copy selector.
In general, I prefer to obtain CSS selectors using this “inspect” method with my browser. But each to their own.
Your turn: Download the next two tables from the same WR100m page. Combine these two new tables with the one above into a single data frame and then plot the record progression. Answer below. (No peeking until you have tried first.)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Let’s start with the second table. Note: SelectorGadget may give you a slightly different CSS selector (“h3+ .wikitable :nth-child(1)”) than the one I use below. That’s fine. I pulled my selector from the browser inspection tool.
# Refresh m100: the object from the earlier code chunk isn't carried over when this chunk is run on its own within R Markdown.
m100 = read_html("http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression")
iaaf_76 =
m100 %>%
html_element("#mw-content-text > div.mw-parser-output > table:nth-child(17)") %>%
html_table()
As we did with the previous table, let’s fix the column names and coerce the date string to a format that R understands.
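Here’s a minimal sketch of that step, mirroring what we did for the pre-IAAF table:

iaaf_76 =
  iaaf_76 %>%
  clean_names() %>%         ## fix the column names
  mutate(date = mdy(date))  ## convert string to date format
iaaf_76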
## # A tibble: 54 × 8
## time wind auto athlete nationality location_of_race date ref
## <dbl> <chr> <dbl> <chr> <chr> <chr> <date> <chr>
## 1 10.6 "" NA Donald Lippi… United Sta… Stockholm, Swed… 1912-07-06 [2]
## 2 10.6 "" NA Jackson Scho… United Sta… Stockholm, Swed… 1920-09-16 [2]
## 3 10.4 "" NA Charley Padd… United Sta… Redlands, USA 1921-04-23 [2]
## 4 10.4 "0.0" NA Eddie Tolan United Sta… Stockholm, Swed… 1929-08-08 [2]
## 5 10.4 "" NA Eddie Tolan United Sta… Copenhagen, Den… 1929-08-25 [2]
## 6 10.3 "" NA Percy Willia… Canada Toronto, Canada 1930-08-09 [2]
## 7 10.3 "0.4" 10.4 Eddie Tolan United Sta… Los Angeles, USA 1932-08-01 [2]
## 8 10.3 "" NA Ralph Metcal… United Sta… Budapest, Hunga… 1933-08-12 [2]
## 9 10.3 "" NA Eulace Peaco… United Sta… Oslo, Norway 1934-08-06 [2]
## 10 10.3 "" NA Chris Berger Netherlands Amsterdam, Neth… 1934-08-26 [2]
## # ℹ 44 more rows
For the final table, I’ll just run the code all at once. By now you should recognise all of the commands. Again, SelectorGadget might tell you it is “p+ .wikitable :nth-child(1)” instead. I pulled my selector from the browser inspection tool.
iaaf =
m100 %>%
html_element(".wikitable:nth-child(23) :nth-child(1)") %>%
html_table() %>%
clean_names() %>%
mutate(date = mdy(date))
iaaf
## # A tibble: 24 × 9
## time wind auto athlete nationality location_of_race date
## <dbl> <chr> <dbl> <chr> <chr> <chr> <date>
## 1 10.1 1.3 NA Bob Hayes United States Tokyo, Japan 1964-10-15
## 2 10.0 0.8 NA Jim Hines United States Sacramento, USA 1968-06-20
## 3 10.0 2.0 NA Charles Greene United States Mexico City, Mexico 1968-10-13
## 4 9.95 0.3 NA Jim Hines United States Mexico City, Mexico 1968-10-14
## 5 9.93 1.4 NA Calvin Smith United States Colorado Springs, … 1983-07-03
## 6 9.83 1.0 NA Ben Johnson Canada Rome, Italy 1987-08-30
## 7 9.93 1.0 NA Carl Lewis United States Rome, Italy 1987-08-30
## 8 9.93 1.1 NA Carl Lewis United States Zürich, Switzerland 1988-08-17
## 9 9.79 1.1 NA Ben Johnson Canada Seoul, South Korea 1988-09-24
## 10 9.92 1.1 NA Carl Lewis United States Seoul, South Korea 1988-09-24
## # ℹ 14 more rows
## # ℹ 2 more variables: notes_note_2 <chr>, duration_of_record <chr>
Let’s combine our three separate tables into a single data frame.
I’ll use base R’s rbind()
to bind by row and include only
the variables that are common to all of the three data frames. For good
measure, I’ll also add an extra column describing which era each record
was recorded under.
wr100 =
rbind(
pre_iaaf %>% select(time, athlete, nationality, date) %>% mutate(era = "Pre-IAAF"),
iaaf_76 %>% select(time, athlete, nationality, date) %>% mutate(era = "Pre-automatic"),
iaaf %>% select(time, athlete, nationality, date) %>% mutate(era = "Modern")
)
wr100
## # A tibble: 99 × 5
## time athlete nationality date era
## <dbl> <chr> <chr> <date> <chr>
## 1 10.8 Luther Cary United States 1891-07-04 Pre-IAAF
## 2 10.8 Cecil Lee United Kingdom 1892-09-25 Pre-IAAF
## 3 10.8 Étienne De Ré Belgium 1893-08-04 Pre-IAAF
## 4 10.8 L. Atcherley United Kingdom 1895-04-13 Pre-IAAF
## 5 10.8 Harry Beaton United Kingdom 1895-08-28 Pre-IAAF
## 6 10.8 Harald Anderson-Arbin Sweden 1896-08-09 Pre-IAAF
## 7 10.8 Isaac Westergren Sweden 1898-09-11 Pre-IAAF
## 8 10.8 Isaac Westergren Sweden 1899-09-10 Pre-IAAF
## 9 10.8 Frank Jarvis United States 1900-07-14 Pre-IAAF
## 10 10.8 Walter Tewksbury United States 1900-07-14 Pre-IAAF
## # ℹ 89 more rows
All that hard work deserves a nice plot, don’t you think?
wr100 %>%
ggplot(aes(x=date, y=time, col=fct_reorder2(era, date, time))) +
geom_point(alpha = 0.7) +
labs(
title = "Men's 100m world record progression",
x = "Date", y = "Time",
caption = "Source: Wikipedia") +
theme(legend.title = element_blank()) ## Switch off legend title
There are several features of the previous Wikipedia example that
make it a good introductory application. Most notably, the HTML table
format provides a regular structure that is easily coercible into a data
frame (via html_table()
). Oftentimes, however, the
information that we want to scrape off the web doesn’t have this nice
regular structure. For this next example, then, I’m going to walk you
through a slightly more messy application: Scraping items from The
New York Times. The following is adapted from an R notebook of a
Medium blog
post from August 2, 2017 by José Roberto Ayala Solares, who adapted
it from work by Kevin
Markham. The post walks through a tutorial to scrape the Trump
Lies article from the NYT. I’ve made some minor modifications to the
code and added some extra commentary.
For a nice description of the article that we’ll be working with, look at Kevin’s tutorial. In summary, the data that we are interested in consist of a record of lies, each with 4 parts: the date of the statement, the lie itself, a brief explanation of why it was a lie, and the URL of an article that supports the explanation.
I will show you how to read the page with both rvest and polite here. Note that the output is the same, but the polite version works behind the scenes to ensure that you are not overwhelming the host server.
# library(polite)
nyt_session = bow("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html") #start your session politely
webpage <- scrape(nyt_session) # politely scrape
webpage
## {html_document}
## <html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="https://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !imp ...
webpage_html <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
webpage_html
## {html_document}
## <html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="https://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !imp ...
Let’s use SelectorGadget to figure out what the records are called. Doing so, we see that every record has the following structure in the HTML code:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Therefore, to collect all the lies, we need to identify all the
<span>
tags that belong to
class="short-desc"
. The function that will help us to do so
is html_elements(). (Note: In the original article, José uses html_nodes(); nodes are more generic than elements.) This function requires the XML document that we have read and a selector identifying the nodes that we want. For the latter, we’re encouraged to use SelectorGadget, an open source tool that makes CSS selector generation and discovery easy. Using such a tool, we find that all the lies can be selected with the selector ".short-desc".
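In code (this is the same call that appears in the full script at the end of these notes):

results <- webpage %>% html_elements(".short-desc")
results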
## {xml_nodeset (180)}
## [1] <span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Ira ...
## [2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time m ...
## [3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...
## [4] <span class="short-desc"><strong>Jan. 25 </strong>“Now, the audience was ...
## [5] <span class="short-desc"><strong>Jan. 25 </strong>“Take a look at the Pe ...
## [6] <span class="short-desc"><strong>Jan. 25 </strong>“You had millions of p ...
## [7] <span class="short-desc"><strong>Jan. 25 </strong>“So, look, when Presid ...
## [8] <span class="short-desc"><strong>Jan. 26 </strong>“We've taken in tens o ...
## [9] <span class="short-desc"><strong>Jan. 26 </strong>“I cut off hundreds of ...
## [10] <span class="short-desc"><strong>Jan. 28 </strong>“The coverage about me ...
## [11] <span class="short-desc"><strong>Jan. 29 </strong>“The Cuban-Americans, ...
## [12] <span class="short-desc"><strong>Jan. 30 </strong>“Only 109 people out o ...
## [13] <span class="short-desc"><strong>Feb. 3 </strong>“Professional anarchist ...
## [14] <span class="short-desc"><strong>Feb. 4 </strong>“After being forced to ...
## [15] <span class="short-desc"><strong>Feb. 5 </strong>“We had 109 people out ...
## [16] <span class="short-desc"><strong>Feb. 6 </strong>“I have already saved m ...
## [17] <span class="short-desc"><strong>Feb. 6 </strong>“It's gotten to a point ...
## [18] <span class="short-desc"><strong>Feb. 6 </strong>“The failing @nytimes w ...
## [19] <span class="short-desc"><strong>Feb. 6 </strong>“And the previous admin ...
## [20] <span class="short-desc"><strong>Feb. 7 </strong>“And yet the murder rat ...
## ...
This returns a list with 180 XML nodes that contain the information for each of the 180 lies in the web page.
You might be tempted to just immediately use the html_table() function again. But if we try that, what happens? We don’t get a table. We get 180 empty tibbles, because the individual records aren’t HTML tables; we have to extract and structure the pieces ourselves.
Let’s start simple and focus on extracting all the necessary details from the first lie. We can then extend this to all the others easily. Remember that the general structure for a single record is:
<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>
Notice that the date is embedded within the
<strong>
tag. To select it, we can use the
html_elements()
or html_nodes()
function using
the selector "strong"
.
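The later code assumes an object called first_result holding the first record, so a minimal sketch of this step is:

first_result <- results[1]                 ## the first of the 180 records
first_result %>% html_elements("strong")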
## {xml_nodeset (1)}
## [1] <strong>Jan. 21 </strong>
We then need to use the html_text()
function to extract
only the text, with the trim argument active to trim leading and
trailing spaces. Finally, we make use of the stringr package to add the year
to the extracted date.
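Something along these lines (the same str_c() call is used in the full script below):

# library(stringr) ## Already loaded
date <- first_result %>%
  html_elements("strong") %>%
  html_text(trim = TRUE) %>%
  str_c(", 2017")  ## append the year, which isn't part of the page text
date
## [1] "Jan. 21, 2017"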
To select the lie, we need to make use of the
xml_contents()
function that is part of the xml2 package
(this package is required by the rvest package, so it is not necessary
to load it). The function returns a list with the nodes that are part of
first_result
.
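The call itself is simply:

xml_contents(first_result)  ## the nodes making up the first record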
## {xml_nodeset (3)}
## [1] <strong>Jan. 21 </strong>
## [2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
## [3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczyns ...
We are interested in the lie, which is the text of the second node.
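Grabbing the second node and extracting its text:

lie <- xml_contents(first_result)[2] %>% html_text(trim = TRUE)
lie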
## [1] "“I wasn't a fan of Iraq. I didn't want to go into Iraq.”"
Notice that there is an extra pair of quotes (“…”) surrounding the
lie. To get rid of them, we simply use the str_sub()
function from the stringr package to select just the lie.
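Applied to our lie:

lie <- str_sub(lie, 2, -2)  ## drop the opening and closing quotation marks
lie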
## [1] "I wasn't a fan of Iraq. I didn't want to go into Iraq."
Hopefully by now it shouldn’t be too complicated to see that to extract the explanation we simply need to select the text within the <span> tag that belongs to the class "short-truth" (i.e. the selector ".short-truth"). This will extract the text together with the opening and closing quotation marks, but we can easily get rid of them.
explanation <- first_result %>% html_element(".short-truth") %>% html_text(trim = TRUE)
str_sub(explanation, 2, -2)
## [1] "He was for an invasion before he was against it."
Finally, to get the URL, notice that this is an attribute within the
<a>
tag. We simply select this element with the
html_element()
function, and then select the href attribute
with the html_attr()
function.
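Putting that together (the same pattern appears in the full script below):

url <- first_result %>% html_element("a") %>% html_attr("href")
url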
## [1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"
We found a way to extract each of the 4 parts of the first record. We
can extend this process to all the rest using a for loop. In the end, we
want to have a data frame with 180 rows (one for each record) and 4
columns (to keep the date, the lie, the explanation and the URL). One
way to do so is to create an empty data frame and simply add a new row
as each new record is processed. However, this is not considered a good
practice. As suggested here,
we are going to create a single data frame for each record and store all
of them in a list. Once we have the 180 data frames, we’ll bind them
together using the bind_rows()
function from the dplyr package. This creates our
desired dataset.
library(dplyr)
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
date <- str_c(results[i] %>% html_elements("strong") %>% html_text(trim = TRUE), ", 2017")
lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
explanation <- str_sub(results[i] %>% html_elements(".short-truth") %>% html_text(trim = TRUE), 2, -2)
url <- results[i] %>% html_elements("a") %>% html_attr("href")
records[[i]] <- data_frame(date = date, lie = lie, explanation = explanation, url = url)
}
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
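We then bind the 180 single-row data frames into one and take a quick look. (This is a sketch; glimpse() from dplyr produces the column-by-column summary shown below.)

df <- bind_rows(records)  ## combine the 180 single-row data frames
glimpse(df)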
## Rows: 180
## Columns: 4
## $ date <chr> "Jan. 21, 2017", "Jan. 21, 2017", "Jan. 23, 2017", "Jan. 2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
Notice that the column for the date is considered a character vector.
It’d be nice to have it as a datetime vector instead. To do so, we can
use the lubridate package and use the mdy()
function
(month-day-year) to make the conversion.
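A minimal sketch of that conversion:

# library(lubridate) ## Already loaded
df$date <- mdy(df$date)  ## e.g. "Jan. 21, 2017" becomes 2017-01-21
glimpse(df)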
## Rows: 180
## Columns: 4
## $ date <date> 2017-01-21, 2017-01-21, 2017-01-23, 2017-01-25, 2017-01-2…
## $ lie <chr> "I wasn't a fan of Iraq. I didn't want to go into Iraq.", …
## $ explanation <chr> "He was for an invasion before he was against it.", "Trump…
## $ url <chr> "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-t…
If you want to export your dataset, you can use either the write.csv() function that comes by default with R, or the write_csv() function from the readr package, which is about twice as fast and more convenient than the base version.
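For example, using the file name from the full script below:

write_csv(df, "trump_lies.csv")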
Similarly, to retrieve your dataset, you can use either the default
function read.csv()
or the read_csv()
function
from the readr package.
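For example:

df <- read_csv("trump_lies.csv")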
## Rows: 180 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): lie, explanation, url
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The full code for this tutorial is shown below:
# Load packages
library(rvest)
library(stringr)
library(dplyr)
library(lubridate)
library(readr)
# Read web page
webpage <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
# Extract records info
results <- webpage %>% html_elements(".short-desc")
# Building the dataset
records <- vector("list", length = length(results))
for (i in seq_along(results)) {
date <- str_c(results[i] %>%
html_elements("strong") %>%
html_text(trim = TRUE), ', 2017')
lie <- str_sub(xml_contents(results[i])[2] %>% html_text(trim = TRUE), 2, -2)
explanation <- str_sub(results[i] %>%
html_elements(".short-truth") %>%
html_text(trim = TRUE), 2, -2)
url <- results[i] %>% html_elements("a") %>% html_attr("href")
records[[i]] <- data_frame(date = date, lie = lie, explanation = explanation, url = url)
}
df <- bind_rows(records)
# Transform to datetime format
df$date <- mdy(df$date)
# Export to csv
write_csv(df, "trump_lies.csv")
I also want to mention that the stringr, dplyr, lubridate and readr packages are all part of the tidyverse family. In fact, you could also use the popular purrr package to avoid the for loop. However, this would require the creation of a function that maps each record to a data frame. For another example on how to do web scraping, have a look at this awesome blog post from Dean Attali.
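Here’s a minimal sketch of that purrr alternative (the helper function name is mine, and the extraction logic simply mirrors the for loop above):

# library(purrr) ## Part of the tidyverse
parse_record <- function(result) {
  tibble(
    date        = str_c(result %>% html_elements("strong") %>% html_text(trim = TRUE), ", 2017"),
    lie         = str_sub(xml_contents(result)[2] %>% html_text(trim = TRUE), 2, -2),
    explanation = str_sub(result %>% html_elements(".short-truth") %>% html_text(trim = TRUE), 2, -2),
    url         = result %>% html_elements("a") %>% html_attr("href")
  )
}
df <- map_dfr(results, parse_record)  ## map over the 180 records and row-bind the results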
Hopefully you find this tutorial useful. Its purpose is not to show which programming language is better, but instead to learn from both Python and R, as well as to increase your programming skills and tools to tackle a more diverse set of problems.
We played fast and loose with HTML nodes, elements, and attributes above. Let me more carefully explain them here (with help from ChatGPT and GitHub CoPilot). HTML nodes, elements, and attributes are related, but slightly different concepts. Here’s a quick summary:
HTML nodes: The building blocks of any HTML document. Everything from text on the page, comments in the code, whitespace, and HTML elements themselves are considered nodes. They are subdivided into text, element, and comment nodes (among others).
HTML elements: The structural and semantic parts of the webpage. These define the type of content and how to display/structure it on the web page. Each HTML element has an opening and a closing tag. For example, <p> is the opening tag for a paragraph element, while </p> is the closing tag. Other examples:

- <a>, </a>: Anchor element, used to create hyperlinks.
- <div>, </div>: Division element, a generic block container used to group other elements together.
- <span>, </span>: Generic inline container for other elements.
- <ul>, </ul>: Unordered list element.

Elements can be nested, e.g. an anchor element within a paragraph element. Every HTML element is an HTML node, but not every HTML node is an HTML element. For example, text nodes are not elements.
HTML attributes: Additional information about an element, supplied inside its opening tag with the syntax attribute=value_of_attribute. For example, the <a> element can have an href attribute with the syntax <a href="https://www.google.com">. Other examples:

- src: Specifies the URL of an image to be displayed within an <img> element.
- class: Specifies one or more class names for an element. Used to reference elements from CSS.
- id: Specifies a unique id for an element. Used to reference elements from JavaScript.
- style: Specifies an inline CSS style for an element.
- width: Specifies the width of an element.
- height: Specifies the height of an element.

To summarise: we use the rvest package to read the HTML document into R and then parse the relevant nodes. If the data live in an HTML table, the pipeline read_html(URL) %>% html_elements(CSS_SELECTORS) %>% html_table() usually gets us most of the way there. If not, we parse the relevant text instead (html_text()).

In the next lecture, we’re going to focus on client-side web content
and interacting with APIs. For the moment, you can practice your
rvest
-based scraping skills by following along with any of
the many (many) tutorials available online. Lastly, we spoke a bit about
the “be nice” scraping motto at the beginning of the lecture. I also
wanted to point you to the polite package (link). It provides some
helpful tools to maintain web etiquette, such as checking for permission
and not hammering the host website with requests. As a bonus, it plays
very nicely with the rvest workflow that we covered
today, so please take a look. For more advanced webscraping, you’ll need
to use RSelenium, which simulates clicking through
webpages, so it can be used to scrape client-side content that relies on
JavaScript. It is rare that you’ll absolutely need to access data this
way, so I’ve left it out.
As we’ll see during the next lecture, scraping a website or application that is built on a client-side (i.e. API) framework is often easier; particularly when it comes to downloading information en masse.↩︎
XML stands for Extensible Markup Language and is one of the primary languages used for encoding and formatting web pages.↩︎
Historically, at least, it also had a tendency to provide CSS selectors that weren’t exactly what we were looking for. To be fair, this may have reflected some issues coming from the R + rvest as much as anything else. These minor incompatibilities have been largely eliminated with rvest 1.0.0.↩︎