Screenscraping

Chris Bail
Duke University

How to Follow Along with Me

RPres (HTML) Files (slides)

Markdown Files (annotated code)

What is Screen-Scraping?

Is Screen-Scraping Legal?

Warning: Screen-Scraping is Frustrating

Reading a Web-Page into R


install.packages("rvest")
library(rvest)

Scraping a Wikipedia Page

We are going to begin by scraping this very simple web page from Wikipedia.

What a Web Page Looks like to a Computer

Downloading HTML

wikipedia_page<-read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")

Downloading HTML

wikipedia_page
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Parsing HTML

Right Click "Inspect"

The XPath (Right Click, copy XPath)

Using the XPath

section_of_wikipedia<-html_node(wikipedia_page, xpath='//*[@id="mw-content-text"]/div/table')
head(section_of_wikipedia)
$node
<pointer: 0x7f88b2b11e10>

$doc
<pointer: 0x7f88b2b02470>

Extracting the Table

health_rankings<-html_table(section_of_wikipedia)
head(health_rankings[,(1:2)])
              Country Attainment of goals / Health / Level (DALE)
1         Afghanistan                                         168
2             Albania                                         102
3             Algeria                                          84
4             Andorra                                          10
5              Angola                                         165
6 Antigua and Barbuda                                          48
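The column names that come out of html_table() are long. A minimal cleanup sketch in base R, assuming the health_rankings data frame created above (the short names "country" and "dale_rank" are my own choice):

```r
#give the first two columns shorter names
names(health_rankings)[1:2]<-c("country","dale_rank")

#make sure the rank column is numeric before sorting
health_rankings$dale_rank<-as.numeric(health_rankings$dale_rank)

#view the top-ranked countries
head(health_rankings[order(health_rankings$dale_rank),1:2])
```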

When the XPath fails...

A more complicated page: www.duke.edu

Selector Gadget

Parsing with the CSS Selector

duke_page<-read_html("https://www.duke.edu")
duke_events<-html_nodes(duke_page, css="li:nth-child(1) .epsilon")
html_text(duke_events)
[1] "Six Weeks of Modern Dance Across the Campus\n\n\t\t\t\t\t\t\t"
[2] "Get Started with Get Moving\n\n\t\t\t\t\t\t\t"                
[3] "University Worship - Rev. Dr.  Luke A. Powery"                
[4] "Doctor Dolls, Coming Soon in 3-D"                             
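The extracted text includes stray newlines and tabs. A minimal cleanup sketch in base R, assuming the duke_events nodes from above (newer versions of rvest also provide html_text2(), which collapses this whitespace automatically):

```r
#replace newlines and tabs with spaces
clean_text<-gsub("[\n\t]"," ",html_text(duke_events))

#trim leading and trailing whitespace
clean_text<-trimws(clean_text)
```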

Browser Automation

RSelenium

devtools::install_github("ropensci/RSelenium")
library(RSelenium)

Note: you may need to install Java to get up and running; see this tutorial

RSelenium

rD <- rsDriver()
remDr <- rD$client
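rsDriver() launches a Selenium server in the background, so when you are finished scraping you should close the browser and stop the server to free the port. A short sketch using the objects created above:

```r
#close the browser window...
remDr$close()

#...and shut down the background Selenium server
rD$server$stop()
```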

Launch a webpage

remDr$navigate("https://www.duke.edu")

Navigate to the search bar

search_box <- remDr$findElement(using = 'css selector', 'fieldset input')

Input a search

#"\uE007" is the code for the Enter/Return key
search_box$sendKeysToElement(list("data science", "\uE007"))
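Once the search has been submitted, you can hand the rendered page back to rvest for parsing. A sketch, assuming the results page has had time to load:

```r
#pause briefly so the results can load
Sys.sleep(2)

#pull the rendered HTML out of the browser...
page_source<-remDr$getPageSource()[[1]]

#...and parse it with rvest as before
results_page<-read_html(page_source)
```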

Screenscraping within a Loop

Example (non-functional) code:

#create list of websites to scrape
my_list_of_websites<-c("www.duke.edu","www.penn.edu")

#create place to store text data
text_data<-as.data.frame(NULL)

#loop
for(i in 1:length(my_list_of_websites)){

  #read in page i and extract text
  page<-read_html(paste0("https://",my_list_of_websites[i]))
  events<-html_nodes(page, css="li:nth-child(1) .epsilon")
  text<-html_text(events)

  #store text in dataset created above, tagged by site
  text_data<-rbind(text_data,data.frame(site=my_list_of_websites[i],text=text))

  #print iteration for de-bugging
  print(i)
}
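In a real loop, one unreachable page will stop the entire run. One way to guard against this is to wrap the scraping step in tryCatch(), and to pause between requests so you do not overload the servers. A hedged sketch using the same objects as above:

```r
for(i in 1:length(my_list_of_websites)){

  #record NA if a page fails to load, instead of stopping the loop
  text<-tryCatch({
    page<-read_html(paste0("https://",my_list_of_websites[i]))
    html_text(html_nodes(page, css="li:nth-child(1) .epsilon"))
  }, error=function(e) NA)

  text_data<-rbind(text_data,data.frame(site=my_list_of_websites[i],text=text))

  #pause politely between requests
  Sys.sleep(2)
}
```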

When Should I Use Screen-scraping?

Remember Annotated Code Here: