Web scraping

Data Science in a Box

datasciencebox.org

Scraping the web

Scraping the web: what? why?

An increasing amount of data is available on the web
These data are often provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
- Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
- Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
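
To make the screen scraping scenario concrete, here is a minimal sketch that pulls the same values out of a page with an HTML parser and with a regular expression. The page content, the `price` class, and all variable names are made up for illustration, and the page is held in a string so that no network access is needed.

```r
library(rvest)

# A tiny, hypothetical page held in a string (no network access needed)
page_source <- '<html><body>
  <p class="price">Price: 12 USD</p>
  <p class="price">Price: 7 USD</p>
</body></html>'

# Approach 1: parse the HTML, then select nodes with a CSS selector
parsed_prices <- page_source %>%
  read_html() %>%
  html_nodes("p.price") %>%
  html_text()

# Approach 2: regular expression matching on the raw source text
regex_prices <- regmatches(
  page_source,
  gregexpr("Price: [0-9]+ USD", page_source)
)[[1]]

# Strip everything except digits to get a numeric vector for analysis
prices <- as.numeric(gsub("[^0-9]", "", parsed_prices))
```

Both approaches return the same strings on this page. The parser version depends only on the tag and class, so it is usually less fragile than a pattern tied to the exact wording of the text.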

Web Scraping with rvest

Hypertext Markup Language

Most of the data on the web is still available as HTML
It is structured (hierarchical / tree-based), but it's often not available in a form useful for analysis (flat / tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
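
As a rough illustration of going from the tree to a flat form, the sketch below parses the document above from a character string and collects three pieces of it into a one-row data frame. The column names and variable names are arbitrary choices made for this example.

```r
library(rvest)

# The example document, supplied as a character string
doc <- read_html('<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>')

# Pull individual pieces out of the tree
title_text <- doc %>% html_node("title") %>% html_text()
para_text  <- doc %>% html_node("p") %>% html_text()
para_align <- doc %>% html_node("p") %>% html_attr("align")

# Flatten them into a tidy, one-row data frame
flat <- data.frame(
  title = title_text,
  paragraph = para_text,
  align = para_align,
  stringsAsFactors = FALSE
)
```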

rvest

The rvest package makes basic processing and manipulation of HTML data straightforward
It's designed to work with pipelines built with %>%
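
A small sketch of what the pipeline style looks like in practice, compared with the equivalent nested calls. The HTML fragment and variable names are invented for the example.

```r
library(rvest)

# A hypothetical HTML fragment held in a string
html_source <- "<ul><li>apple</li><li>banana</li><li>cherry</li></ul>"

# Nested calls have to be read from the inside out
nested <- html_text(html_nodes(read_html(html_source), "li"))

# The same steps as a pipeline read from top to bottom
piped <- html_source %>%
  read_html() %>%
  html_nodes("li") %>%
  html_text()
```

The two forms give the same result; the pipeline simply lists the steps in the order they happen.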

Core rvest functions

read_html - Read HTML data from a URL or character string
html_node - Select the first node matching a selector from an HTML document
html_nodes - Select all nodes matching a selector from an HTML document
html_table - Parse an HTML table into a data frame
html_text - Extract the text content between tag pairs
html_name - Extract tags' names
html_attrs - Extract all of each tag's attributes
html_attr - Extract tags' attribute value by name
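
The functions listed above can be tried together on a small page. This is only a sketch: the page content, URLs, and variable names are hypothetical. Note that in rvest 1.0.0 and later, html_node and html_nodes are also available under the names html_element and html_elements.

```r
library(rvest)

# A hypothetical page held in a string (no network access needed)
page <- read_html('<html><body>
  <h1 id="top">Fruit prices</h1>
  <a href="https://example.com/a" class="ref">First link</a>
  <a href="https://example.com/b" class="ref">Second link</a>
  <table>
    <tr><th>fruit</th><th>price</th></tr>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>5</td></tr>
  </table>
</body></html>')

heading <- page %>% html_node("h1")   # first match only
links   <- page %>% html_nodes("a")   # every match

heading_text <- html_text(heading)        # text between <h1> and </h1>
heading_tag  <- html_name(heading)        # the tag name
link_urls    <- html_attr(links, "href")  # one attribute, by name
link_attrs   <- html_attrs(links)         # all attributes of each tag

# Parse the table into a data frame (first row is used as the header)
prices <- page %>% html_node("table") %>% html_table()
```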

SelectorGadget

Open-source tool that eases CSS selector generation and discovery
Easiest to use with the Chrome Extension
Find out more in the SelectorGadget vignette

Using the SelectorGadget

Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs
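
Once a selector has been found, it is passed to rvest as a string. The sketch below assumes a hypothetical page where titles carry the class `title` and an unwanted element carries the class `ad`; the class names and the resulting selector are illustrative, not output from SelectorGadget itself.

```r
library(rvest)

# Hypothetical page: titles have class "title", an ad has class "ad"
page <- read_html('<div>
  <span class="title">First film</span>
  <span class="ad">Sponsored</span>
  <span class="title">Second film</span>
</div>')

# Assume this is the selector you ended up with after clicking a title
# (selection) and then clicking the wrongly highlighted ad (rejection)
selector <- ".title"

titles <- page %>% html_nodes(selector) %>% html_text()

# A broader selector would also pick up the ad
too_broad <- page %>% html_nodes("span") %>% html_text()
```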
