Web scraping



Data Science in a Box


Scraping the web


Scraping the web: what? why?

  • An increasing amount of data is available on the web
  • These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
  • Two different scenarios:
    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
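
The two screen-scraping routes can be compared on the same input. The sketch below is illustrative only: the page source and the `li.price` selector are made up for this example, and the HTML is inlined so nothing is fetched from the web.

```r
library(rvest)

# Hypothetical page source (an assumption, not a real website)
page_source <- '<html><body>
  <ul>
    <li class="price">10.50</li>
    <li class="price">7.25</li>
  </ul>
</body></html>'

# Route 1: HTML parser. Parse the document, then select nodes by CSS selector.
prices_parsed <- read_html(page_source) %>%
  html_nodes("li.price") %>%
  html_text() %>%
  as.numeric()

# Route 2: regular expression on the raw source. This works here, but it
# depends on the exact spelling of the opening tag.
pattern <- '(?<=<li class="price">)[0-9.]+'
matches <- regmatches(page_source, gregexpr(pattern, page_source, perl = TRUE))[[1]]
prices_regex <- as.numeric(matches)
```

Both routes give the same numbers for this input, but the regular expression would stop matching if the tag gained another attribute, which is one reason the parser is usually the easier option.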

Web Scraping with rvest


Hypertext Markup Language

  • Most data on the web is still available as HTML
  • It is structured (hierarchical / tree-based), but it's often not available in a form useful for analysis (flat / tidy).
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
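
As a rough illustration of going from the tree to a flat form, the document above can be parsed and its pieces collected into a one-row data frame. The column names (`title`, `paragraph`, `align`) are chosen for this sketch only.

```r
library(rvest)

# The example document from the slide, passed as a character string
doc <- read_html('<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>')

# Pull individual nodes out of the tree and lay them out as columns
flat <- data.frame(
  title     = doc %>% html_node("title") %>% html_text(),
  paragraph = doc %>% html_node("p") %>% html_text(),
  align     = doc %>% html_node("p") %>% html_attr("align"),
  stringsAsFactors = FALSE
)
```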

rvest

  • The rvest package makes basic processing and manipulation of HTML data straightforward
  • It's designed to work with pipelines built with %>%
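
To see what the pipeline style buys, here is the same extraction written as nested calls and as a `%>%` pipeline. The HTML fragment is a made-up input for this sketch.

```r
library(rvest)

# Hypothetical fragment with two paragraphs
html <- '<p>First</p><p>Second</p>'

# Nested calls: read from the inside out
nested <- html_text(html_nodes(read_html(html), "p"))

# Pipeline: read left to right, one step per function
piped <- read_html(html) %>%
  html_nodes("p") %>%
  html_text()
```

The two forms return the same result; the pipeline simply lists the steps in the order they happen.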


Core rvest functions

  • read_html - Read HTML data from a URL or character string
  • html_node - Select the first node matching a selector from an HTML document
  • html_nodes - Select all nodes matching a selector from an HTML document
  • html_table - Parse an HTML table into a data frame
  • html_text - Extract the text content of tags
  • html_name - Extract tags' names
  • html_attrs - Extract all of each tag's attributes
  • html_attr - Extract the value of a single attribute by name
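
A minimal sketch that exercises each of the functions listed above on a small inline page. The page content, ids, and links are invented for the example.

```r
library(rvest)

# Hypothetical page, inlined so no network access is needed
page <- read_html('<html><body>
  <h1 id="top" class="headline">Fruit prices</h1>
  <table>
    <tr><th>fruit</th><th>price</th></tr>
    <tr><td>apple</td><td>1.2</td></tr>
    <tr><td>pear</td><td>0.8</td></tr>
  </table>
  <a href="/apples">Apples</a>
  <a href="/pears">Pears</a>
</body></html>')

first_link <- page %>% html_node("a")    # first matching node only
all_links  <- page %>% html_nodes("a")   # every matching node

link_text <- all_links %>% html_text()         # text inside each <a>
link_urls <- all_links %>% html_attr("href")   # one attribute, by name

h1_name  <- page %>% html_node("h1") %>% html_name()   # tag name
h1_attrs <- page %>% html_node("h1") %>% html_attrs()  # all attributes of the node

prices <- page %>% html_node("table") %>% html_table() # table as a data frame
```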

SelectorGadget


Using the SelectorGadget


Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.
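
Once a selector has been found, it can be passed to `html_nodes()`. In the sketch below, the selectors `.title a` and `.year` stand in for whatever SelectorGadget would report; they and the page fragment are assumptions made for illustration.

```r
library(rvest)

# Hypothetical listing page with two items
page <- read_html('<div class="item">
  <span class="title"><a href="/a">Alpha</a></span>
  <span class="year">1999</span>
</div>
<div class="item">
  <span class="title"><a href="/b">Beta</a></span>
  <span class="year">2004</span>
</div>')

# Use the CSS selectors to pull out the matching nodes
titles <- page %>% html_nodes(".title a") %>% html_text()
years  <- page %>% html_nodes(".year") %>% html_text() %>% as.numeric()

# Combine the pieces into a structured dataset
items <- data.frame(title = titles, year = years, stringsAsFactors = FALSE)
```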
