
Web scraping considerations



Data Science in a Box


Ethics


"Can you?" vs "Should you?"


Challenges


Unreliable formatting at the source


Data broken into many pages


Workflow


Screen scraping vs. APIs

Two different scenarios for web scraping:

  • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

  • Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files
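
The two scenarios can be sketched side by side. This is a minimal illustration, not a recipe for any particular site: the HTML snippet, the JSON response, and the `li.title` selector are all made up and inlined so the code runs offline, and it assumes the rvest and jsonlite packages are installed.

```r
library(rvest)     # HTML parsing
library(jsonlite)  # JSON parsing

# Hypothetical page source, inlined here instead of downloaded from a website
page_source <- '<html><body>
  <ul>
    <li class="title">First title</li>
    <li class="title">Second title</li>
  </ul>
</body></html>'

# Screen scraping, option 1: HTML parser with a CSS selector (easy)
page <- read_html(page_source)
titles_parsed <- html_text2(html_elements(page, "li.title"))

# Screen scraping, option 2: regular expression matching (less easy,
# and it breaks as soon as the markup changes slightly)
pattern <- '(?<=<li class="title">)[^<]+'
titles_regex <- regmatches(page_source, gregexpr(pattern, page_source, perl = TRUE))[[1]]

# Web API: hypothetical JSON response, already structured
api_response <- '[{"title": "First title"}, {"title": "Second title"}]'
titles_api <- fromJSON(api_response)$title
```

All three approaches recover the same two titles here; the difference is how much of the page's formatting your code has to know about.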


A new R workflow

  • When working in an R Markdown document, your analysis is re-run each time you knit

  • If you scrape in an R Markdown document, you re-scrape the data each time you knit, which is undesirable (and not nice to the website you are scraping)!

  • An alternative workflow:

    • Put your scraping code in an R script
    • Save the interim data scraped by the script as CSV or RDS files
    • Use the saved data in your analysis in your R Markdown document
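
The alternative workflow might look roughly like this. It is a sketch under stated assumptions: `scrape_titles()` is a hypothetical stand-in that returns a fixed data frame instead of scraping anything, the file names are invented, and the files go to `tempdir()` only so the example has no side effects (in a real project a folder such as `data/` would be the more natural choice).

```r
# --- scrape.R: run by hand, only when the data need refreshing ---

# Hypothetical stand-in for the real scraping code
scrape_titles <- function() {
  data.frame(
    title = c("First title", "Second title"),
    year  = c(2001, 2002)
  )
}

out_dir <- tempdir()  # assumption: replace with a project folder in practice
scraped <- scrape_titles()

# Save the interim data in both formats mentioned above
write.csv(scraped, file.path(out_dir, "titles.csv"), row.names = FALSE)
saveRDS(scraped, file.path(out_dir, "titles.rds"))

# --- analysis.Rmd: each knit reads the saved file, no scraping ---

titles <- readRDS(file.path(out_dir, "titles.rds"))
titles_from_csv <- read.csv(file.path(out_dir, "titles.csv"))
```

RDS preserves column types exactly, while CSV is readable outside R; either way, knitting the analysis no longer touches the website.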