class: center, middle, inverse, title-slide

.title[
# Web scraping considerations
]
.subtitle[
## <br><br> Data Science in a Box
]
.author[
### <a href="https://datasciencebox.org/">datasciencebox.org</a>
]

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# Ethics

---

## "Can you?" vs "Should you?"

<img src="img/ok-cupid-1.png" width="60%" style="display: block; margin: auto;" />

.footnote[.small[
Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox.
]]

---

## "Can you?" vs "Should you?"

<img src="img/ok-cupid-2.png" width="70%" style="display: block; margin: auto;" />

---

class: middle

# Challenges

---

## Unreliable formatting at the source

<img src="img/unreliable-formatting.png" width="70%" style="display: block; margin: auto;" />

---

## Data broken into many pages

<img src="img/many-pages.png" width="70%" style="display: block; margin: auto;" />

---

class: middle

# Workflow

---

## Screen scraping vs. APIs

Two different scenarios for web scraping:

- Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

- Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files
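
To make the contrast concrete, here is a minimal sketch that pulls the same two titles out of an HTML snippet and out of a JSON string. It assumes the rvest and jsonlite packages are installed; the inline HTML, the JSON string, and the `.title` class are made-up stand-ins for a real page and a real API response.

```r
library(rvest)     # HTML parsing
library(jsonlite)  # JSON parsing

# Screen scraping: parse the HTML, then select nodes with a CSS selector
# (the markup below is a hypothetical stand-in for a downloaded page)
page <- minimal_html('
  <ul>
    <li class="title">First post</li>
    <li class="title">Second post</li>
  </ul>
')
scraped_titles <- page %>%
  html_elements(".title") %>%
  html_text2()

# Web API: the response is already structured, so parsing is a single call
# (the JSON below is a hypothetical stand-in for an API response body)
response <- '[{"title": "First post"}, {"title": "Second post"}]'
api_titles <- fromJSON(response)$title
```

Both approaches end with the same character vector; the difference is how much of the page's layout the code has to know about.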

---

## A new R workflow

- When working in an R Markdown document, your analysis is re-run each time you knit

- If you scrape the web in an R Markdown document, you re-scrape the data each time you knit, which is undesirable (and not *nice*)!

- An alternative workflow: 
  - Use an R script to save your code 
  - Save interim data scraped using the code in the script as CSV or RDS files
  - Use the saved data in your analysis in your R Markdown document
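
The alternative workflow above can be sketched as a small caching helper kept in an R script. This is a minimal sketch under stated assumptions, not the course's own code: `load_or_scrape()`, `fake_scrape()`, and the file path are hypothetical names introduced for illustration, and the stand-in scraper returns a fixed data frame instead of contacting a website.

```r
# scrape.R: scrape once, then reuse the saved file on later runs

# Return the saved data if it exists; otherwise scrape, save, and return it
load_or_scrape <- function(path, scrape_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  data <- scrape_fn()
  saveRDS(data, path)
  data
}

# Stand-in for real scraping code (hypothetical; replace with your own).
# The counter records how many times "scraping" actually happens.
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  data.frame(title = c("A", "B"), price = c(10, 20))
}

path <- file.path(tempdir(), "scraped.rds")
unlink(path)  # start from a clean state for this demonstration

first  <- load_or_scrape(path, fake_scrape)  # scrapes and saves
second <- load_or_scrape(path, fake_scrape)  # reads the saved file instead
```

In the R Markdown document, a chunk would then only call `readRDS()` on the saved file, so knitting never touches the website.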
