Functions

Data Science in a Box

datasciencebox.org

First Minister's COVID speeches

🏁 Start with

End with 🛑

## # A tibble: 218 × 6
##    title                 date       location abstract text  url  
##    <chr>                 <date>     <chr>    <chr>    <chr> <chr>
##  1 Coronavirus (COVID-1… 2021-04-20 St Andr… Stateme… "Goo… http…
##  2 Coronavirus (COVID-1… 2021-04-13 St Andr… Stateme… "Tha… http…
##  3 Coronavirus (COVID-1… 2021-04-06 St Andr… Stateme… "Goo… http…
##  4 Coronavirus (COVID-1… 2021-03-30 St Andr… Stateme… "Tha… http…
##  5 Coronavirus (COVID-1… 2021-03-24 Scottis… Stateme… "Tha… http…
##  6 Coronavirus (Covid-1… 2021-03-23 The Sco… Stateme… "Pre… http…
##  7 Coronavirus (COVID-1… 2021-03-18 Scottis… Stateme… "Tha… http…
##  8 Coronavirus (COVID-1… 2021-03-17 St Andr… Stateme… "\nG… http…
##  9 Coronavirus (COVID-1… 2021-03-16 Scottis… Stateme… "Pre… http…
## 10 Coronavirus (COVID-1… 2021-03-15 St Andr… Stateme… "\nG… http…
## 11 Coronavirus (COVID-1… 2021-03-11 Scottis… Stateme… "I c… http…
## 12 Coronavirus (COVID-1… 2021-03-09 Scottis… Stateme… "Pre… http…
## 13 Coronavirus (COVID-1… 2021-03-05 Scottis… Parliam… "Hel… http…
## 14 Coronavirus (COVID-1… 2021-03-04 Scottis… Parliam… "I w… http…
## 15 Coronavirus (COVID-1… 2021-03-02 Scottis… Stateme… "Pre… http…
## # … with 203 more rows

www.gov.scot/collections/first-ministers-speeches

Plan

Scrape title, date, location, abstract, and text from a few COVID-19 speech pages to develop the code
Write a function that scrapes title, date, location, abstract, and text from COVID-19 speech pages
Scrape the URLs of COVID-19 speeches from the main page
Use this function to scrape each individual COVID-19 speech from these URLs and create a data frame with the columns title, date, location, abstract, text, and url

Scrape data from a few COVID-19 speech pages

Read page for 26 Oct speech

library(tidyverse)
library(rvest)

url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/"
speech_page <- read_html(url)

speech_page

## {html_document}
## <html dir="ltr" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ...

Extract title

title <- speech_page %>%
    html_node(".article-header__title") %>%
    html_text()
title

## [1] "Coronavirus (COVID-19) update: First Minister's speech 26 October"

Extract date

library(lubridate)
speech_page %>%
    html_node(".content-data__list:nth-child(1) strong") %>%
    html_text()

## [1] "26 Oct 2020"

date <- speech_page %>%
    html_node(".content-data__list:nth-child(1) strong") %>%
    html_text() %>%
    dmy()
date

## [1] "2020-10-26"
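
As a small aside, dmy() parses day-month-year strings in several common layouts, which helps if the date format varies from page to page. The sketch below is illustrative only: just "26 Oct 2020" comes from the speech page, and the other two spellings are made-up inputs.

```r
library(lubridate)

# Three spellings of the same day; only the first appears on the speech page,
# the other two are illustrative assumptions.
raw_dates <- c("26 Oct 2020", "26 October 2020", "26/10/2020")

# dmy() returns a Date vector, one element per input string
parsed <- dmy(raw_dates)
parsed

# Check that every spelling parsed to the same Date value
all(parsed == as.Date("2020-10-26"))
```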

Extract location

location <- speech_page %>%
    html_node(".content-data__list+ .content-data__list strong") %>%
    html_text()
location

## [1] "St Andrew's House, Edinburgh"

Extract abstract

abstract <- speech_page %>%
    html_node(".leader--first-para p") %>%
    html_text()
abstract

## [1] "Statement given by First Minister Nicola Sturgeon at a media briefing in St Andrew's House on Monday 26 October 2020."

Extract text

text <- speech_page %>% 
    html_nodes("#preamble p") %>%
    html_text() %>%
    list()
text

## [[1]]
##  [1] "\nGood afternoon, and thanks for joining us. I want to start with the usual daily report on the COVID statistics."
##  [2] "The total number of positive cases reported yesterday was 1,122."
##  [3] "This represents 7.1% of the total number of tests carried out. 428 of the new cases were in Greater Glasgow and Clyde, 274 in Lanarkshire, 105 in Lothian and 97 in Ayrshire and Arran. "
##  [4] "The remaining cases were spread across the mainland health board regions. "
##  [5] "The total number of confirmed cases is now 57,874."
##  [6] "I can also confirm that 1,152 people are in hospital – that is an increase of 36 from yesterday"
##  [7] "90 people are in intensive care, which is four more than yesterday."
##  [8] "And I regret to say that in the last 24 hours, one further death has been registered of a patient who first tested positive over the previous 28 days.  It is important though to remember that registration offices tend not to be open as normal over the weekend so the Sunday and Monday figures are often lower."
##  [9] "We also reported 11 deaths on Saturday, and one yesterday.  So since the last briefing on Friday, 13 additional deaths have been registered. That takes the total number of deaths, under this measurement, to 2,701."
## [10] "That reminds us again of how dangerous this virus can be and I want to send my condolences to everyone who has lost someone."
...

Put it all in a data frame

oct_26_speech <- tibble(
  title    = title,
  date     = date,
  location = location,
  abstract = abstract,
  text     = text,
  url      = url
)
oct_26_speech

## # A tibble: 1 × 6
##   title                  date       location abstract text  url  
##   <chr>                  <date>     <chr>    <chr>    <lis> <chr>
## 1 Coronavirus (COVID-19… 2020-10-26 St Andr… Stateme… <chr> http…
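
The text column shows up as a list-column because text was wrapped in list() before going into the tibble. Here is a minimal sketch of why that matters, using a made-up two-paragraph speech rather than scraped data: without list(), the tibble gets one row per paragraph instead of one row per speech.

```r
library(tibble)

# Hypothetical stand-in for the scraped paragraphs of one speech
paragraphs <- c("Good afternoon.", "Thank you for joining us.")

# Wrapped in list(): a single row whose text cell holds both paragraphs
one_row <- tibble(title = "Example speech", text = list(paragraphs))

# Not wrapped: title is recycled and each paragraph gets its own row
two_rows <- tibble(title = "Example speech", text = paragraphs)

nrow(one_row)   # 1
nrow(two_rows)  # 2
```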

Read page for 23 Oct speech

url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/"
speech_page <- read_html(url)

speech_page

## {html_document}
## <html dir="ltr" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ...

Extract components of 23 Oct speech

title <- speech_page %>%
  html_node(".article-header__title") %>%
  html_text()
date <- speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text() %>%
  dmy()
location <- speech_page %>%
  html_node(".content-data__list+ .content-data__list strong") %>%
  html_text()
abstract <- speech_page %>%
  html_node(".leader--first-para p") %>%
  html_text()
text <- speech_page %>%
  html_nodes("#preamble p") %>%
  html_text() %>%
  list()

Put it all in a data frame

oct_23_speech <- tibble(
  title    = title,
  date     = date,
  location = location,
  abstract = abstract,
  text     = text,
  url      = url
)
oct_23_speech

## # A tibble: 1 × 6
##   title                  date       location abstract text  url  
##   <chr>                  <date>     <chr>    <chr>    <lis> <chr>
## 1 Coronavirus (COVID-19… 2020-10-23 St Andr… Stateme… <chr> http…

this is getting tiring...

Functions

When should you write a function?

When you’ve copied and pasted a block of code more than twice.

How many times will we need to copy and paste the code we developed to scrape data on all of the First Minister's COVID-19 speeches?

Why functions?

Automate common tasks in a more powerful and general way than copy-and-pasting:
- Give your function an evocative name that makes your code easier to understand
- As requirements change, you only need to update code in one place, instead of many
- Eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another)

Down the line: Improve your reach as a data scientist by writing functions (and packages!) that others use

Assuming that the page structure is the same for each speech page, how many "things" do you need to know for each speech page to scrape the data we want from it?

url_23_oct <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/"
speech_page <- read_html(url_23_oct)
title <- speech_page %>%
  html_node(".article-header__title") %>%
  html_text()
date <- speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text() %>%
  dmy()
location <- speech_page %>%
  html_node(".content-data__list+ .content-data__list strong") %>%
  html_text()
abstract <- speech_page %>%
  html_node(".leader--first-para p") %>%
  html_text()
text <- speech_page %>%
  html_nodes("#preamble p") %>%
  html_text() %>%
  list()
tibble(
  title = title, date = date, location = location,
  abstract = abstract, text = text, url = url_23_oct
)

Turn your code into a function

Pick a short but informative name, preferably a verb.
List the inputs, or arguments, to the function inside function(). If we had more, the call would look like function(x, y, z).
Place the code you have developed in the body of the function, a { block that immediately follows function(...).

scrape_speech <- function(url){
  # code we developed earlier to scrape info
  # on a single speech goes here
}

scrape_speech()

scrape_speech <- function(url) {
  speech_page <- read_html(url)
  title <- speech_page %>%
    html_node(".article-header__title") %>%
    html_text()
  date <- speech_page %>%
    html_node(".content-data__list:nth-child(1) strong") %>%
    html_text() %>%
    dmy()
  location <- speech_page %>%
    html_node(".content-data__list+ .content-data__list strong") %>%
    html_text()
  abstract <- speech_page %>%
    html_node(".leader--first-para p") %>%
    html_text()
  text <- speech_page %>%
    html_nodes("#preamble p") %>%
    html_text() %>%
    list()
  tibble(
    title = title, date = date, location = location,
    abstract = abstract, text = text, url = url
  )
}

Function in action

scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/") %>%
  glimpse()

## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon, and thanks for joining us.…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…
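
The NA values for title, date, location, and abstract are worth a second look. One possible cause (an assumption, not something the output confirms) is that the CSS selectors no longer match the page: html_node() returns a missing node when nothing matches, and html_text() then gives NA instead of an error. The sketch below reproduces that behaviour on a tiny hand-written HTML snippet; the snippet and the .no-such-class selector are made up for illustration.

```r
library(rvest)

# A tiny hand-written page standing in for a real speech page
page <- read_html(
  '<html><body><h1 class="article-header__title">Example title</h1></body></html>'
)

# Selector that matches an element: returns its text
found <- page %>%
  html_node(".article-header__title") %>%
  html_text()

# Selector that matches nothing: returns NA rather than raising an error
not_found <- page %>%
  html_node(".no-such-class") %>%
  html_text()

found
not_found
is.na(not_found)
```

A check like is.na() on each scraped field is one simple way to notice when a page layout has changed.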

Function in action

scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/") %>%
  glimpse()

## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon everyone. Thank you very mu…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…

Function in action

scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-22-october/") %>%
  glimpse()

## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon, let me start as usual with…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…
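
With a function in hand, the last step of the plan (one row per speech, stacked into a single data frame) becomes a matter of applying the function to each URL. This is a minimal sketch, assuming dplyr is available. To keep it runnable without network access it uses a hypothetical stand-in, fake_scrape(), and made-up URLs; in practice scrape_speech() and the real speech URLs would take their place.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for scrape_speech(): returns a one-row tibble per URL
fake_scrape <- function(url) {
  tibble(title = paste("Speech at", url), url = url)
}

# Made-up URLs for illustration only
urls <- c("https://example.org/speech-1", "https://example.org/speech-2")

# Apply the function to every URL, then stack the one-row tibbles
speeches <- bind_rows(lapply(urls, fake_scrape))
speeches
```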

Writing functions

What goes in / what comes out?

Functions take input(s) defined in the function definition

function([inputs separated by commas]){
  # what to do with those inputs
}

By default they return the last value computed in the function

scrape_page <- function(x){
  # do bunch of stuff with the input...
  # return a tibble
  tibble(...)
}

You can define more outputs to be returned in a list as well as nice print methods (but we won't go there for now...)
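
A minimal sketch of both ideas, using a hypothetical function that is not part of the scraping code: describe_range() takes two inputs and, because a function returns a single object, bundles several outputs into a named list.

```r
# Hypothetical example: two inputs (x and digits), several outputs in one list
describe_range <- function(x, digits) {
  lo <- round(min(x), digits)
  hi <- round(max(x), digits)
  # The last value computed is what the function returns
  list(min = lo, max = hi, width = hi - lo)
}

result <- describe_range(c(2.345, 7.891, 4.5), digits = 1)
result$min    # 2.3
result$max    # 7.9
result$width  # 5.6
```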

What is going on here?

add_2 <- function(x){
  x + 2
  1000
}

add_2(3)

## [1] 1000

add_2(10)

## [1] 1000
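
The answer follows from the return rule: x + 2 is computed and then thrown away, because 1000 is the last value computed in the function. Here is a sketch of a corrected version, under the hypothetical name add_2_fixed():

```r
# Original: x + 2 is evaluated but discarded; 1000 is the last value
add_2 <- function(x) {
  x + 2
  1000
}

# Corrected (hypothetical name): x + 2 is now the last value computed
add_2_fixed <- function(x) {
  x + 2
}

add_2(3)         # 1000
add_2_fixed(3)   # 5
add_2_fixed(10)  # 12
```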

Naming functions

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

Names should be short but clearly evoke what the function does
Names should be verbs, not nouns
Multi-word names should be separated by underscores (snake_case as opposed to camelCase)
A family of functions should be named similarly (scrape_page(), scrape_speech() OR str_remove(), str_replace() etc.)
Avoid overwriting existing (especially widely used) functions

# JUST DON'T
mean <- function(x){
  x * 3
}
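
To see why, here is a sketch of what happens after that definition: the new mean() masks the base R one in your workspace, so later calls silently give a different answer. base::mean() still reaches the original, and rm() removes the masking copy. The variable names are made up for illustration.

```r
# Overwrite mean() as on the slide (don't do this in real code)
mean <- function(x) {
  x * 3
}

masked <- mean(c(1, 2, 3))          # calls our version: 3 6 9
original <- base::mean(c(1, 2, 3))  # explicitly calls base R: 2

# Remove our definition so mean() refers to base R again
rm(mean)
restored <- mean(c(1, 2, 3))        # 2
```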
