class: center, middle, inverse, title-slide

.title[
# Functions
]
.subtitle[
## Data Science in a Box
]
.author[
### datasciencebox.org
]

---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# First Minister's COVID speeches

---

## Start with

<img src="img/fm-speeches.png" width="75%" style="display: block; margin: auto;" />

---

## End with

```
## # A tibble: 218 × 6
##    title                 date       location abstract text  url  
##    <chr>                 <date>     <chr>    <chr>    <chr> <chr>
##  1 Coronavirus (COVID-1… 2021-04-20 St Andr… Stateme… "Goo… http…
##  2 Coronavirus (COVID-1… 2021-04-13 St Andr… Stateme… "Tha… http…
##  3 Coronavirus (COVID-1… 2021-04-06 St Andr… Stateme… "Goo… http…
##  4 Coronavirus (COVID-1… 2021-03-30 St Andr… Stateme… "Tha… http…
##  5 Coronavirus (COVID-1… 2021-03-24 Scottis… Stateme… "Tha… http…
##  6 Coronavirus (Covid-1… 2021-03-23 The Sco… Stateme… "Pre… http…
##  7 Coronavirus (COVID-1… 2021-03-18 Scottis… Stateme… "Tha… http…
##  8 Coronavirus (COVID-1… 2021-03-17 St Andr… Stateme… "\nG… http…
##  9 Coronavirus (COVID-1… 2021-03-16 Scottis… Stateme… "Pre… http…
## 10 Coronavirus (COVID-1… 2021-03-15 St Andr… Stateme… "\nG… http…
## 11 Coronavirus (COVID-1… 2021-03-11 Scottis… Stateme… "I c… http…
## 12 Coronavirus (COVID-1… 2021-03-09 Scottis… Stateme… "Pre… http…
## 13 Coronavirus (COVID-1… 2021-03-05 Scottis… Parliam… "Hel… http…
## 14 Coronavirus (COVID-1… 2021-03-04 Scottis… Parliam… "I w… http…
## 15 Coronavirus (COVID-1… 2021-03-02 Scottis… Stateme… "Pre… http…
## # … with 203 more rows
```

---

#### .center[
[www.gov.scot/collections/first-ministers-speeches](https://www.gov.scot/collections/first-ministers-speeches/)
]

<img src="img/fm-speeches-annotated.png" width="75%" style="display: block; margin: auto;" />

---

<img src="img/fm-speech-oct-26-annotated.png" width="65%" style="display: block; margin: auto;" />

---

## Plan

1. Scrape `title`, `date`, `location`, `abstract`, and `text` from a few COVID-19 speech pages to develop the code
2. Write a function that scrapes `title`, `date`, `location`, `abstract`, and `text` from COVID-19 speech pages
3. Scrape the `url`s of COVID-19 speeches from the main page
4. Use this function to scrape each individual COVID-19 speech from these `url`s and create a data frame with the columns `title`, `date`, `location`, `abstract`, `text`, and `url`

---

class: middle

# Scrape data from a few COVID-19 speech pages

---

## Read page for 26 Oct speech

```r
library(tidyverse) # pipe, tibble(), glimpse()
library(rvest)     # read_html(), html_node(), html_text()

url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/"
speech_page <- read_html(url)
```

.pull-left[
```r
speech_page
```

```
## {html_document}
## <html dir="ltr" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ...
```
]
.pull-right[
<img src="img/fm-speech-oct-26.png" width="80%" style="display: block; margin: auto;" />
]

---

## Extract title

.pull-left-wide[
<br><br>
```r
title <- speech_page %>%
  html_node(".article-header__title") %>%
  html_text()

title
```

```
## [1] "Coronavirus (COVID-19) update: First Minister's speech 26 October"
```
]
.pull-right-narrow[
<img src="img/title.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract date

.pull-left-wide[
```r
library(lubridate)

speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text()
```

```
## [1] "26 Oct 2020"
```

```r
date <- speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text() %>%
  dmy()

date
```

```
## [1] "2020-10-26"
```
]
.pull-right-narrow[
<img src="img/date.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract location

.pull-left-wide[
```r
location <- speech_page %>%
  html_node(".content-data__list+ .content-data__list strong") %>%
  html_text()

location
```

```
## [1] "St Andrew's House, Edinburgh"
```
]
.pull-right-narrow[
<img src="img/location.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract abstract

.pull-left-wide[
```r
abstract <- speech_page %>%
  html_node(".leader--first-para p") %>%
  html_text()

abstract
```

```
## [1] "Statement given by First Minister Nicola Sturgeon at a media briefing in St Andrew's House on Monday 26 October 2020."
```
]
.pull-right-narrow[
<img src="img/abstract.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract text

.pull-left-wide[
```r
text <- speech_page %>%
  html_nodes("#preamble p") %>%
  html_text() %>%
  list()

text
```

```
## [[1]]
##  [1] "\nGood afternoon, and thanks for joining us. I want to start with the usual daily report on the COVID statistics."
##  [2] "The total number of positive cases reported yesterday was 1,122."
##  [3] "This represents 7.1% of the total number of tests carried out. 428 of the new cases were in Greater Glasgow and Clyde, 274 in Lanarkshire, 105 in Lothian and 97 in Ayrshire and Arran. "
##  [4] "The remaining cases were spread across the mainland health board regions. "
##  [5] "The total number of confirmed cases is now 57,874."
##  [6] "I can also confirm that 1,152 people are in hospital - that is an increase of 36 from yesterday"
##  [7] "90 people are in intensive care, which is four more than yesterday."
##  [8] "And I regret to say that in the last 24 hours, one further death has been registered of a patient who first tested positive over the previous 28 days. It is important though to remember that registration offices tend not to be open as normal over the weekend so the Sunday and Monday figures are often lower."
##  [9] "We also reported 11 deaths on Saturday, and one yesterday. So since the last briefing on Friday, 13 additional deaths have been registered. That takes the total number of deaths, under this measurement, to 2,701."
## [10] "That reminds us again of how dangerous this virus can be and I want to send my condolences to everyone who has lost someone."
...
```
]
.pull-right-narrow[
<img src="img/text.png" width="100%" style="display: block; margin: auto;" />
]

---

## Put it all in a data frame

.pull-left[
```r
oct_26_speech <- tibble(
  title    = title,
  date     = date,
  location = location,
  abstract = abstract,
  text     = text,
  url      = url
)

oct_26_speech
```

```
## # A tibble: 1 × 6
##   title                  date       location abstract text  url  
##   <chr>                  <date>     <chr>    <chr>    <lis> <chr>
## 1 Coronavirus (COVID-19… 2020-10-26 St Andr… Stateme… <chr> http…
```
]
.pull-right[
<img src="img/fm-speech-oct-26.png" width="75%" style="display: block; margin: auto;" />
]

---

## Read page for 23 Oct speech

```r
url <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/"
speech_page <- read_html(url)
```

```r
speech_page
```

```
## {html_document}
## <html dir="ltr" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body class="fontawesome site-header__container">\n\n\n\n\ ...
```

---

## Extract components of 23 Oct speech

```r
title <- speech_page %>%
  html_node(".article-header__title") %>%
  html_text()

date <- speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text() %>%
  dmy()

location <- speech_page %>%
  html_node(".content-data__list+ .content-data__list strong") %>%
  html_text()

abstract <- speech_page %>%
  html_node(".leader--first-para p") %>%
  html_text()

text <- speech_page %>%
  html_nodes("#preamble p") %>%
  html_text() %>%
  list()
```

---

## Put it all in a data frame

.pull-left[
```r
oct_23_speech <- tibble(
  title    = title,
  date     = date,
  location = location,
  abstract = abstract,
  text     = text,
  url      = url
)

oct_23_speech
```

```
## # A tibble: 1 × 6
##   title                  date       location abstract text  url  
##   <chr>                  <date>     <chr>    <chr>    <lis> <chr>
## 1 Coronavirus (COVID-19… 2020-10-23 St Andr… Stateme… <chr> http…
```
]
.pull-right[
<img src="img/fm-speech-oct-23.png" width="75%" style="display: block; margin: auto;" />
]

---

class: middle

.larger[
.light-blue[
.hand[
this is getting tiring...
]
]
]

---

class: middle

# Functions

---

## When should you write a function?

--

.pull-left[
<img src="img/funct-all-things.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
When you've copied and pasted a block of code more than twice.
]

---

.question[
How many times will we need to copy and paste the code we developed to scrape data on all of the First Minister's COVID-19 speeches?
]

<img src="img/search-result.png" width="55%" style="display: block; margin: auto;" />

---

## Why functions?

- Automate common tasks in a more powerful and general way than copy-and-pasting:
    - Give your function an evocative name that makes your code easier to understand
    - As requirements change, you only need to update code in one place, instead of many
    - Eliminate the chance of making incidental mistakes when you copy and paste (e.g. 
updating a variable name in one place, but not in another)

--

- Down the line: Improve your reach as a data scientist by writing functions (and packages!) that others use

---

.question[
Assuming that the page structure is the same for each speech page, how many "things" do you need to know for each speech page to scrape the data we want from it?
]

.pull-left-wide[
.xsmall[
```r
url_23_oct <- "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/"
speech_page <- read_html(url_23_oct)

title <- speech_page %>%
  html_node(".article-header__title") %>%
  html_text()

date <- speech_page %>%
  html_node(".content-data__list:nth-child(1) strong") %>%
  html_text() %>%
  dmy()

location <- speech_page %>%
  html_node(".content-data__list+ .content-data__list strong") %>%
  html_text()

abstract <- speech_page %>%
  html_node(".leader--first-para p") %>%
  html_text()

text <- speech_page %>%
  html_nodes("#preamble p") %>%
  html_text() %>%
  list()

tibble(
  title = title, date = date, location = location,
  abstract = abstract, text = text, url = url_23_oct
)
```
]
]

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.

<br>
<br>
<br>
<br>

```r
scrape_speech <- 
```

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.
- List inputs, or **arguments**, to the function inside `function`. If we had more than one, the call would look like `function(x, y, z)`.

<br>

```r
scrape_speech <- function(x){

}
```

---

## Turn your code into a function

- Pick a short but informative **name**, preferably a verb.
- List inputs, or **arguments**, to the function inside `function`. If we had more than one, the call would look like `function(x, y, z)`.
- Place the **code** you have developed in the body of the function, a `{` block that immediately follows `function(...)`.

```r
scrape_speech <- function(url){

  # code we developed earlier to scrape info
  # on a single speech page goes here

}
```

---

## `scrape_speech()`

.pull-left-wide[
.small[
```r
scrape_speech <- function(url) {

  speech_page <- read_html(url)

  title <- speech_page %>%
    html_node(".article-header__title") %>%
    html_text()

  date <- speech_page %>%
    html_node(".content-data__list:nth-child(1) strong") %>%
    html_text() %>%
    dmy()

  location <- speech_page %>%
    html_node(".content-data__list+ .content-data__list strong") %>%
    html_text()

  abstract <- speech_page %>%
    html_node(".leader--first-para p") %>%
    html_text()

  text <- speech_page %>%
    html_nodes("#preamble p") %>%
    html_text() %>%
    list()

  tibble(
    title = title, date = date, location = location,
    abstract = abstract, text = text, url = url
  )
}
```
]
]

---

## Function in action

```r
scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-26-october/") %>%
  glimpse()
```

```
## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon, and thanks for joining us.…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…
```

---

## Function in action

```r
scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-23-october/") %>%
  glimpse()
```

```
## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon everyone. Thank you very mu…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…
```

---

## Function in action

```r
scrape_speech(url = "https://www.gov.scot/publications/coronavirus-covid-19-update-first-ministers-speech-22-october/") %>%
  glimpse()
```

```
## Rows: 1
## Columns: 6
## $ title    <chr> NA
## $ date     <date> NA
## $ location <chr> NA
## $ abstract <chr> NA
## $ text     <list> <"\nGood afternoon, let me start as usual with…
## $ url      <chr> "https://www.gov.scot/publications/coronaviru…
```

---

class: middle

# Writing functions

---

## What goes in / what comes out?

.pull-left-wide[
- Functions take input(s) defined in the function definition

```r
function([inputs separated by commas]){
  # what to do with those inputs
}
```

- By default they return the last value computed in the function

```r
scrape_page <- function(x){
  # do a bunch of stuff with the input...

  # return a tibble
  tibble(...)
}
```

- You can define more outputs to be returned in a list as well as nice print methods (but we won't go there for now...)
]

---

.question[
What is going on here?
]

```r
add_2 <- function(x){
  x + 2
  1000
}
```

```r
add_2(3)
```

```
## [1] 1000
```

```r
add_2(10)
```

```
## [1] 1000
```

---

## Naming functions

> "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

---

## Naming functions

- Names should be short but clearly evoke what the function does

--

- Names should be verbs, not nouns

--

- Multi-word names should be separated by underscores (`snake_case` as opposed to `camelCase`)

--

- A family of functions should be named similarly (`scrape_page()`, `scrape_speech()` OR `str_remove()`, `str_replace()` etc.)

--

- Avoid overwriting existing (especially widely used) functions

```r
# JUST DON'T
mean <- function(x){
  x * 3
}
```
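
---

## Sketch: pick a new name instead of masking `mean()`

The naming advice can be sketched as follows. This is a minimal illustration rather than part of the original deck: `triple()` is a hypothetical name chosen here, and the inputs are made up. It also relies on the rule that a function returns the last value it computes.

```r
# `triple()` is a hypothetical, illustrative name (an assumption, not from the slides).
# A descriptive verb keeps base::mean() available under its usual name.
triple <- function(x) {
  x * 3  # last expression evaluated, so this is what the function returns
}

tripled <- triple(c(1, 2, 3))
tripled

# mean() has not been overwritten, so it still averages its input
average <- mean(c(1, 2, 3))
average

# If mean() had already been redefined in the global environment,
# base::mean() would still reach the original, and rm(mean) would remove the mask.
base::mean(c(1, 2, 3))
```

Giving the function its own name means existing code that calls `mean()` keeps behaving as expected.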