+ - 0:00:00
Notes for current slide
Notes for next slide

Importing data



Data Science in a Box

1 / 32

Reading rectangular data into R

2 / 32

readr

  • read_csv() - comma delimited files
  • read_csv2() - semicolon separated files (common in countries where , is used as the decimal place)
  • read_tsv() - tab delimited files
  • read_delim() - reads in files with any delimiter
  • read_fwf() - fixed width files
  • ...
4 / 32

readr

  • read_csv() - comma delimited files
  • read_csv2() - semicolon separated files (common in countries where , is used as the decimal place)
  • read_tsv() - tab delimited files
  • read_delim() - reads in files with any delimiter
  • read_fwf() - fixed width files
  • ...

readxl

  • read_excel() - read xls or xlsx files
  • ...
4 / 32

Reading data

nobel <- read_csv(file = "data/nobel.csv")
nobel
## # A tibble: 935 × 26
## id firstname surname year category affiliation city
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 Wilhelm Conrad Röntgen 1901 Physics Munich Uni… Muni…
## 2 2 Hendrik A. Lorentz 1902 Physics Leiden Uni… Leid…
## 3 3 Pieter Zeeman 1902 Physics Amsterdam … Amst…
## 4 4 Henri Becquerel 1903 Physics École Poly… Paris
## 5 5 Pierre Curie 1903 Physics École muni… Paris
## 6 6 Marie Curie 1903 Physics <NA> <NA>
## # … with 929 more rows, and 19 more variables: country <chr>,
## # born_date <date>, died_date <date>, gender <chr>,
## # born_city <chr>, born_country <chr>,
## # born_country_code <chr>, died_city <chr>,
## # died_country <chr>, died_country_code <chr>,
## # overall_motivation <chr>, share <dbl>, motivation <chr>,
## # born_country_original <chr>, born_city_original <chr>, …
5 / 32

Writing data

  • Write a file
df <- tribble(
~x, ~y,
1, "a",
2, "b",
3, "c"
)
write_csv(df, file = "data/df.csv")
6 / 32

Writing data

  • Write a file
df <- tribble(
~x, ~y,
1, "a",
2, "b",
3, "c"
)
write_csv(df, file = "data/df.csv")
  • Read it back in to inspect
read_csv("data/df.csv")
## # A tibble: 3 × 2
## x y
## <dbl> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
6 / 32

Your turn!

  • RStudio Cloud > AE 06 - Nobels and sales + Data import > open nobels-csv.Rmd and knit.
  • Read in the nobels.csv file from the data-raw/ folder.
  • Split into two (STEM and non-STEM):
    • Create a new data frame, nobel_stem, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics).
    • Create another data frame, nobel_nonstem, that filters for the remaining fields.
  • Write out the two data frames to nobel-stem.csv and nobel-nonstem.csv, respectively, to data/.

Hint: Use the %in% operator when filter()ing.

7 / 32

Variable names

8 / 32

Data with bad names

edibnb_badnames <- read_csv("data/edibnb-badnames.csv")
names(edibnb_badnames)
## [1] "ID" "Price"
## [3] "neighbourhood" "accommodates"
## [5] "Number of bathrooms" "Number of Bedrooms"
## [7] "n beds" "Review Scores Rating"
## [9] "Number of reviews" "listing_url"
9 / 32

Data with bad names

edibnb_badnames <- read_csv("data/edibnb-badnames.csv")
names(edibnb_badnames)
## [1] "ID" "Price"
## [3] "neighbourhood" "accommodates"
## [5] "Number of bathrooms" "Number of Bedrooms"
## [7] "n beds" "Review Scores Rating"
## [9] "Number of reviews" "listing_url"

... but R doesn't allow spaces in variable names

ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) +
geom_point()
## Error: <text>:1:40: unexpected symbol
## 1: ggplot(edibnb_badnames, aes(x = Number of
## ^
9 / 32

Option 1 - Define column names

edibnb_col_names <- read_csv("data/edibnb-badnames.csv",
col_names = c("id", "price",
"neighbourhood", "accommodates",
"bathroom", "bedroom",
"bed", "review_scores_rating",
"n_reviews", "url"))
names(edibnb_col_names)
## [1] "id" "price"
## [3] "neighbourhood" "accommodates"
## [5] "bathroom" "bedroom"
## [7] "bed" "review_scores_rating"
## [9] "n_reviews" "url"
10 / 32

Option 2 - Format text to snake_case

edibnb_clean_names <- read_csv("data/edibnb-badnames.csv") %>%
janitor::clean_names()
names(edibnb_clean_names)
## [1] "id" "price"
## [3] "neighbourhood" "accommodates"
## [5] "number_of_bathrooms" "number_of_bedrooms"
## [7] "n_beds" "review_scores_rating"
## [9] "number_of_reviews" "listing_url"
11 / 32

Variable types

12 / 32

Which type is x? Why?

read_csv("data/df-na.csv")
## # A tibble: 9 × 3
## x y z
## <chr> <chr> <chr>
## 1 1 a hi
## 2 <NA> b hello
## 3 3 Not applicable 9999
## 4 4 d ola
## 5 5 e hola
## 6 . f whatup
## 7 7 g wassup
## 8 8 h sup
## 9 9 i <NA>
13 / 32

Option 1. Explicit NAs

read_csv("data/df-na.csv",
na = c("", "NA", ".", "9999", "Not applicable"))

## # A tibble: 9 × 3
## x y z
## <dbl> <chr> <chr>
## 1 1 a hi
## 2 NA b hello
## 3 3 <NA> <NA>
## 4 4 d ola
## 5 5 e hola
## 6 NA f whatup
## 7 7 g wassup
## 8 8 h sup
## 9 9 i <NA>
14 / 32

Option 2. Specify column types

read_csv("data/df-na.csv", col_types = list(col_double(),
col_character(),
col_character()))
## Warning: One or more parsing issues, see `problems()` for details
## # A tibble: 9 × 3
## x y z
## <dbl> <chr> <chr>
## 1 1 a hi
## 2 NA b hello
## 3 3 Not applicable 9999
## 4 4 d ola
## 5 5 e hola
## 6 NA f whatup
## 7 7 g wassup
## 8 8 h sup
## 9 9 i <NA>
15 / 32

Column types

type function data type
col_character() character
col_date() date
col_datetime() POSIXct (date-time)
col_double() double (numeric)
col_factor() factor
col_guess() let readr guess (default)
col_integer() integer
col_logical() logical
col_number() numbers mixed with non-number characters
col_numeric() double or integer
col_skip() do not read
col_time() time
16 / 32

Wondering where you remember these from?

read_csv("data/df-na.csv")
## Rows: 9 Columns: 3
## ── Column specification ─────────────────────────────────────────
## Delimiter: ","
## chr (3): x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 9 × 3
## x y z
## <chr> <chr> <chr>
## 1 1 a hi
## 2 <NA> b hello
## 3 3 Not applicable 9999
## 4 4 d ola
...
17 / 32

Case study: Favourite foods

18 / 32

Favourite foods

19 / 32

Favourite foods

fav_food <- read_excel("data/favourite-food.xlsx")
fav_food
## # A tibble: 5 × 6
## `Student ID` `Full Name` favourite.food mealPlan AGE SES
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yo… Lunch o… 4 High
## 2 2 Barclay Lynn French fries Lunch o… 5 Midd…
## 3 3 Jayendra Lyne N/A Breakfa… 7 Low
## 4 4 Leon Rossini Anchovies Lunch o… 99999 Midd…
## 5 5 Chidiegwu Dun… Pizza Breakfa… five High
19 / 32

Variable names

20 / 32

Variable names

fav_food <- read_excel("data/favourite-food.xlsx") %>%
janitor::clean_names()
fav_food
## # A tibble: 5 × 6
## student_id full_name favourite_food meal_plan age ses
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High
## 2 2 Barclay Lynn French fries Lunch on… 5 Midd…
## 3 3 Jayendra Lyne N/A Breakfas… 7 Low
## 4 4 Leon Rossini Anchovies Lunch on… 99999 Midd…
## 5 5 Chidiegwu Dunk… Pizza Breakfas… five High
20 / 32

Handling NAs

21 / 32

Handling NAs

fav_food <- read_excel("data/favourite-food.xlsx",
na = c("N/A", "99999")) %>%
janitor::clean_names()
fav_food
## # A tibble: 5 × 6
## student_id full_name favourite_food meal_plan age ses
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High
## 2 2 Barclay Lynn French fries Lunch on… 5 Midd…
## 3 3 Jayendra Lyne <NA> Breakfas… 7 Low
## 4 4 Leon Rossini Anchovies Lunch on… <NA> Midd…
## 5 5 Chidiegwu Dunk… Pizza Breakfas… five High
21 / 32

Make age numeric

fav_food <- fav_food %>%
mutate(
age = if_else(age == "five", "5", age),
age = as.numeric(age)
)
glimpse(fav_food)
## Rows: 5
## Columns: 6
## $ student_id <dbl> 1, 2, 3, 4, 5
## $ full_name <chr> "Sunil Huffmann", "Barclay Lynn", "Jayen…
## $ favourite_food <chr> "Strawberry yoghurt", "French fries", NA…
## $ meal_plan <chr> "Lunch only", "Lunch only", "Breakfast a…
## $ age <dbl> 4, 5, 7, NA, 5
## $ ses <chr> "High", "Middle", "Low", "Middle", "High"

22 / 32

Socio-economic status

What order are the levels of ses listed in?

fav_food %>%
count(ses)
## # A tibble: 3 × 2
## ses n
## <chr> <int>
## 1 High 2
## 2 Low 1
## 3 Middle 2

23 / 32

Make ses factor

fav_food <- fav_food %>%
mutate(ses = fct_relevel(ses, "Low", "Middle", "High"))
fav_food %>%
count(ses)
## # A tibble: 3 × 2
## ses n
## <fct> <int>
## 1 Low 1
## 2 Middle 2
## 3 High 2
24 / 32

Putting it altogether

fav_food <- read_excel("data/favourite-food.xlsx", na = c("N/A", "99999")) %>%
janitor::clean_names() %>%
mutate(
age = if_else(age == "five", "5", age),
age = as.numeric(age),
ses = fct_relevel(ses, "Low", "Middle", "High")
)
fav_food
## # A tibble: 5 × 6
## student_id full_name favourite_food meal_plan age ses
## <dbl> <chr> <chr> <chr> <dbl> <fct>
## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High
## 2 2 Barclay Lynn French fries Lunch on… 5 Midd…
## 3 3 Jayendra Lyne <NA> Breakfas… 7 Low
## 4 4 Leon Rossini Anchovies Lunch on… NA Midd…
## 5 5 Chidiegwu Dunk… Pizza Breakfas… 5 High
25 / 32

Out and back in

write_csv(fav_food, file = "data/fav-food-clean.csv")
fav_food_clean <- read_csv("data/fav-food-clean.csv")
26 / 32

What happened to ses again?

fav_food_clean %>%
count(ses)
## # A tibble: 3 × 2
## ses n
## <chr> <int>
## 1 High 2
## 2 Low 1
## 3 Middle 2
27 / 32

read_rds() and write_rds()

  • CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to.
  • An alternative is RDS files, you can read and write them with read_rds() and write_rds(), respectively.
read_rds(path)
write_rds(x, path)
28 / 32

Out and back in, take 2

write_rds(fav_food, file = "data/fav-food-clean.rds")
fav_food_clean <- read_rds("data/fav-food-clean.rds")
fav_food_clean %>%
count(ses)
## # A tibble: 3 × 2
## ses n
## <fct> <int>
## 1 Low 1
## 2 Middle 2
## 3 High 2
29 / 32

Other types of data

30 / 32

Other types of data

  • googlesheets4: Google Sheets
  • haven: SPSS, Stata, and SAS files
  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
  • jsonline: JSON
  • xml2: xml
  • rvest: web scraping
  • httr: web APIs
  • sparklyr: data loaded into spark
31 / 32

Your turn!

  • RStudio Cloud > AE 06 - Nobels and sales + Data import > sales-excel.Rmd.
  • Load the sales.xlsx file from the data-raw/ folder, using appropriate arguments for the read_excel() function such that it looks like the output on the left.
  • Stretch goal: Manipulate the sales data such that it looks like the output on the right.
## # A tibble: 9 × 2
## id n
## <chr> <chr>
## 1 Brand 1 n
## 2 1234 8
## 3 8721 2
## 4 1822 3
## 5 Brand 2 n
## 6 3333 1
## # … with 3 more rows
## # A tibble: 7 × 3
## brand id n
## <chr> <dbl> <dbl>
## 1 Brand 1 1234 8
## 2 Brand 1 8721 2
## 3 Brand 1 1822 3
## 4 Brand 2 3333 1
## 5 Brand 2 2156 3
## 6 Brand 2 3987 6
## # … with 1 more row
32 / 32

Reading rectangular data into R

2 / 32
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow