class: center, middle, inverse, title-slide

.title[
# Data types
]
.subtitle[
## Data Science in a Box
]
.author[
### datasciencebox.org
]

---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents for their name and number of cats. The instructions said to enter the number of cats as a numerical value.

```r
library(tidyverse)  # provides read_csv(), %>%, summarise(), and glimpse()
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 × 3
##   name           number_of_cats handedness
##   <chr>          <chr>          <chr>
## 1 Bernice Warren 0              left
## 2 Woodrow Stone  0              left
## 3 Willie Bass    1              left
## 4 Tyrone Estrada 3              left
## 5 Alex Daniels   3              left
## 6 Jane Bates     2              left
## # … with 54 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in mean.default(number_of_cats): argument is not numeric
## or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

```r
?mean
```

<img src="img/mean-help.png" width="75%" style="display: block; margin: auto;" />

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning in mean.default(number_of_cats, na.rm = TRUE): argument
## is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

.question[
What is the type of the `number_of_cats` variable?
]

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## Let's take another look

.small[
]

---

## Sometimes you might need to babysit your respondents

.midi[
```r
# as.numeric() is evaluated on the whole column, so the text entries
# still trigger a coercion warning even though they are overwritten
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced
## by coercion
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## You always need to respect data types

```r
# Keep every branch of case_when() character, then convert once at the end
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story

- If your data does not behave the way you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.
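
---

## Finding the entries that won't convert

One way to carry out the "investigate your data" step is sketched below. It is a minimal base R example on a made-up vector (the values and the names `responses`, `converted`, and `problem_entries` are hypothetical, not taken from `cat-lovers.csv`): attempt the conversion, then look for values that were present before but are `NA` afterwards.

```r
# Hypothetical survey responses, stored as character like number_of_cats
responses <- c("0", "1", "three", "2", "two cats")

# Try the conversion; text that is not a number becomes NA.
# suppressWarnings() only hides the "NAs introduced by coercion" warning.
converted <- suppressWarnings(as.numeric(responses))

# Entries that were not missing before but are NA after converting
problem_entries <- responses[!is.na(responses) & is.na(converted)]
problem_entries  # "three" and "two cats"
```

Once the offending entries are known, they can be recoded as in the `case_when()` example before converting the whole column.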
---

class: middle

.hand[.light-blue[now that we have a good motivation for]]

.hand[.light-blue[learning about data types in R]]

<br>

.large[
.hand[.light-blue[let's learn about data types in R!]]
]

---

class: middle

# Data types

---

## Data types in R

- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those

---

## Logical & character

.pull-left[
**logical** - boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]
.pull-right[
**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

.pull-left[
**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Converting between types

.hand[with intention...]

.pull-left[
```r
x <- 1:3
x
```

```
## [1] 1 2 3
```

```r
typeof(x)
```

```
## [1] "integer"
```
]
--
.pull-right[
```r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

```r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types

.hand[with intention...]

.pull-left[
```r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

```r
typeof(x)
```

```
## [1] "logical"
```
]
--
.pull-right[
```r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

```r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types

.hand[without intention...]

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!

.pull-left[
```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]
--
.pull-right[
```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

```r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]

---

## Explicit vs. implicit coercion

Let's give formal names to what we've seen so far:

--

- **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`

--

- **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector

---

.midi[
.your-turn[
### .hand[Your turn!]

- RStudio Cloud > `AE 05 - Hotels + Data types` > open `type-coercion.Rmd` and knit.
- What is the type of the given vectors? First, guess. Then, try it out in R. If your guess was correct, great! If not, discuss why they have that type.
]
]

--

.small[
**Example:** Suppose we want to know the type of `c(1, "a")`. First, I'd look at:

.pull-left[
```r
typeof(1)
```

```
## [1] "double"
```
]
.pull-right[
```r
typeof("a")
```

```
## [1] "character"
```
]

and make a guess based on these. Then finally I'd check:

.pull-left[
```r
typeof(c(1, "a"))
```

```
## [1] "character"
```
]
]

---

class: middle

# Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

--

.pull-left[
```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```
]
.pull-right[
```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]

---

## `NA`s are special ❄️s

```r
x <- c(1, 2, 3, 4, NA)
```

```r
mean(x)
```

```
## [1] NA
```

```r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

```r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

## `NA`s are logical

R uses `NA` to represent missing values in its data structures.

```r
typeof(NA)
```

```
## [1] "logical"
```

---

## Mental model for `NA`s

- Unlike `NaN`, `NA`s are genuinely unknown values
- But that doesn't mean they can't function in a logical way
- Let's think about why `NA`s are logical...

--

.question[
Why do the following give different answers?
]

.pull-left[
```r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```
]
.pull-right[
```r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]

`\(\rightarrow\)` See next slide for answers...

---

- `NA` is unknown, so it could be `TRUE` or `FALSE`

.pull-left[
.midi[
- `TRUE | NA`

```r
TRUE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
TRUE | FALSE # if NA was FALSE
```

```
## [1] TRUE
```
]
]
.pull-right[
.midi[
- `FALSE | NA`

```r
FALSE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
FALSE | FALSE # if NA was FALSE
```

```
## [1] FALSE
```
]
]

- Doesn't make sense for mathematical operations
- Makes sense in the context of missing data
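
---

## More `NA` behaviour to try

The "could be `TRUE` or `FALSE`" reasoning can be tried on `&` as well, and a logical `NA` takes on the type of the values it is combined with. The following is a small sketch in base R; the variable names are only there for illustration.

```r
# AND follows the same "what if NA were TRUE / FALSE?" reasoning as OR
and_true  <- TRUE & NA    # NA: the answer depends on the unknown value
and_false <- FALSE & NA   # FALSE: FALSE & anything is FALSE

# A logical NA is coerced to the type of the values around it
type_with_numbers <- typeof(c(1, NA))     # "double"
type_with_strings <- typeof(c("a", NA))   # "character"

# Comparisons and arithmetic with NA stay unknown
is_bigger <- NA > 1   # NA
plus_one  <- NA + 1   # NA
```

Because logical sits at the bottom of R's coercion order, a plain `NA` can be placed in a vector of any other type without changing that vector's type.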