class: center, middle, inverse, title-slide .title[ # Data Importing & Cleaning ] .subtitle[ ## STAT 7500 ] .author[ ### Katie Fitzgerald, adpated from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a> </span> </div> --- class: middle # Why should you care about data types? --- ## Example: Cat lovers A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value. ``` r cat_lovers <- read_csv("data/cat-lovers.csv") ``` ``` ## # A tibble: 60 × 3 ## name number_of_cats handedness ## <chr> <chr> <chr> ## 1 Bernice Warren 0 left ## 2 Woodrow Stone 0 left ## 3 Willie Bass 1 left ## 4 Tyrone Estrada 3 left ## 5 Alex Daniels 3 left ## 6 Jane Bates 2 left ## # ℹ 54 more rows ``` --- ## Oh why won't you work?! ``` r cat_lovers |> summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ``` r ?mean ``` <img src="img/mean-help.png" width="75%" style="display: block; margin: auto;" /> --- ## Oh why won't you still work??!! ``` r cat_lovers |> summarise(mean_cats = mean(number_of_cats, na.rm = TRUE)) ``` ``` ## Warning: There was 1 warning in `summarise()`. ## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`. ## Caused by warning in `mean.default()`: ## ! argument is not numeric or logical: returning NA ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 NA ``` --- ## Take a breath and look at your data .question[ What is the type of the `number_of_cats` variable? ] ``` r glimpse(cat_lovers) ``` ``` ## Rows: 60 ## Columns: 3 ## $ name <chr> "Bernice Warren", "Woodrow Stone", "Will… ## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", … ## $ handedness <chr> "left", "left", "left", "left", "left", … ``` --- ## Let's take another look .small[
] --- ## Sometimes you might need to babysit your respondents .midi[ ``` r cat_lovers |> mutate(number_of_cats = case_when( name == "Ginger Clark" ~ 2, name == "Doug Bass" ~ 3, TRUE ~ as.numeric(number_of_cats) )) |> summarise(mean_cats = mean(number_of_cats)) ``` ``` ## Warning: There was 1 warning in `mutate()`. ## ℹ In argument: `number_of_cats = case_when(...)`. ## Caused by warning: ## ! NAs introduced by coercion ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` ] --- ## Always you need to respect data types ``` r cat_lovers |> mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) |> summarise(mean_cats = mean(number_of_cats)) ``` ``` ## # A tibble: 1 × 1 ## mean_cats ## <dbl> ## 1 0.833 ``` --- ## Now that we know what we're doing... ``` r *cat_lovers <- cat_lovers |> mutate( number_of_cats = case_when( name == "Ginger Clark" ~ "2", name == "Doug Bass" ~ "3", TRUE ~ number_of_cats ), number_of_cats = as.numeric(number_of_cats) ) ``` --- ## Moral of the story - If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason. - Go in and investigate your data, apply the fix, *save your data*, live happily ever after. --- class: middle .light-blue[now that we have a good motivation for] .light-blue[learning about data types in R] <br> .large[ .hand[.light-blue[let's learn about data types in R!]] ] --- class: middle # Data types --- ## Data types in R - **logical** - **double** - **integer** - **character** - and some more, but we won't be focusing on those --- ## Logical & character .pull-left[ **logical** - boolean values `TRUE` and `FALSE` ``` r typeof(TRUE) ``` ``` ## [1] "logical" ``` ] .pull-right[ **character** - character strings ``` r typeof("hello") ``` ``` ## [1] "character" ``` ] --- ## Double & integer .pull-left[ **double** - floating point numerical values (default numerical type) ``` r typeof(1.335) ``` ``` ## [1] "double" ``` ``` r typeof(7) ``` ``` ## [1] "double" ``` ] .pull-right[ **integer** - integer numerical values (indicated with an `L`) ``` r typeof(7L) ``` ``` ## [1] "integer" ``` ``` r typeof(1:3) ``` ``` ## [1] "integer" ``` ] --- ## Converting between types .hand[with intention...] .pull-left[ ``` r x <- 1:3 x ``` ``` ## [1] 1 2 3 ``` ``` r typeof(x) ``` ``` ## [1] "integer" ``` ] -- .pull-right[ ``` r y <- as.character(x) y ``` ``` ## [1] "1" "2" "3" ``` ``` r typeof(y) ``` ``` ## [1] "character" ``` ] --- ## Converting between types .hand[with intention...] .pull-left[ ``` r x <- c(TRUE, FALSE) x ``` ``` ## [1] TRUE FALSE ``` ``` r typeof(x) ``` ``` ## [1] "logical" ``` ] -- .pull-right[ ``` r y <- as.numeric(x) y ``` ``` ## [1] 1 0 ``` ``` r typeof(y) ``` ``` ## [1] "double" ``` ] --- ## Converting between types .hand[without intention...] R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing! .pull-left[ ``` r c(1, "Hello") ``` ``` ## [1] "1" "Hello" ``` ``` r c(FALSE, 3L) ``` ``` ## [1] 0 3 ``` ] -- .pull-right[ ``` r c(1.2, 3L) ``` ``` ## [1] 1.2 3.0 ``` ``` r c(2L, "two") ``` ``` ## [1] "2" "two" ``` ] --- ## Explicit vs. implicit coercion Let's give formal names to what we've seen so far: -- - **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()` -- - **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector --- class: middle # Special values --- ## Special values - `NA`: Not available - `NaN`: Not a number - `Inf`: Positive infinity - `-Inf`: Negative infinity -- .pull-left[ ``` r pi / 0 ``` ``` ## [1] Inf ``` ``` r 0 / 0 ``` ``` ## [1] NaN ``` ] .pull-right[ ``` r 1/0 - 1/0 ``` ``` ## [1] NaN ``` ``` r 1/0 + 1/0 ``` ``` ## [1] Inf ``` ] <!-- --- --> --- ## `NA`s are special ❄️s Unlike `NaN`, `NA`s are genuinely unknown values. R uses `NA` to represent missing values in its data structures. ``` r x <- c(1, 2, 3, 4, NA) ``` ``` r mean(x) ``` ``` ## [1] NA ``` ``` r mean(x, na.rm = TRUE) ``` ``` ## [1] 2.5 ``` ``` r summary(x) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.00 1.75 2.50 2.50 3.25 4.00 1 ``` --- class: middle # Data classes --- ## Data classes We talked about *types* so far, next we'll introduce the concept of *classes* - Vectors are like Lego building blocks -- - We stick them together to build more complicated constructs, e.g. *representations of data* -- - The **class** attribute relates to the S3 class of an object which determines its behaviour - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious -- - Examples: factors, dates, and data frames --- ## Factors R uses factors to handle categorical variables, variables that have a fixed and known set of possible values ``` r x <- factor(c("BS", "MS", "PhD", "MS")) x ``` ``` ## [1] BS MS PhD MS ## Levels: BS MS PhD ``` -- .pull-left[ ``` r typeof(x) ``` ``` ## [1] "integer" ``` ] .pull-right[ ``` r class(x) ``` ``` ## [1] "factor" ``` ] --- ## More on factors We can think of factors like character (level labels) and an integer (level numbers) glued together ``` r glimpse(x) ``` ``` ## Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2 ``` ``` r as.integer(x) ``` ``` ## [1] 1 2 3 2 ``` --- ## Dates ``` r y <- as.Date("2020-01-01") y ``` ``` ## [1] "2020-01-01" ``` ``` r typeof(y) ``` ``` ## [1] "double" ``` ``` r class(y) ``` ``` ## [1] "Date" ``` --- ## More on dates We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together ``` r as.integer(y) ``` ``` ## [1] 18262 ``` ``` r as.integer(y) / 365 # roughly 50 yrs ``` ``` ## [1] 50.03288 ``` --- ## Data frames We can think of data frames like like vectors of equal length glued together ``` r df <- data.frame(x = 1:2, y = 3:4) df ``` ``` ## x y ## 1 1 3 ## 2 2 4 ``` .pull-left[ ``` r typeof(df) ``` ``` ## [1] "list" ``` ] .pull-right[ ``` r class(df) ``` ``` ## [1] "data.frame" ``` ] --- ## Lists Lists are a generic vector container; vectors of any type can go in them ``` r l <- list( x = 1:4, y = c("hi", "hello", "jello"), z = c(TRUE, FALSE) ) l ``` ``` ## $x ## [1] 1 2 3 4 ## ## $y ## [1] "hi" "hello" "jello" ## ## $z ## [1] TRUE FALSE ``` --- ## Lists and data frames - A data frame is a special list containing vectors of equal length - When we use the `pull()` function, we extract a vector from the data frame ``` r df ``` ``` ## x y ## 1 1 3 ## 2 2 4 ``` ``` r df |> pull(y) ``` ``` ## [1] 3 4 ``` --- class: middle # Working with factors --- ## Read data in as character strings ``` r glimpse(cat_lovers) ``` ``` ## Rows: 60 ## Columns: 3 ## $ name <chr> "Bernice Warren", "Woodrow Stone", "Will… ## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", … ## $ handedness <chr> "left", "left", "left", "left", "left", … ``` --- ## But coerce when plotting ``` r ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Use forcats to manipulate factors ``` r cat_lovers |> * mutate(handedness = fct_infreq(handedness)) |> ggplot(mapping = aes(x = handedness)) + geom_bar() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-41-1.png" width="55%" style="display: block; margin: auto;" /> --- ## Come for the functionality .pull-left[ ... stay for the logo ] .pull-right[ <img src="img/forcats-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" /> ] .pull-left-wide[ - Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display - They are also useful in modeling scenarios - The **forcats** package provides a suite of useful tools that solve common problems with factors ] --- class: middle # Working with dates --- ## Make a date .pull-left[ <img src="img/lubridate-not-part-of-tidyverse.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ - **lubridate** is the tidyverse-friendly package that makes dealing with dates a little easier - It's not one of the *core* tidyverse packages, hence it's installed with `install.packages("tidyverse)` but it's not loaded with it, and needs to be explicitly loaded with `library(lubridate)` ] --- class: middle .hand[.light-blue[ we're just going to scratch the surface of working with dates in R here... ]] --- .question[ Calculate and visualise the number of bookings on any given arrival date. ] ``` r hotels |> select(starts_with("arrival_")) ``` ``` ## # A tibble: 119,390 × 4 ## arrival_date_year arrival_date_month arrival_date_week_number ## <dbl> <chr> <dbl> ## 1 2015 July 27 ## 2 2015 July 27 ## 3 2015 July 27 ## 4 2015 July 27 ## 5 2015 July 27 ## 6 2015 July 27 ## # ℹ 119,384 more rows ## # ℹ 1 more variable: arrival_date_day_of_month <dbl> ``` --- ## Step 1. Construct dates .midi[ ``` r library(glue) hotels |> mutate( * arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}") ) |> relocate(arrival_date) ``` ``` ## # A tibble: 119,390 × 33 ## arrival_date hotel is_canceled lead_time arrival_date_year ## <glue> <chr> <dbl> <dbl> <dbl> ## 1 2015 July 1 Resort Ho… 0 342 2015 ## 2 2015 July 1 Resort Ho… 0 737 2015 ## 3 2015 July 1 Resort Ho… 0 7 2015 ## 4 2015 July 1 Resort Ho… 0 13 2015 ... ``` ] --- ## Step 2. Count bookings per date .midi[ ``` r hotels |> mutate(arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")) |> count(arrival_date) ``` ``` ## # A tibble: 793 × 2 ## arrival_date n ## <glue> <int> ## 1 2015 August 1 110 ## 2 2015 August 10 207 ## 3 2015 August 11 117 ## 4 2015 August 12 133 ## 5 2015 August 13 107 ## 6 2015 August 14 329 ## # ℹ 787 more rows ``` ] --- ## Step 3. Visualise bookings per date .midi[ ``` r hotels |> mutate(arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")) |> count(arrival_date) |> ggplot(aes(x = arrival_date, y = n, group = 1)) + geom_line() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-49-1.png" width="80%" style="display: block; margin: auto;" /> ] --- .hand[zooming in a bit...] .question[ Why does the plot start with August when we know our data start in July? And why does 10 August come after 1 August? ] .midi[ <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-50-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Step 1. *REVISED* Construct dates "as dates" .midi[ ``` r library(lubridate) hotels |> mutate( * arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")) ) |> relocate(arrival_date) ``` ``` ## # A tibble: 119,390 × 33 ## arrival_date hotel is_canceled lead_time arrival_date_year ## <date> <chr> <dbl> <dbl> <dbl> ## 1 2015-07-01 Resort Ho… 0 342 2015 ## 2 2015-07-01 Resort Ho… 0 737 2015 ## 3 2015-07-01 Resort Ho… 0 7 2015 ## 4 2015-07-01 Resort Ho… 0 13 2015 ... ``` ] --- ## Step 2. Count bookings per date .midi[ ``` r hotels |> mutate(arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))) |> count(arrival_date) ``` ``` ## # A tibble: 793 × 2 ## arrival_date n ## <date> <int> ## 1 2015-07-01 122 ## 2 2015-07-02 93 ## 3 2015-07-03 56 ## 4 2015-07-04 88 ## 5 2015-07-05 53 ## 6 2015-07-06 75 ## # ℹ 787 more rows ``` ] --- ## Step 3a. Visualise bookings per date .midi[ ``` r hotels |> mutate(arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))) |> count(arrival_date) |> ggplot(aes(x = arrival_date, y = n, group = 1)) + geom_line() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-53-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Step 3b. Visualise using a smooth curve .midi[ ``` r hotels |> mutate(arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))) |> count(arrival_date) |> ggplot(aes(x = arrival_date, y = n, group = 1)) + * geom_smooth() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-54-1.png" width="80%" style="display: block; margin: auto;" /> ] --- class: middle # Reading rectangular data into R --- class: middle .pull-left[ <img src="img/readr.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/readxl.png" width="80%" style="display: block; margin: auto;" /> ] --- .pull-left[ ## readr - `read_csv()` - comma delimited files - `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place) - `read_tsv()` - tab delimited files - `read_delim()` - reads in files with any delimiter - `read_fwf()` - fixed width files - ... ] -- .pull-right[ ## readxl - `read_excel()` - read xls or xlsx files - ... ] --- ## Reading data ``` r nobel <- read_csv(file = "data/nobel.csv") nobel ``` ``` ## # A tibble: 935 × 26 ## id firstname surname year category affiliation city ## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 1 Wilhelm Conrad Röntgen 1901 Physics Munich Uni… Muni… ## 2 2 Hendrik A. Lorentz 1902 Physics Leiden Uni… Leid… ## 3 3 Pieter Zeeman 1902 Physics Amsterdam … Amst… ## 4 4 Henri Becquerel 1903 Physics École Poly… Paris ## 5 5 Pierre Curie 1903 Physics École muni… Paris ## 6 6 Marie Curie 1903 Physics <NA> <NA> ## # ℹ 929 more rows ## # ℹ 19 more variables: country <chr>, born_date <date>, ## # died_date <date>, gender <chr>, born_city <chr>, ## # born_country <chr>, born_country_code <chr>, ## # died_city <chr>, died_country <chr>, ## # died_country_code <chr>, overall_motivation <chr>, ## # share <dbl>, motivation <chr>, … ``` --- ## Writing data .pull-left[ - Write a file ``` r df <- tribble( ~x, ~y, 1, "a", 2, "b", 3, "c" ) write_csv(df, file = "data/df.csv") ``` ] -- .pull-right[ - Read it back in to inspect ``` r read_csv("data/df.csv") ``` ``` ## # A tibble: 3 × 2 ## x y ## <dbl> <chr> ## 1 1 a ## 2 2 b ## 3 3 c ``` ] --- class: middle # Variable names --- ## Data with bad names ``` r edibnb_badnames <- read_csv("data/edibnb-badnames.csv") names(edibnb_badnames) ``` ``` ## [1] "ID" "Price" ## [3] "neighbourhood" "accommodates" ## [5] "Number of bathrooms" "Number of Bedrooms" ## [7] "n beds" "Review Scores Rating" ## [9] "Number of reviews" "listing_url" ``` -- ... but R doesn't allow spaces in variable names ``` r ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) + geom_point() ``` ``` ## Error in parse(text = input): <text>:1:40: unexpected symbol ## 1: ggplot(edibnb_badnames, aes(x = Number of ## ^ ``` --- ## Option 1 - Define column names .small[ ``` r edibnb_col_names <- read_csv("data/edibnb-badnames.csv", col_names = c("id", "price", "neighbourhood", "accommodates", "bathroom", "bedroom", "bed", "review_scores_rating", "n_reviews", "url")) names(edibnb_col_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "bathroom" "bedroom" ## [7] "bed" "review_scores_rating" ## [9] "n_reviews" "url" ``` ] --- ## Option 2 - Format text to snake_case ``` r edibnb_clean_names <- read_csv("data/edibnb-badnames.csv") |> janitor::clean_names() names(edibnb_clean_names) ``` ``` ## [1] "id" "price" ## [3] "neighbourhood" "accommodates" ## [5] "number_of_bathrooms" "number_of_bedrooms" ## [7] "n_beds" "review_scores_rating" ## [9] "number_of_reviews" "listing_url" ``` --- class: middle # Variable types --- .question[ Which type is `x`? Why? ] .pull-left[ <img src="img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` r read_csv("data/df-na.csv") ``` ``` ## # A tibble: 9 × 3 ## x y z ## <chr> <chr> <chr> ## 1 1 a hi ## 2 <NA> b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 . f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 1. Explicit NAs ``` r read_csv("data/df-na.csv", na = c("", "NA", ".", "9999", "Not applicable")) ``` .pull-left[ <img src="img/df-na.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 <NA> <NA> ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` ] --- ## Option 2. Specify column types ``` r read_csv("data/df-na.csv", col_types = list(col_double(), col_character(), col_character())) ``` ``` ## Warning: One or more parsing issues, call `problems()` on your data frame ## for details, e.g.: ## dat <- vroom(...) ## problems(dat) ``` ``` ## # A tibble: 9 × 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## 7 7 g wassup ## 8 8 h sup ## 9 9 i <NA> ``` --- ## Column types .small[ **type function** | **data type** ------------------ | ------------- `col_character()` | character `col_date()` | date `col_datetime()` | POSIXct (date-time) `col_double()` | double (numeric) `col_factor()` | factor `col_guess()` | let readr guess (default) `col_integer()` | integer `col_logical()` | logical `col_number()` | numbers mixed with non-number characters `col_numeric()` | double or integer `col_skip()` | do not read `col_time()` | time ] --- class:middle # Case study: Favourite foods --- ## Favourite foods <img src="img/fav-food/fav-food.png" width="60%" style="display: block; margin: auto;" /> -- ``` r *fav_food <- read_excel("data/favourite-food.xlsx") fav_food ``` ``` ## # A tibble: 5 × 6 ## `Student ID` `Full Name` favourite.food mealPlan AGE SES ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huffmann Strawberry yo… Lunch o… 4 High ## 2 2 Barclay Lynn French fries Lunch o… 5 Midd… ## 3 3 Jayendra Lyne N/A Breakfa… 7 Low ## 4 4 Leon Rossini Anchovies Lunch o… 99999 Midd… ## 5 5 Chidiegwu Dun… Pizza Breakfa… five High ``` --- ## Variable names <img src="img/fav-food/fav-food-names.png" width="60%" style="display: block; margin: auto;" /> -- ``` r fav_food <- read_excel("data/favourite-food.xlsx") |> * janitor::clean_names() fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High ## 2 2 Barclay Lynn French fries Lunch on… 5 Midd… ## 3 3 Jayendra Lyne N/A Breakfas… 7 Low ## 4 4 Leon Rossini Anchovies Lunch on… 99999 Midd… ## 5 5 Chidiegwu Dunk… Pizza Breakfas… five High ``` --- ## Handling NAs <img src="img/fav-food/fav-food-nas.png" width="60%" style="display: block; margin: auto;" /> -- ``` r fav_food <- read_excel("data/favourite-food.xlsx", * na = c("N/A", "99999")) |> janitor::clean_names() fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High ## 2 2 Barclay Lynn French fries Lunch on… 5 Midd… ## 3 3 Jayendra Lyne <NA> Breakfas… 7 Low ## 4 4 Leon Rossini Anchovies Lunch on… <NA> Midd… ## 5 5 Chidiegwu Dunk… Pizza Breakfas… five High ``` --- ## Make `age` numeric .pull-left-wide[ ``` r fav_food <- fav_food |> * mutate( * age = if_else(age == "five", "5", age), * age = as.numeric(age) * ) glimpse(fav_food) ``` ``` ## Rows: 5 ## Columns: 6 ## $ student_id <dbl> 1, 2, 3, 4, 5 ## $ full_name <chr> "Sunil Huffmann", "Barclay Lynn", "Jayen… ## $ favourite_food <chr> "Strawberry yoghurt", "French fries", NA… ## $ meal_plan <chr> "Lunch only", "Lunch only", "Breakfast a… ## $ age <dbl> 4, 5, 7, NA, 5 ## $ ses <chr> "High", "Middle", "Low", "Middle", "High" ``` ] .pull-right-narrow[ <img src="img/fav-food/fav-food-age.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Socio-economic status .question[ What order are the levels of `ses` listed in? ] .pull-left-wide[ ``` r fav_food |> count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <chr> <int> ## 1 High 2 ## 2 Low 1 ## 3 Middle 2 ``` ] .pull-right-narrow[ <img src="img/fav-food/fav-food-ses.png" width="60%" style="display: block; margin: auto;" /> ] --- ## Make `ses` factor .pull-left-wide[ ``` r fav_food <- fav_food |> * mutate(ses = fct_relevel(ses, "Low", "Middle", "High")) fav_food |> count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <fct> <int> ## 1 Low 1 ## 2 Middle 2 ## 3 High 2 ``` ] --- ## Putting it altogether ``` r fav_food <- read_excel("data/favourite-food.xlsx", na = c("N/A", "99999")) |> janitor::clean_names() |> mutate( age = if_else(age == "five", "5", age), age = as.numeric(age), ses = fct_relevel(ses, "Low", "Middle", "High") ) fav_food ``` ``` ## # A tibble: 5 × 6 ## student_id full_name favourite_food meal_plan age ses ## <dbl> <chr> <chr> <chr> <dbl> <fct> ## 1 1 Sunil Huffmann Strawberry yo… Lunch on… 4 High ## 2 2 Barclay Lynn French fries Lunch on… 5 Midd… ## 3 3 Jayendra Lyne <NA> Breakfas… 7 Low ## 4 4 Leon Rossini Anchovies Lunch on… NA Midd… ## 5 5 Chidiegwu Dunk… Pizza Breakfas… 5 High ``` --- ## Out and back in ``` r write_csv(fav_food, file = "data/fav-food-clean.csv") fav_food_clean <- read_csv("data/fav-food-clean.csv") ``` --- .question[ What happened to `ses` again? ] ``` r fav_food_clean |> count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <chr> <int> ## 1 High 2 ## 2 Low 1 ## 3 Middle 2 ``` --- ## `read_rds()` and `write_rds()` - CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to. - An alternative is RDS files, you can read and write them with `read_rds()` and `write_rds()`, respectively. ``` r read_rds(path) write_rds(x, path) ``` --- ## Out and back in, take 2 ``` r write_rds(fav_food, file = "data/fav-food-clean.rds") fav_food_clean <- read_rds("data/fav-food-clean.rds") fav_food_clean |> count(ses) ``` ``` ## # A tibble: 3 × 2 ## ses n ## <fct> <int> ## 1 Low 1 ## 2 Middle 2 ## 3 High 2 ``` --- class: middle # Other types of data --- ## Other types of data - **googlesheets4:** Google Sheets - **haven**: SPSS, Stata, and SAS files - **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame - **jsonline**: JSON - **xml2**: xml - **rvest**: web scraping - **httr**: web APIs - **sparklyr**: data loaded into spark --- class: middle # Case study: Religion and income --- <img src="img/relig-income.png" width="75%" style="display: block; margin: auto;" /> .footnote[ Source: [pewforum.org/religious-landscape-study/income-distribution](https://www.pewforum.org/religious-landscape-study/income-distribution/), Retrieved 14 April, 2020 ] --- ## Read data ``` r library(readxl) rel_inc <- read_excel("data/relig-income.xlsx") ``` .small[ ``` ## # A tibble: 12 × 6 ## `Religious tradition` `Less than $30,000` `$30,000-$49,999` ## <chr> <dbl> <dbl> ## 1 Buddhist 0.36 0.18 ## 2 Catholic 0.36 0.19 ## 3 Evangelical Protestant 0.35 0.22 ## 4 Hindu 0.17 0.13 ## 5 Historically Black Prote… 0.53 0.22 ## 6 Jehovah's Witness 0.48 0.25 ## # ℹ 6 more rows ## # ℹ 3 more variables: `$50,000-$99,999` <dbl>, ## # `$100,000 or more` <dbl>, `Sample Size` <dbl> ``` ] --- ## Rename columns .midi[ ``` r rel_inc |> rename( religion = `Religious tradition`, n = `Sample Size` ) ``` ``` ## # A tibble: 12 × 6 ## religion `Less than $30,000` `$30,000-$49,999` ## <chr> <dbl> <dbl> ## 1 Buddhist 0.36 0.18 ## 2 Catholic 0.36 0.19 ## 3 Evangelical Protestant 0.35 0.22 ## 4 Hindu 0.17 0.13 ## 5 Historically Black Prote… 0.53 0.22 ## 6 Jehovah's Witness 0.48 0.25 ## # ℹ 6 more rows ## # ℹ 3 more variables: `$50,000-$99,999` <dbl>, ## # `$100,000 or more` <dbl>, n <dbl> ``` ] --- .question[ If we want a new variable called `income` with levels such as "Less than $30,000", "$30,000-$49,999", ... etc. which function should we use? ] ``` ## # A tibble: 48 × 4 ## religion n income proportion ## <chr> <dbl> <chr> <dbl> ## 1 Buddhist 233 Less than $30,000 0.36 ## 2 Buddhist 233 $30,000-$49,999 0.18 ## 3 Buddhist 233 $50,000-$99,999 0.32 ## 4 Buddhist 233 $100,000 or more 0.13 ## 5 Catholic 6137 Less than $30,000 0.36 ## 6 Catholic 6137 $30,000-$49,999 0.19 ## 7 Catholic 6137 $50,000-$99,999 0.26 ## 8 Catholic 6137 $100,000 or more 0.19 ## 9 Evangelical Protestant 7462 Less than $30,000 0.35 ## 10 Evangelical Protestant 7462 $30,000-$49,999 0.22 ## 11 Evangelical Protestant 7462 $50,000-$99,999 0.28 ## 12 Evangelical Protestant 7462 $100,000 or more 0.14 ## 13 Hindu 172 Less than $30,000 0.17 ## 14 Hindu 172 $30,000-$49,999 0.13 ## 15 Hindu 172 $50,000-$99,999 0.34 ## # ℹ 33 more rows ``` --- ## Pivot longer .midi[ ``` r rel_inc |> rename( religion = `Religious tradition`, n = `Sample Size` ) |> pivot_longer( cols = -c(religion, n), # all but religion and n names_to = "income", values_to = "proportion" ) ``` ``` ## # A tibble: 48 × 4 ## religion n income proportion ## <chr> <dbl> <chr> <dbl> ## 1 Buddhist 233 Less than $30,000 0.36 ## 2 Buddhist 233 $30,000-$49,999 0.18 ## 3 Buddhist 233 $50,000-$99,999 0.32 ## 4 Buddhist 233 $100,000 or more 0.13 ## 5 Catholic 6137 Less than $30,000 0.36 ## 6 Catholic 6137 $30,000-$49,999 0.19 ## # ℹ 42 more rows ``` ] --- ## Calculate frequencies .midi[ ``` r rel_inc |> rename( religion = `Religious tradition`, n = `Sample Size` ) |> pivot_longer( cols = -c(religion, n), names_to = "income", values_to = "proportion" ) |> mutate(frequency = round(proportion * n)) ``` ``` ## # A tibble: 48 × 5 ## religion n income proportion frequency ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 Buddhist 233 Less than $30,000 0.36 84 ## 2 Buddhist 233 $30,000-$49,999 0.18 42 ## 3 Buddhist 233 $50,000-$99,999 0.32 75 ## 4 Buddhist 233 $100,000 or more 0.13 30 ## 5 Catholic 6137 Less than $30,000 0.36 2209 ## 6 Catholic 6137 $30,000-$49,999 0.19 1166 ## # ℹ 42 more rows ``` ] --- ## Save data ``` r rel_inc_long <- rel_inc |> rename( religion = `Religious tradition`, n = `Sample Size` ) |> pivot_longer( cols = -c(religion, n), names_to = "income", values_to = "proportion" ) |> mutate(frequency = round(proportion * n)) ``` --- ## Barplot ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency)) + geom_col() ``` <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-96-1.png" width="65%" style="display: block; margin: auto;" /> --- ## Recode religion .panelset[ .panel[.panel-name[Recode] ``` r rel_inc_long <- rel_inc_long |> mutate(religion = case_when( religion == "Evangelical Protestant" ~ "Ev. Protestant", religion == "Historically Black Protestant" ~ "Hist. Black Protestant", religion == 'Unaffiliated (religious "nones")' ~ "Unaffiliated", TRUE ~ religion )) ``` ] .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-98-1.png" width="65%" style="display: block; margin: auto;" /> ] ] --- ## Reverse religion order .panelset[ .panel[.panel-name[Recode] ``` r rel_inc_long <- rel_inc_long |> mutate(religion = fct_rev(religion)) ``` ] .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-100-1.png" width="65%" style="display: block; margin: auto;" /> ] ] --- ## Add income .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/rel-income-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col() ``` ] ] --- ## Fill bars .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") ``` ] ] --- ## Change colors .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-viridis-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") + scale_fill_viridis_d() ``` ] ] --- ## Change theme .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-viridis-minimal-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") + scale_fill_viridis_d() + theme_minimal() ``` ] ] --- ## Move legend to the bottom .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/bottom-legend-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") + scale_fill_viridis_d() + theme_minimal() + theme(legend.position = "bottom") ``` ] ] --- ## Legend adjustments .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-106-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") + scale_fill_viridis_d() + theme_minimal() + theme(legend.position = "bottom") + guides(fill = guide_legend(nrow = 2, byrow = TRUE)) ``` ] ] --- ## Fix labels .panelset[ .panel[.panel-name[Plot] <img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-107-1.png" width="65%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + geom_col(position = "fill") + scale_fill_viridis_d() + theme_minimal() + theme(legend.position = "bottom") + guides(fill = guide_legend(nrow = 2, byrow = TRUE)) + labs( x = "Proportion", y = "", title = "Income distribution by religious group", subtitle = "Source: Pew Research Center, Religious Landscape Study", fill = "Income" ) ``` ] ]