Data Importing & Cleaning

.title[
# Data Importing & Cleaning
]
.subtitle[
## STAT 7500
]
.author[
### Katie Fitzgerald, adpated from datasciencebox.org
]

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a>
</span>
</div>

---

# Why should you care about data types?

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

``` r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 × 3
##   name           number_of_cats handedness
##   <chr>          <chr>          <chr>     
## 1 Bernice Warren 0              left      
## 2 Woodrow Stone  0              left      
## 3 Willie Bass    1              left      
## 4 Tyrone Estrada 3              left      
## 5 Alex Daniels   3              left      
## 6 Jane Bates     2              left      
## # ℹ 54 more rows
```

---

## Oh why won't you work?!

``` r
cat_lovers |>
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

``` r
?mean
```

---

## Oh why won't you still work??!!

``` r
cat_lovers |>
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `mean_cats = mean(number_of_cats, na.rm = TRUE)`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

``` r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## Let's take another look

.small[
<div class="datatables html-widget html-fill-item" id="htmlwidget-1b4ff99564eb6e8884a5" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-1b4ff99564eb6e8884a5">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60"],["Bernice Warren","Woodrow Stone","Willie Bass","Tyrone Estrada","Alex Daniels","Jane Bates","Latoya Simpson","Darin Woods","Agnes Cobb","Tabitha Grant","Perry Cross","Wanda Silva","Alicia Sims","Emily Logan","Woodrow Elliott","Brent Copeland","Pedro Carlson","Patsy Luna","Brett Robbins","Oliver George","Calvin Perry","Lora Gutierrez","Charlotte Sparks","Earl Mack","Leslie Wade","Santiago Barker","Jose Bell","Lynda Smith","Bradford Marshall","Irving Miller","Caroline Simpson","Frances Welch","Melba Jenkins","Veronica Morales","Juanita Cunningham","Maurice Howard","Teri Pierce","Phil Franklin","Jan Zimmerman","Leslie Price","Bessie Patterson","Ethel Wolfe","Naomi Wright","Sadie Frank","Lonnie Cannon","Tony Garcia","Darla Newton","Ginger Clark","Lionel Campbell","Florence Klein","Harriet Leonard","Terrence Harrington","Travis Garner","Doug Bass","Pat Norris","Dawn Young","Shari Alvarez","Tamara Robinson","Megan Morgan","Kara Obrien"],["0","0","1","3","3","2","1","1","0","0","0","0","1","3","3","2","1","1","0","0","1","1","0","0","4","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","1","3","3","2","1","1.5 - honestly I think one of my cats is half human","0","0","1","0","1","three","1","1","1","0","0","2"],["left","left","left","left","left","left","left","left","left","left","left","left","left","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","right","ambidextrous","ambidextrous","ambidextrous","ambidextrous","ambidextrous"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>name<\/th>\n      <th>number_of_cats<\/th>\n      <th>handedness<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"orderable":false,"targets":0},{"name":" ","targets":0},{"name":"name","targets":1},{"name":"number_of_cats","targets":2},{"name":"handedness","targets":3}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

---

## Sometimes you might need to babysit your respondents

``` r
cat_lovers |>
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) |>
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `number_of_cats = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## Always you need to respect data types

``` r
cat_lovers |>
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) |>
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 × 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Now that we know what we're doing...

``` r
*cat_lovers <- cat_lovers |>
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.

---

<br>

---

# Data types

---

## Data types in R

- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those

---

## Logical & character

``` r
typeof(TRUE)
```

```
## [1] "logical"
```
]
.pull-right[
**character** - character strings

``` r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

``` r
typeof(1.335)
```

```
## [1] "double"
```

``` r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**integer** - integer numerical values (indicated with an `L`)

``` r
typeof(7L)
```

```
## [1] "integer"
```

``` r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Converting between types

``` r
x <- 1:3
x
```

```
## [1] 1 2 3
```

``` r
typeof(x)
```

```
## [1] "integer"
```
]
--
.pull-right[

``` r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

``` r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types

``` r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

``` r
typeof(x)
```

```
## [1] "logical"
```
]
--
.pull-right[

``` r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

``` r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!

``` r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

``` r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]
--
.pull-right[

``` r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

``` r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]

---

## Explicit vs. implicit coercion

Let's give formal names to what we've seen so far:

--
- **Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`

--
- **Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector

---

# Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

``` r
pi / 0
```

```
## [1] Inf
```

``` r
0 / 0
```

```
## [1] NaN
```
]
.pull-right[

``` r
1/0 - 1/0
```

```
## [1] NaN
```

``` r
1/0 + 1/0
```

```
## [1] Inf
```
]

---

## `NA`s are special ❄️s

Unlike `NaN`, `NA`s are genuinely unknown values. R uses `NA` to represent missing values in its data structures.

``` r
x <- c(1, 2, 3, 4, NA)
```

``` r
mean(x)
```

```
## [1] NA
```

``` r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

``` r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

# Data classes

---

## Data classes

We talked about *types* so far, next we'll introduce the concept of *classes*

- Vectors are like Lego building blocks

--
- We stick them together to build more complicated constructs, e.g. *representations of data*

--
- The **class** attribute relates to the S3 class of an object which determines its behaviour
  - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious

--
- Examples: factors, dates, and data frames
  
---

## Factors

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values

``` r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

``` r
typeof(x)
```

```
## [1] "integer"
```
]
.pull-right[

``` r
class(x)
```

```
## [1] "factor"
```
]

---

## More on factors

We can think of factors like character (level labels) and an integer (level numbers) glued together

``` r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

``` r
as.integer(x)
```

```
## [1] 1 2 3 2
```

---

## Dates

``` r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

``` r
typeof(y)
```

```
## [1] "double"
```

``` r
class(y)
```

```
## [1] "Date"
```

---

## More on dates

We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together

``` r
as.integer(y)
```

```
## [1] 18262
```

``` r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

---

## Data frames

We can think of data frames like like vectors of equal length glued together

``` r
df <- data.frame(x = 1:2, y = 3:4)
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

``` r
typeof(df)
```

```
## [1] "list"
```
]
.pull-right[

``` r
class(df)
```

```
## [1] "data.frame"
```
]

---

## Lists

Lists are a generic vector container; vectors of any type can go in them

``` r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
```

```
## $x
## [1] 1 2 3 4
## 
## $y
## [1] "hi"    "hello" "jello"
## 
## $z
## [1]  TRUE FALSE
```

---

## Lists and data frames

- A data frame is a special list containing vectors of equal length
- When we use the `pull()` function, we extract a vector from the data frame

``` r
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

``` r
df |>
  pull(y)
```

```
## [1] 3 4
```

---

# Working with factors

---

## Read data in as character strings

``` r
glimpse(cat_lovers)
```

---

## But coerce when plotting

``` r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Use forcats to manipulate factors

``` r
cat_lovers |>
* mutate(handedness = fct_infreq(handedness)) |>
  ggplot(mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Come for the functionality

.pull-left[
... stay for the logo
]
.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]

.pull-left-wide[
- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
]

---

# Working with dates

---

## Make a date

.pull-left[
<img src="img/lubridate-not-part-of-tidyverse.png" width="65%" style="display: block; margin: auto;" />
]
.pull-right[
- **lubridate** is the tidyverse-friendly package that makes dealing with dates a little easier
- It's not one of the *core* tidyverse packages, hence it's installed with `install.packages("tidyverse)` but it's not loaded with it, and needs to be explicitly loaded with `library(lubridate)`
]

---

---

``` r
hotels |>
  select(starts_with("arrival_"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## 6              2015 July                                     27
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## Step 1. Construct dates

``` r
library(glue)

hotels |>
  mutate(
*   arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")
    ) |> 
  relocate(arrival_date)
```

```
## # A tibble: 119,390 × 33
##   arrival_date hotel      is_canceled lead_time arrival_date_year
##   <glue>       <chr>            <dbl>     <dbl>             <dbl>
## 1 2015 July 1  Resort Ho…           0       342              2015
## 2 2015 July 1  Resort Ho…           0       737              2015
## 3 2015 July 1  Resort Ho…           0         7              2015
## 4 2015 July 1  Resort Ho…           0        13              2015
...
```
]

---

## Step 2. Count bookings per date

``` r
hotels |>
  mutate(arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")) |>
  count(arrival_date)
```

```
## # A tibble: 793 × 2
##   arrival_date       n
##   <glue>         <int>
## 1 2015 August 1    110
## 2 2015 August 10   207
## 3 2015 August 11   117
## 4 2015 August 12   133
## 5 2015 August 13   107
## 6 2015 August 14   329
## # ℹ 787 more rows
```
]

---

## Step 3. Visualise bookings per date

``` r
hotels |>
  mutate(arrival_date = glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}")) |>
  count(arrival_date) |>
  ggplot(aes(x = arrival_date, y = n, group = 1)) +
  geom_line()
```

<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-49-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.question[
Why does the plot start with August when we know our data start in July? And why does 10 August come after 1 August?
]

.midi[
<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-50-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Step 1. *REVISED* Construct dates "as dates"

``` r
library(lubridate)

hotels |>
  mutate(
*   arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))
    ) |> 
  relocate(arrival_date)
```

```
## # A tibble: 119,390 × 33
##   arrival_date hotel      is_canceled lead_time arrival_date_year
##   <date>       <chr>            <dbl>     <dbl>             <dbl>
## 1 2015-07-01   Resort Ho…           0       342              2015
## 2 2015-07-01   Resort Ho…           0       737              2015
## 3 2015-07-01   Resort Ho…           0         7              2015
## 4 2015-07-01   Resort Ho…           0        13              2015
...
```
]

---

## Step 2. Count bookings per date

``` r
hotels |>
  mutate(arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))) |> 
  count(arrival_date)
```

```
## # A tibble: 793 × 2
##   arrival_date     n
##   <date>       <int>
## 1 2015-07-01     122
## 2 2015-07-02      93
## 3 2015-07-03      56
## 4 2015-07-04      88
## 5 2015-07-05      53
## 6 2015-07-06      75
## # ℹ 787 more rows
```
]

---

## Step 3a. Visualise bookings per date

``` r
hotels |>
  mutate(arrival_date = ymd(glue("{arrival_date_year} {arrival_date_month} {arrival_date_day_of_month}"))) |> 
  count(arrival_date) |>
  ggplot(aes(x = arrival_date, y = n, group = 1)) +
  geom_line()
```

<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-53-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Step 3b. Visualise using a smooth curve

<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-54-1.png" width="80%" style="display: block; margin: auto;" />
]

---

# Reading rectangular data into R

---

.pull-left[
<img src="img/readr.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/readxl.png" width="80%" style="display: block; margin: auto;" />
]

---

- `read_csv()` - comma delimited files
- `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place)
- `read_tsv()` - tab delimited files
- `read_delim()` - reads in files with any delimiter
- `read_fwf()` - fixed width files
- ...
]

--
.pull-right[
## readxl

- `read_excel()` - read xls or xlsx files
- ...
]

---

## Reading data

``` r
nobel <- read_csv(file = "data/nobel.csv")
nobel
```

```
## # A tibble: 935 × 26
##      id firstname      surname    year category affiliation city 
##   <dbl> <chr>          <chr>     <dbl> <chr>    <chr>       <chr>
## 1     1 Wilhelm Conrad Röntgen    1901 Physics  Munich Uni… Muni…
## 2     2 Hendrik A.     Lorentz    1902 Physics  Leiden Uni… Leid…
## 3     3 Pieter         Zeeman     1902 Physics  Amsterdam … Amst…
## 4     4 Henri          Becquerel  1903 Physics  École Poly… Paris
## 5     5 Pierre         Curie      1903 Physics  École muni… Paris
## 6     6 Marie          Curie      1903 Physics  <NA>        <NA> 
## # ℹ 929 more rows
## # ℹ 19 more variables: country <chr>, born_date <date>,
## #   died_date <date>, gender <chr>, born_city <chr>,
## #   born_country <chr>, born_country_code <chr>,
## #   died_city <chr>, died_country <chr>,
## #   died_country_code <chr>, overall_motivation <chr>,
## #   share <dbl>, motivation <chr>, …
```

---

## Writing data

``` r
df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, file = "data/df.csv")
```
]

--
.pull-right[
- Read it back in to inspect

``` r
read_csv("data/df.csv")
```

```
## # A tibble: 3 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c
```
]

---

# Variable names

---

## Data with bad names

``` r
edibnb_badnames <- read_csv("data/edibnb-badnames.csv")
names(edibnb_badnames)
```

```
##  [1] "ID"                   "Price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "Number of bathrooms"  "Number of Bedrooms"  
##  [7] "n beds"               "Review Scores Rating"
##  [9] "Number of reviews"    "listing_url"
```

... but R doesn't allow spaces in variable names

``` r
ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) +
  geom_point()
```

```
## Error in parse(text = input): <text>:1:40: unexpected symbol
## 1: ggplot(edibnb_badnames, aes(x = Number of
##                                            ^
```

---

## Option 1 - Define column names

``` r
edibnb_col_names <- read_csv("data/edibnb-badnames.csv",
                             col_names = c("id", "price", 
                                           "neighbourhood", "accommodates",
                                           "bathroom", "bedroom", 
                                           "bed", "review_scores_rating", 
                                           "n_reviews", "url"))

names(edibnb_col_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "bathroom"             "bedroom"             
##  [7] "bed"                  "review_scores_rating"
##  [9] "n_reviews"            "url"
```
]

---

## Option 2 - Format text to snake_case

``` r
edibnb_clean_names <- read_csv("data/edibnb-badnames.csv") |>
  janitor::clean_names()

names(edibnb_clean_names)
```

```
##  [1] "id"                   "price"               
##  [3] "neighbourhood"        "accommodates"        
##  [5] "number_of_bathrooms"  "number_of_bedrooms"  
##  [7] "n_beds"               "review_scores_rating"
##  [9] "number_of_reviews"    "listing_url"
```

---

# Variable types

---

.pull-left[
<img src="img/df-na.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
read_csv("data/df-na.csv")
```

```
## # A tibble: 9 × 3
##   x     y              z     
##   <chr> <chr>          <chr> 
## 1 1     a              hi    
## 2 <NA>  b              hello 
## 3 3     Not applicable 9999  
## 4 4     d              ola   
## 5 5     e              hola  
## 6 .     f              whatup
## 7 7     g              wassup
## 8 8     h              sup   
## 9 9     i              <NA>
```
]

---

## Option 1. Explicit NAs

``` r
read_csv("data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))
```

.pull-left[
<img src="img/df-na.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```
## # A tibble: 9 × 3
##       x y     z     
##   <dbl> <chr> <chr> 
## 1     1 a     hi    
## 2    NA b     hello 
## 3     3 <NA>  <NA>  
## 4     4 d     ola   
## 5     5 e     hola  
## 6    NA f     whatup
## 7     7 g     wassup
## 8     8 h     sup   
## 9     9 i     <NA>
```
]

---

## Option 2. Specify column types

``` r
read_csv("data/df-na.csv", col_types = list(col_double(), 
                                            col_character(), 
                                            col_character()))
```

```
## Warning: One or more parsing issues, call `problems()` on your data frame
## for details, e.g.:
##   dat <- vroom(...)
##   problems(dat)
```

```
## # A tibble: 9 × 3
##       x y              z     
##   <dbl> <chr>          <chr> 
## 1     1 a              hi    
## 2    NA b              hello 
## 3     3 Not applicable 9999  
## 4     4 d              ola   
## 5     5 e              hola  
## 6    NA f              whatup
## 7     7 g              wassup
## 8     8 h              sup   
## 9     9 i              <NA>
```

---

## Column types

.small[
**type function**  | **data type**
------------------ | -------------
`col_character()`  | character
`col_date()`       | date
`col_datetime()`   | POSIXct (date-time)
`col_double()`     | double (numeric)
`col_factor()`     | factor
`col_guess()`      | let readr guess (default)
`col_integer()`    | integer
`col_logical()`    | logical
`col_number()`     | numbers mixed with non-number characters
`col_numeric()`    | double or integer
`col_skip()`       | do not read
`col_time()`       | time
]

---

class:middle

# Case study: Favourite foods

---

## Favourite foods

``` r
*fav_food <- read_excel("data/favourite-food.xlsx")

fav_food
```

```
## # A tibble: 5 × 6
##   `Student ID` `Full Name`    favourite.food mealPlan AGE   SES  
##          <dbl> <chr>          <chr>          <chr>    <chr> <chr>
## 1            1 Sunil Huffmann Strawberry yo… Lunch o… 4     High 
## 2            2 Barclay Lynn   French fries   Lunch o… 5     Midd…
## 3            3 Jayendra Lyne  N/A            Breakfa… 7     Low  
## 4            4 Leon Rossini   Anchovies      Lunch o… 99999 Midd…
## 5            5 Chidiegwu Dun… Pizza          Breakfa… five  High
```

---

## Variable names

``` r
fav_food <- read_excel("data/favourite-food.xlsx") |>
* janitor::clean_names()

fav_food 
```

```
## # A tibble: 5 × 6
##   student_id full_name       favourite_food meal_plan age   ses  
##        <dbl> <chr>           <chr>          <chr>     <chr> <chr>
## 1          1 Sunil Huffmann  Strawberry yo… Lunch on… 4     High 
## 2          2 Barclay Lynn    French fries   Lunch on… 5     Midd…
## 3          3 Jayendra Lyne   N/A            Breakfas… 7     Low  
## 4          4 Leon Rossini    Anchovies      Lunch on… 99999 Midd…
## 5          5 Chidiegwu Dunk… Pizza          Breakfas… five  High
```

---

## Handling NAs

``` r
fav_food <- read_excel("data/favourite-food.xlsx",
*                      na = c("N/A", "99999")) |>
  janitor::clean_names()

fav_food 
```

```
## # A tibble: 5 × 6
##   student_id full_name       favourite_food meal_plan age   ses  
##        <dbl> <chr>           <chr>          <chr>     <chr> <chr>
## 1          1 Sunil Huffmann  Strawberry yo… Lunch on… 4     High 
## 2          2 Barclay Lynn    French fries   Lunch on… 5     Midd…
## 3          3 Jayendra Lyne   <NA>           Breakfas… 7     Low  
## 4          4 Leon Rossini    Anchovies      Lunch on… <NA>  Midd…
## 5          5 Chidiegwu Dunk… Pizza          Breakfas… five  High
```

---

## Make `age` numeric

``` r
fav_food <- fav_food |>
* mutate(
*   age = if_else(age == "five", "5", age),
*   age = as.numeric(age)
*   )

glimpse(fav_food) 
```

```
## Rows: 5
## Columns: 6
## $ student_id     <dbl> 1, 2, 3, 4, 5
## $ full_name      <chr> "Sunil Huffmann", "Barclay Lynn", "Jayen…
## $ favourite_food <chr> "Strawberry yoghurt", "French fries", NA…
## $ meal_plan      <chr> "Lunch only", "Lunch only", "Breakfast a…
## $ age            <dbl> 4, 5, 7, NA, 5
## $ ses            <chr> "High", "Middle", "Low", "Middle", "High"
```
]
.pull-right-narrow[
<img src="img/fav-food/fav-food-age.png" width="60%" style="display: block; margin: auto;" />
]

---

## Socio-economic status

``` r
fav_food |>
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <chr>  <int>
## 1 High       2
## 2 Low        1
## 3 Middle     2
```
]
.pull-right-narrow[
<img src="img/fav-food/fav-food-ses.png" width="60%" style="display: block; margin: auto;" />
]

---

## Make `ses` factor

``` r
fav_food <- fav_food |>
* mutate(ses = fct_relevel(ses, "Low", "Middle", "High"))

fav_food |>
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <fct>  <int>
## 1 Low        1
## 2 Middle     2
## 3 High       2
```
]

---

## Putting it altogether

``` r
fav_food <- read_excel("data/favourite-food.xlsx", na = c("N/A", "99999")) |>
  janitor::clean_names() |>
  mutate(
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age),
    ses = fct_relevel(ses, "Low", "Middle", "High")
  )

fav_food
```

```
## # A tibble: 5 × 6
##   student_id full_name       favourite_food meal_plan   age ses  
##        <dbl> <chr>           <chr>          <chr>     <dbl> <fct>
## 1          1 Sunil Huffmann  Strawberry yo… Lunch on…     4 High 
## 2          2 Barclay Lynn    French fries   Lunch on…     5 Midd…
## 3          3 Jayendra Lyne   <NA>           Breakfas…     7 Low  
## 4          4 Leon Rossini    Anchovies      Lunch on…    NA Midd…
## 5          5 Chidiegwu Dunk… Pizza          Breakfas…     5 High
```

---

## Out and back in

``` r
write_csv(fav_food, file = "data/fav-food-clean.csv")

fav_food_clean <- read_csv("data/fav-food-clean.csv")
```

---

``` r
fav_food_clean |>
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <chr>  <int>
## 1 High       2
## 2 Low        1
## 3 Middle     2
```

---

## `read_rds()` and `write_rds()`

- CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to.
- An alternative is RDS files, you can read and write them with `read_rds()` and `write_rds()`, respectively.

``` r
read_rds(path)
write_rds(x, path)
```

---

## Out and back in, take 2

``` r
write_rds(fav_food, file = "data/fav-food-clean.rds")

fav_food_clean <- read_rds("data/fav-food-clean.rds")

fav_food_clean |>
  count(ses)
```

```
## # A tibble: 3 × 2
##   ses        n
##   <fct>  <int>
## 1 Low        1
## 2 Middle     2
## 3 High       2
```

---

# Other types of data

---

## Other types of data

- **googlesheets4:** Google Sheets
- **haven**: SPSS, Stata, and SAS files
- **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
- **jsonline**: JSON
- **xml2**: xml
- **rvest**: web scraping
- **httr**: web APIs
- **sparklyr**: data loaded into spark

---

# Case study: Religion and income

---

.footnote[
Source: [pewforum.org/religious-landscape-study/income-distribution](https://www.pewforum.org/religious-landscape-study/income-distribution/), Retrieved 14 April, 2020
]

---

## Read data

``` r
library(readxl)
rel_inc <- read_excel("data/relig-income.xlsx")
```

```
## # A tibble: 12 × 6
##   `Religious tradition`     `Less than $30,000` `$30,000-$49,999`
##   <chr>                                   <dbl>             <dbl>
## 1 Buddhist                                 0.36              0.18
## 2 Catholic                                 0.36              0.19
## 3 Evangelical Protestant                   0.35              0.22
## 4 Hindu                                    0.17              0.13
## 5 Historically Black Prote…                0.53              0.22
## 6 Jehovah's Witness                        0.48              0.25
## # ℹ 6 more rows
## # ℹ 3 more variables: `$50,000-$99,999` <dbl>,
## #   `$100,000 or more` <dbl>, `Sample Size` <dbl>
```
]

---

## Rename columns

``` r
rel_inc |>
  rename( 
    religion = `Religious tradition`, 
    n = `Sample Size` 
  ) 
```

```
## # A tibble: 12 × 6
##   religion                  `Less than $30,000` `$30,000-$49,999`
##   <chr>                                   <dbl>             <dbl>
## 1 Buddhist                                 0.36              0.18
## 2 Catholic                                 0.36              0.19
## 3 Evangelical Protestant                   0.35              0.22
## 4 Hindu                                    0.17              0.13
## 5 Historically Black Prote…                0.53              0.22
## 6 Jehovah's Witness                        0.48              0.25
## # ℹ 6 more rows
## # ℹ 3 more variables: `$50,000-$99,999` <dbl>,
## #   `$100,000 or more` <dbl>, n <dbl>
```
]

---

.question[
If we want a new variable called `income` with levels such as "Less than $30,000", "$30,000-$49,999", ... etc. which function should we use?
]

```
## # A tibble: 48 × 4
##    religion                   n income            proportion
##    <chr>                  <dbl> <chr>                  <dbl>
##  1 Buddhist                 233 Less than $30,000       0.36
##  2 Buddhist                 233 $30,000-$49,999         0.18
##  3 Buddhist                 233 $50,000-$99,999         0.32
##  4 Buddhist                 233 $100,000 or more        0.13
##  5 Catholic                6137 Less than $30,000       0.36
##  6 Catholic                6137 $30,000-$49,999         0.19
##  7 Catholic                6137 $50,000-$99,999         0.26
##  8 Catholic                6137 $100,000 or more        0.19
##  9 Evangelical Protestant  7462 Less than $30,000       0.35
## 10 Evangelical Protestant  7462 $30,000-$49,999         0.22
## 11 Evangelical Protestant  7462 $50,000-$99,999         0.28
## 12 Evangelical Protestant  7462 $100,000 or more        0.14
## 13 Hindu                    172 Less than $30,000       0.17
## 14 Hindu                    172 $30,000-$49,999         0.13
## 15 Hindu                    172 $50,000-$99,999         0.34
## # ℹ 33 more rows
```

---

## Pivot longer

``` r
rel_inc |>
  rename(
    religion = `Religious tradition`,
    n = `Sample Size`
  ) |>
  pivot_longer( 
    cols = -c(religion, n),   # all but religion and n 
    names_to = "income",  
    values_to = "proportion" 
  ) 
```

---

## Calculate frequencies

``` r
rel_inc |>
  rename(
    religion = `Religious tradition`,
    n = `Sample Size`
  ) |>
  pivot_longer(
    cols = -c(religion, n), 
    names_to = "income", 
    values_to = "proportion"
  ) |>
  mutate(frequency = round(proportion * n)) 
```

```
## # A tibble: 48 × 5
##   religion     n income            proportion frequency
##   <chr>    <dbl> <chr>                  <dbl>     <dbl>
## 1 Buddhist   233 Less than $30,000       0.36        84
## 2 Buddhist   233 $30,000-$49,999         0.18        42
## 3 Buddhist   233 $50,000-$99,999         0.32        75
## 4 Buddhist   233 $100,000 or more        0.13        30
## 5 Catholic  6137 Less than $30,000       0.36      2209
## 6 Catholic  6137 $30,000-$49,999         0.19      1166
## # ℹ 42 more rows
```
]

---

## Save data

``` r
rel_inc_long <- rel_inc |> 
  rename(
    religion = `Religious tradition`,
    n = `Sample Size`
  ) |>
  pivot_longer(
    cols = -c(religion, n), 
    names_to = "income", 
    values_to = "proportion"
  ) |>
  mutate(frequency = round(proportion * n))
```

---

## Barplot

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency)) +
  geom_col()
```

---

## Recode religion

``` r
rel_inc_long <- rel_inc_long |>
  mutate(religion = case_when(
    religion == "Evangelical Protestant"           ~ "Ev. Protestant",
    religion == "Historically Black Protestant"    ~ "Hist. Black Protestant",
    religion == 'Unaffiliated (religious "nones")' ~ "Unaffiliated",
    TRUE                                           ~ religion
  ))
```
]

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-98-1.png" width="65%" style="display: block; margin: auto;" />
]

]

---

## Reverse religion order

``` r
rel_inc_long <- rel_inc_long |>
  mutate(religion = fct_rev(religion)) 
```
]

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-100-1.png" width="65%" style="display: block; margin: auto;" />
]

]

---

## Add income

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/rel-income-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) + 
  geom_col() 
```
]

]

---

## Fill bars

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) +
  geom_col(position = "fill")
```
]

]

---

## Change colors

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-viridis-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) +
  geom_col(position = "fill") +
  scale_fill_viridis_d()
```
]

]

---

## Change theme

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/rel-income-fill-viridis-minimal-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) +
  geom_col(position = "fill") +
  scale_fill_viridis_d() +
  theme_minimal() 
```
]

]

---

## Move legend to the bottom

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/bottom-legend-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) +
  geom_col(position = "fill") +
  scale_fill_viridis_d() +
  theme_minimal() +
  theme(legend.position = "bottom")
```
]

]

---

## Legend adjustments

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-106-1.png" width="65%" style="display: block; margin: auto;" />
]

]

---

## Fix labels

.panel[.panel-name[Plot]
<img src="06-data-importing-cleaning_files/figure-html/unnamed-chunk-107-1.png" width="65%" style="display: block; margin: auto;" />
]

``` r
ggplot(rel_inc_long, aes(y = religion, x = frequency, fill = income)) +
  geom_col(position = "fill") +
  scale_fill_viridis_d() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE)) +
  labs(
    x = "Proportion", y = "", 
    title = "Income distribution by religious group", 
    subtitle = "Source: Pew Research Center, Religious Landscape Study", 
    fill = "Income" 
    )
```
]

]