Data Wrangling I

.title[
# Data Wrangling I
]
.subtitle[
## STAT 4380
]
.author[
### Katie Fitzgerald, adpated from datasciencebox.org
]

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://nova-stat-4380.netlify.app" target="_blank">nova-stat-4380.netlify.app</a>
</span>
</div>

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

> Tidy datasets are all alike; every untidy dataset is untidy in its own way

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

<br>

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

<br>

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)
]

---

# Grammar of data wrangling

---

## A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

---

## Rules of **dplyr** functions

- First argument is *always* a data frame
- Subsequent arguments say what to do with that data frame
- Always return a data frame
- Don't modify in place

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel
- Observations: Each row represents a hotel booking
- Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonia et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

``` r
hotels <- read_csv("data/hotels.csv")
```

.footnote[
Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)
]

---

## First look: Variables

``` r
names(hotels)
```

```
##  [1] "hotel"                         
##  [2] "is_canceled"                   
##  [3] "lead_time"                     
##  [4] "arrival_date_year"             
##  [5] "arrival_date_month"            
##  [6] "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"     
##  [8] "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"          
## [10] "adults"                        
## [11] "children"                      
## [12] "babies"                        
## [13] "meal"                          
## [14] "country"                       
## [15] "market_segment"                
## [16] "distribution_channel"          
## [17] "is_repeated_guest"             
## [18] "previous_cancellations"        
...
```

---

## Second look: Overview

``` r
glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort …
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14,…
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, …
## $ arrival_date_month             <chr> "July", "July", "July", …
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, …
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", …
## $ country                        <chr> "PRT", "PRT", "GBR", "GB…
## $ market_segment                 <chr> "Direct", "Direct", "Dir…
## $ distribution_channel           <chr> "Direct", "Direct", "Dir…
...
```

---

## `select` a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
*select(
  hotels, 
  lead_time
  )
```
]
.pull-right[
- Start with the function (a verb): `select()`
]

---

## `select` a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select( 
* hotels,
  lead_time
  )
```
]
.pull-right[
- Start with the function (a verb): `select()`
- First argument: data frame we're working with , `hotels`
]

---

## `select` a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select( 
  hotels, 
* lead_time
  )
```
]
.pull-right[
- Start with the function (a verb): `select()`
- First argument: data frame we're working with , `hotels`
- Second argument: variable we want to select, `lead_time`
]

---

## `select` a single column

View only `lead_time` (number of days between booking and arrival date):

``` r
select( 
  hotels, 
  lead_time
  )
```

```
## # A tibble: 119,390 × 1
##   lead_time
##       <dbl>
## 1       342
## 2       737
## 3         7
## 4        13
## 5        14
## 6        14
## # ℹ 119,384 more rows
```
]
.pull-right[
- Start with the function (a verb): `select()`
- First argument: data frame we're working with , `hotels`
- Second argument: variable we want to select, `lead_time`
- Result: data frame with 119390 rows and 1 column
]

---

``` r
select(hotels, lead_time)
```

```
## # A tibble: 119,390 × 1
##   lead_time
##       <dbl>
## 1       342
## 2       737
## 3         7
## 4        13
## 5        14
## 6        14
## # ℹ 119,384 more rows
```

---

## `select` multiple columns

View only the `hotel` type and `lead_time`:

``` r
select(hotels, hotel, lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
## 4 Resort Hotel        13
## 5 Resort Hotel        14
## 6 Resort Hotel        14
## # ℹ 119,384 more rows
```
]

---

## `select` to exclude variables

``` r
hotels |>
* select(-agent)
```

```
## # A tibble: 119,390 × 31
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 27 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>, …
```
]

---

## `select` a range of variables

``` r
hotels |>
* select(hotel:arrival_date_month)
```

```
## # A tibble: 119,390 × 5
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## 6 Resort Hotel           0        14              2015
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_month <chr>
```

---

## `select` variables with certain characteristics

``` r
hotels |>
* select(starts_with("arrival"))
```

```
## # A tibble: 119,390 × 4
##   arrival_date_year arrival_date_month arrival_date_week_number
##               <dbl> <chr>                                 <dbl>
## 1              2015 July                                     27
## 2              2015 July                                     27
## 3              2015 July                                     27
## 4              2015 July                                     27
## 5              2015 July                                     27
## 6              2015 July                                     27
## # ℹ 119,384 more rows
## # ℹ 1 more variable: arrival_date_day_of_month <dbl>
```

---

## `select` variables with certain characteristics

``` r
hotels |>
* select(ends_with("type"))
```

```
## # A tibble: 119,390 × 4
##   reserved_room_type assigned_room_type deposit_type
##   <chr>              <chr>              <chr>       
## 1 C                  C                  No Deposit  
## 2 C                  C                  No Deposit  
## 3 A                  C                  No Deposit  
## 4 A                  A                  No Deposit  
## 5 A                  A                  No Deposit  
## 6 A                  A                  No Deposit  
## # ℹ 119,384 more rows
## # ℹ 1 more variable: customer_type <chr>
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

.question[
What if we wanted to select columns, and then arrange the data in descending order of lead time?
]

---

## Data wrangling, step-by-step

``` r
hotels |>
  select(hotel, lead_time)
```

--
.pull-right[
Select, then arrange:

``` r
hotels |>
  select(hotel, lead_time) |>
  arrange(desc(lead_time))
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       737
## 2 Resort Hotel       709
## 3 City Hotel         629
## 4 City Hotel         629
## 5 City Hotel         629
## 6 City Hotel         629
## # ℹ 119,384 more rows
```
]

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
]
.pull-right[
.small[

``` r
*hotels |>
  select(hotel, lead_time) |>
  arrange(desc(lead_time))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
]
.pull-right[
.small[

``` r
hotels |>
* select(hotel, lead_time) |>
  arrange(desc(lead_time))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
- and then we arrange the data frame by `lead_time` in descending order.
]
.pull-right[
.small[

``` r
hotels |>
  select(hotel, lead_time) |> 
* arrange(desc(lead_time))
```

---

## Aside %>% vs. |>

Historically, the `tidyverse` community has used the `%>%` pipe operator, from the `magrittr` package (loaded with the `tidyverse`)

A couple of years ago, Hadley Wickham & Posit recommended we move to the "native pipe" `|>`

They are equivalent in most circumstances; `|>` supposedly is more streamlined and faster.

---

# Shortcuts for the pipe

The following keystroke is a "shortcut" for typing the pipe operator:

+ Command + Shift + M (Mac)
+ Ctrl + Shift + M (Windows)

By default it will type the old pipe `%>%`. To change this, go to Tools > Global Options > Code > Use native pipe operator.

---

## How does a pipe work?

- You can think about the following sequence of actions - find keys, start car, drive to work, park.

--
- Expressed as a set of nested functions in R pseudocode this would look like:

``` r
park(drive(start_car(find("keys")), to = "work"))
```

--
- Writing it out using pipes give it a more natural (and easier to read) 
structure:

``` r
find("keys") |>
  start_car() |>
  drive(to = "work") |>
  park()
```

---

## A note on piping and layering

- `|>` used mainly in **dplyr** pipelines, *we pipe the output of the previous line of code as the first input of the next line of code*

--
- `+` used in **ggplot2** plots is used for "layering", *we create the plot in layers, separated by `+`*

---

## dplyr

``` r
hotels +
  select(hotel, lead_time)
```

```
## Error: object 'hotel' not found
```

✅

``` r
hotels |>
  select(hotel, lead_time)
```

```
## # A tibble: 119,390 × 2
##   hotel        lead_time
##   <chr>            <dbl>
## 1 Resort Hotel       342
## 2 Resort Hotel       737
## 3 Resort Hotel         7
...
```
]

---

## ggplot2

``` r
ggplot(hotels, aes(x = hotel, fill = deposit_type)) |>
  geom_bar()
```

```
## Error in `geom_bar()`:
## ! `mapping` must be created by `aes()`.
## ✖ You've supplied a <ggplot2::ggplot> object.
## ℹ Did you use `%>%` or `|>` instead of `+`?
```

✅

``` r
ggplot(hotels, aes(x = hotel, fill = deposit_type)) +
  geom_bar()
```

<img src="05-data-wrangling-1a_files/figure-html/unnamed-chunk-26-1.png" width="25%" style="display: block; margin: auto;" />
]

---

## Code styling

Many of the styling principles are consistent across `|>` and `+`:

- always a space before
- always a line break after (for pipelines with more than 2 lines)

❌

``` r
ggplot(hotels,aes(x=hotel,y=deposit_type))+geom_bar()
```

✅

``` r
ggplot(hotels, aes(x = hotel, y = deposit_type)) + 
  geom_bar()
```

---

## `arrange` in ascending / descending order

``` r
hotels |>
  select(adults, children, babies) |>
* arrange(babies)
```

```
## # A tibble: 119,390 × 3
##   adults children babies
##    <dbl>    <dbl>  <dbl>
## 1      2        0      0
## 2      2        0      0
## 3      1        0      0
## 4      1        0      0
## 5      2        0      0
## 6      2        0      0
## # ℹ 119,384 more rows
```
]
.pull-right[

``` r
hotels |>
  select(adults, children, babies) |>
* arrange(desc(babies))
```

```
## # A tibble: 119,390 × 3
##   adults children babies
##    <dbl>    <dbl>  <dbl>
## 1      2        0     10
## 2      1        0      9
## 3      2        0      2
## 4      2        0      2
## 5      2        0      2
## 6      2        0      2
## # ℹ 119,384 more rows
```
]

---

## `slice` for certain row numbers

``` r
# first five
hotels |>
* slice(1:5)
```

```
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year
##   <chr>              <dbl>     <dbl>             <dbl>
## 1 Resort Hotel           0       342              2015
## 2 Resort Hotel           0       737              2015
## 3 Resort Hotel           0         7              2015
## 4 Resort Hotel           0        13              2015
## 5 Resort Hotel           0        14              2015
## # ℹ 28 more variables: arrival_date_month <chr>,
## #   arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
## #   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
## #   country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>, …
```
]

---

.tip[
In R, you can use the `#` for adding comments to your code. 
Any text following `#` will be printed as is, and won't be run as R code.
This is useful for leaving comments in your code and for temporarily disabling 
certain lines of code while debugging.
]

``` r
hotels |>
  # slice the first five rows  # this line is a comment
  #select(hotel) |>           # this one doesn't run
  slice(1:5)                   # this line runs
```

---

# Quiz

How many rows and columns do you expect the following code to produce? `starwars` has 87 rows and 14 columns to begin with (87 x 14).

``` r
starwars |> 
  select(name, homeworld, species)
```

a. 87 x 14
    b. 3 x 14
    c. 87 x 3
    d. 14 x 3

---

# Quiz

How many rows and columns do you expect the following code to produce? `starwars` has 87 rows and 14 columns to begin with.

``` r
starwars |> 
  slice(1:10)
```

a. 87 x 10
    b. 10 x 14
    c. 14 x 10
    d. 10  x 87