class: center, middle, inverse, title-slide .title[ # Data Wrangling I ] .subtitle[ ## STAT 4380 ] .author[ ### Katie Fitzgerald, adpated from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://nova-stat-4380.netlify.app" target="_blank">nova-stat-4380.netlify.app</a> </span> </div> --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- > Tidy datasets are all alike; every untidy dataset is untidy in its own way .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- ## .question[ What makes this data not tidy? ] <img src="img/hyperwar-airplanes-on-hand.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/hiv-est-prevalence-15-49.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/us-general-economic-characteristic-acs-2017.png" width="85%" style="display: block; margin: auto;" /> .footnote[ Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt) ] --- class: middle # Grammar of data wrangling --- ## A grammar of data wrangling... ... based on the concepts of functions as verbs that manipulate data frames .pull-left[ <img src="img/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ .midi[ - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows using index(es) - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarise`: reduce variables to values - `group_by`: for grouped operations - ... (many more) ] ] --- ## Rules of **dplyr** functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking - Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonia et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5)) ``` r hotels <- read_csv("data/hotels.csv") ``` .footnote[ Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md) ] --- ## First look: Variables ``` r names(hotels) ``` ``` ## [1] "hotel" ## [2] "is_canceled" ## [3] "lead_time" ## [4] "arrival_date_year" ## [5] "arrival_date_month" ## [6] "arrival_date_week_number" ## [7] "arrival_date_day_of_month" ## [8] "stays_in_weekend_nights" ## [9] "stays_in_week_nights" ## [10] "adults" ## [11] "children" ## [12] "babies" ## [13] "meal" ## [14] "country" ## [15] "market_segment" ## [16] "distribution_channel" ## [17] "is_repeated_guest" ## [18] "previous_cancellations" ... ``` --- ## Second look: Overview ``` r glimpse(hotels) ``` ``` ## Rows: 119,390 ## Columns: 32 ## $ hotel <chr> "Resort Hotel", "Resort … ## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, … ## $ lead_time <dbl> 342, 737, 7, 13, 14, 14,… ## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, … ## $ arrival_date_month <chr> "July", "July", "July", … ## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, … ## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, … ## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, … ## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, … ## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, … ## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, … ## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, … ## $ meal <chr> "BB", "BB", "BB", "BB", … ## $ country <chr> "PRT", "PRT", "GBR", "GB… ## $ market_segment <chr> "Direct", "Direct", "Dir… ## $ distribution_channel <chr> "Direct", "Direct", "Dir… ... ``` --- ## `select` a single column View only `lead_time` (number of days between booking and arrival date): .pull-left[ ``` r *select( hotels, lead_time ) ``` ] .pull-right[ - Start with the function (a verb): `select()` ] --- ## `select` a single column View only `lead_time` (number of days between booking and arrival date): .pull-left[ ``` r select( * hotels, lead_time ) ``` ] .pull-right[ - Start with the function (a verb): `select()` - First argument: data frame we're working with , `hotels` ] --- ## `select` a single column View only `lead_time` (number of days between booking and arrival date): .pull-left[ ``` r select( hotels, * lead_time ) ``` ] .pull-right[ - Start with the function (a verb): `select()` - First argument: data frame we're working with , `hotels` - Second argument: variable we want to select, `lead_time` ] --- ## `select` a single column View only `lead_time` (number of days between booking and arrival date): .pull-left[ ``` r select( hotels, lead_time ) ``` ``` ## # A tibble: 119,390 × 1 ## lead_time ## <dbl> ## 1 342 ## 2 737 ## 3 7 ## 4 13 ## 5 14 ## 6 14 ## # ℹ 119,384 more rows ``` ] .pull-right[ - Start with the function (a verb): `select()` - First argument: data frame we're working with , `hotels` - Second argument: variable we want to select, `lead_time` - Result: data frame with 119390 rows and 1 column ] --- .tip[ dplyr functions always expect a data frame and always yield a data frame. ] ``` r select(hotels, lead_time) ``` ``` ## # A tibble: 119,390 × 1 ## lead_time ## <dbl> ## 1 342 ## 2 737 ## 3 7 ## 4 13 ## 5 14 ## 6 14 ## # ℹ 119,384 more rows ``` --- ## `select` multiple columns View only the `hotel` type and `lead_time`: -- .pull-left[ ``` r select(hotels, hotel, lead_time) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 342 ## 2 Resort Hotel 737 ## 3 Resort Hotel 7 ## 4 Resort Hotel 13 ## 5 Resort Hotel 14 ## 6 Resort Hotel 14 ## # ℹ 119,384 more rows ``` ] --- ## `select` to exclude variables .small[ ``` r hotels |> * select(-agent) ``` ``` ## # A tibble: 119,390 × 31 ## hotel is_canceled lead_time arrival_date_year ## <chr> <dbl> <dbl> <dbl> ## 1 Resort Hotel 0 342 2015 ## 2 Resort Hotel 0 737 2015 ## 3 Resort Hotel 0 7 2015 ## 4 Resort Hotel 0 13 2015 ## 5 Resort Hotel 0 14 2015 ## 6 Resort Hotel 0 14 2015 ## # ℹ 119,384 more rows ## # ℹ 27 more variables: arrival_date_month <chr>, ## # arrival_date_week_number <dbl>, ## # arrival_date_day_of_month <dbl>, ## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, ## # adults <dbl>, children <dbl>, babies <dbl>, meal <chr>, ## # country <chr>, market_segment <chr>, … ``` ] --- ## `select` a range of variables ``` r hotels |> * select(hotel:arrival_date_month) ``` ``` ## # A tibble: 119,390 × 5 ## hotel is_canceled lead_time arrival_date_year ## <chr> <dbl> <dbl> <dbl> ## 1 Resort Hotel 0 342 2015 ## 2 Resort Hotel 0 737 2015 ## 3 Resort Hotel 0 7 2015 ## 4 Resort Hotel 0 13 2015 ## 5 Resort Hotel 0 14 2015 ## 6 Resort Hotel 0 14 2015 ## # ℹ 119,384 more rows ## # ℹ 1 more variable: arrival_date_month <chr> ``` --- ## `select` variables with certain characteristics ``` r hotels |> * select(starts_with("arrival")) ``` ``` ## # A tibble: 119,390 × 4 ## arrival_date_year arrival_date_month arrival_date_week_number ## <dbl> <chr> <dbl> ## 1 2015 July 27 ## 2 2015 July 27 ## 3 2015 July 27 ## 4 2015 July 27 ## 5 2015 July 27 ## 6 2015 July 27 ## # ℹ 119,384 more rows ## # ℹ 1 more variable: arrival_date_day_of_month <dbl> ``` --- ## `select` variables with certain characteristics ``` r hotels |> * select(ends_with("type")) ``` ``` ## # A tibble: 119,390 × 4 ## reserved_room_type assigned_room_type deposit_type ## <chr> <chr> <chr> ## 1 C C No Deposit ## 2 C C No Deposit ## 3 A C No Deposit ## 4 A A No Deposit ## 5 A A No Deposit ## 6 A A No Deposit ## # ℹ 119,384 more rows ## # ℹ 1 more variable: customer_type <chr> ``` --- ## Select helpers - `starts_with()`: Starts with a prefix - `ends_with()`: Ends with a suffix - `contains()`: Contains a literal string - `num_range()`: Matches a numerical range like x01, x02, x03 - `one_of()`: Matches variable names in a character vector - `everything()`: Matches all variables - `last_col()`: Select last variable, possibly with an offset - `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text) -- .question[ What if we wanted to select columns, and then arrange the data in descending order of lead time? ] --- ## Data wrangling, step-by-step .pull-left[ Select: ``` r hotels |> select(hotel, lead_time) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 342 ## 2 Resort Hotel 737 ## 3 Resort Hotel 7 ## 4 Resort Hotel 13 ## 5 Resort Hotel 14 ## 6 Resort Hotel 14 ## # ℹ 119,384 more rows ``` ] -- .pull-right[ Select, then arrange: ``` r hotels |> select(hotel, lead_time) |> arrange(desc(lead_time)) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 737 ## 2 Resort Hotel 709 ## 3 City Hotel 629 ## 4 City Hotel 629 ## 5 City Hotel 629 ## 6 City Hotel 629 ## # ℹ 119,384 more rows ``` ] --- --- class: middle # Pipes --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. -- .pull-left[ - Start with the data frame `hotels`, and pass it to the `select()` function, ] .pull-right[ .small[ ``` r *hotels |> select(hotel, lead_time) |> arrange(desc(lead_time)) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 737 ## 2 Resort Hotel 709 ## 3 City Hotel 629 ## 4 City Hotel 629 ## 5 City Hotel 629 ## 6 City Hotel 629 ## # ℹ 119,384 more rows ``` ] ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. .pull-left[ - Start with the data frame `hotels`, and pass it to the `select()` function, - then we select the variables `hotel` and `lead_time`, ] .pull-right[ .small[ ``` r hotels |> * select(hotel, lead_time) |> arrange(desc(lead_time)) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 737 ## 2 Resort Hotel 709 ## 3 City Hotel 629 ## 4 City Hotel 629 ## 5 City Hotel 629 ## 6 City Hotel 629 ## # ℹ 119,384 more rows ``` ] ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. .pull-left[ - Start with the data frame `hotels`, and pass it to the `select()` function, - then we select the variables `hotel` and `lead_time`, - and then we arrange the data frame by `lead_time` in descending order. ] .pull-right[ .small[ ``` r hotels |> select(hotel, lead_time) |> * arrange(desc(lead_time)) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 737 ## 2 Resort Hotel 709 ## 3 City Hotel 629 ## 4 City Hotel 629 ## 5 City Hotel 629 ## 6 City Hotel 629 ## # ℹ 119,384 more rows ``` ] ] --- ## Aside %>% vs. |> Historically, the `tidyverse` community has used the `%>%` pipe operator, from the `magrittr` package (loaded with the `tidyverse`) A couple of years ago, Hadley Wickham & Posit recommended we move to the "native pipe" `|>` They are equivalent in most circumstances; `|>` supposedly is more streamlined and faster. --- # Shortcuts for the pipe The following keystroke is a "shortcut" for typing the pipe operator: + Command + Shift + M (Mac) + Ctrl + Shift + M (Windows) By default it will type the old pipe `%>%`. To change this, go to Tools > Global Options > Code > Use native pipe operator. --- ## How does a pipe work? - You can think about the following sequence of actions - find keys, start car, drive to work, park. -- - Expressed as a set of nested functions in R pseudocode this would look like: ``` r park(drive(start_car(find("keys")), to = "work")) ``` -- - Writing it out using pipes give it a more natural (and easier to read) structure: ``` r find("keys") |> start_car() |> drive(to = "work") |> park() ``` --- ## A note on piping and layering - `|>` used mainly in **dplyr** pipelines, *we pipe the output of the previous line of code as the first input of the next line of code* -- - `+` used in **ggplot2** plots is used for "layering", *we create the plot in layers, separated by `+`* --- ## dplyr .midi[ ❌ ``` r hotels + select(hotel, lead_time) ``` ``` ## Error: object 'hotel' not found ``` ✅ ``` r hotels |> select(hotel, lead_time) ``` ``` ## # A tibble: 119,390 × 2 ## hotel lead_time ## <chr> <dbl> ## 1 Resort Hotel 342 ## 2 Resort Hotel 737 ## 3 Resort Hotel 7 ... ``` ] --- ## ggplot2 .midi[ ❌ ``` r ggplot(hotels, aes(x = hotel, fill = deposit_type)) |> geom_bar() ``` ``` ## Error in `geom_bar()`: ## ! `mapping` must be created by `aes()`. ## ✖ You've supplied a <ggplot2::ggplot> object. ## ℹ Did you use `%>%` or `|>` instead of `+`? ``` ✅ ``` r ggplot(hotels, aes(x = hotel, fill = deposit_type)) + geom_bar() ``` <img src="05-data-wrangling-1a_files/figure-html/unnamed-chunk-26-1.png" width="25%" style="display: block; margin: auto;" /> ] --- ## Code styling Many of the styling principles are consistent across `|>` and `+`: - always a space before - always a line break after (for pipelines with more than 2 lines) ❌ ``` r ggplot(hotels,aes(x=hotel,y=deposit_type))+geom_bar() ``` ✅ ``` r ggplot(hotels, aes(x = hotel, y = deposit_type)) + geom_bar() ``` --- ## `arrange` in ascending / descending order .pull-left[ ``` r hotels |> select(adults, children, babies) |> * arrange(babies) ``` ``` ## # A tibble: 119,390 × 3 ## adults children babies ## <dbl> <dbl> <dbl> ## 1 2 0 0 ## 2 2 0 0 ## 3 1 0 0 ## 4 1 0 0 ## 5 2 0 0 ## 6 2 0 0 ## # ℹ 119,384 more rows ``` ] .pull-right[ ``` r hotels |> select(adults, children, babies) |> * arrange(desc(babies)) ``` ``` ## # A tibble: 119,390 × 3 ## adults children babies ## <dbl> <dbl> <dbl> ## 1 2 0 10 ## 2 1 0 9 ## 3 2 0 2 ## 4 2 0 2 ## 5 2 0 2 ## 6 2 0 2 ## # ℹ 119,384 more rows ``` ] --- ## `slice` for certain row numbers .midi[ ``` r # first five hotels |> * slice(1:5) ``` ``` ## # A tibble: 5 × 32 ## hotel is_canceled lead_time arrival_date_year ## <chr> <dbl> <dbl> <dbl> ## 1 Resort Hotel 0 342 2015 ## 2 Resort Hotel 0 737 2015 ## 3 Resort Hotel 0 7 2015 ## 4 Resort Hotel 0 13 2015 ## 5 Resort Hotel 0 14 2015 ## # ℹ 28 more variables: arrival_date_month <chr>, ## # arrival_date_week_number <dbl>, ## # arrival_date_day_of_month <dbl>, ## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, ## # adults <dbl>, children <dbl>, babies <dbl>, meal <chr>, ## # country <chr>, market_segment <chr>, ## # distribution_channel <chr>, is_repeated_guest <dbl>, … ``` ] --- .tip[ In R, you can use the `#` for adding comments to your code. Any text following `#` will be printed as is, and won't be run as R code. This is useful for leaving comments in your code and for temporarily disabling certain lines of code while debugging. ] .small[ ``` r hotels |> # slice the first five rows # this line is a comment #select(hotel) |> # this one doesn't run slice(1:5) # this line runs ``` ``` ## # A tibble: 5 × 32 ## hotel is_canceled lead_time arrival_date_year ## <chr> <dbl> <dbl> <dbl> ## 1 Resort Hotel 0 342 2015 ## 2 Resort Hotel 0 737 2015 ## 3 Resort Hotel 0 7 2015 ## 4 Resort Hotel 0 13 2015 ## 5 Resort Hotel 0 14 2015 ## # ℹ 28 more variables: arrival_date_month <chr>, ## # arrival_date_week_number <dbl>, ... ``` ] --- # Quiz How many rows and columns do you expect the following code to produce? `starwars` has 87 rows and 14 columns to begin with (87 x 14). ``` r starwars |> select(name, homeworld, species) ``` a. 87 x 14 b. 3 x 14 c. 87 x 3 d. 14 x 3 --- # Quiz How many rows and columns do you expect the following code to produce? `starwars` has 87 rows and 14 columns to begin with. ``` r starwars |> slice(1:10) ``` a. 87 x 10 b. 10 x 14 c. 14 x 10 d. 10 x 87