class: center, middle, inverse, title-slide .title[ # Data Wrangling II ] .subtitle[ ## STAT 7500 ] .author[ ### Katie Fitzgerald, adpated from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a> </span> </div> --- # Goals Today you will practice: + data wrangling with `dplyr` verbs + joining multiple data frames + pivoting data frames to wide or long format + planning a data wrangling pipeline to accomplished a desired task/visualization --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep rows that meet a certain criteria? A. `keep()` B. `where()` C. `filter()` D. `select()` E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep specified columns? A. `keep()` B. `colname()` C. `filter()` D. `select()` E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Recall the `midwest` dataset you used for homework (dimensions: 437 x 2), where each row is a `county`, and there is data on counties from 5 Midwestern states. How many rows and columns will the following code produce? ``` r midwest |> count(state) ``` A. 437 x 3 B. 5 x 2 C. 1 x 1 D. 5 x 28 --- # Still thinking about the `midwest` data... You want to calculate the average percentage of college-educated adults (percollege) by state. Which verbs do you need? A. `filter()` and mutate() B. `group_by()` and `summarize()` C. `mutate()` and `arrange()` D. `count()` and `summarize()` --- # Still thinking about the `midwest` data... You want to keep only counties where less than 10% of the population lives in poverty `(percbelowpoverty)` . Which verb do you use? A. `mutate()` B. `filter()` C. `group_by()` D. `summarize()` --- # Still thinking about the `midwest` data... You want to create a new variable for the proportion of children in the population, using `perc_child = popchild / poptotal`. Which verb should you use? A. `summarize()` B. `filter()` C. `mutate()` D. `count()` --- # Still thinking about the `midwest` data... How many rows will the resulting dataset have? ``` r midwest |> group_by(state) |> mutate(mean_poverty = mean(percbelowpoverty)) ``` A. One per state B. One per value of `percbelowpoverty` C. Same number as the original dataset D. Depends on missing values --- # Still thinking about the `midwest` data... You want to find out which state has the highest average percentage of people in poverty. Which pipeline would give you that result? ``` r #A midwest |> summarize(max(perpoverty)) #B midwest |> group_by(state) |> summarize(avg_pov = mean(percbelowpoverty)) |> arrange(desc(avg_pov)) #C midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max() #D midwest |> count(percbelowpoverty) |> arrange() ``` --- class: middle # Data Wrangling: Joining multiple datasets --- class: middle # .hand[We...] .huge[.green[have]] .hand[multiple data frames] .huge[.pink[want]] .hand[to bring them together] --- ## Data: Women in science Information on "10 women in science who changed the world" .small[ |name | |:------------------| |Ada Lovelace | |Marie Curie | |Janaki Ammal | |Chien-Shiung Wu | |Katherine Johnson | |Rosalind Franklin | |Vera Rubin | |Gladys West | |Flossie Wong-Staal | |Jennifer Doudna | ] .footnote[ Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world) ] --- ## Inputs .panelset[ .panel[.panel-name[professions] ``` r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] .panel[.panel-name[dates] ``` r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 NA ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ``` r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, discovery of elem… ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu confim and refine theory of radioactive bet… ## 5 Katherine Johnson calculations of orbital mechanics critical … ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the E… ## 8 Flossie Wong-Staal first scientist to clone HIV and create a m… ## 9 Jennifer Doudna one of the primary developers of CRISPR, a … ``` ] ] --- ## Desired output ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- ## Inputs, reminder .pull-left[ ``` r names(professions) ``` ``` ## [1] "name" "profession" ``` ``` r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ``` r names(works) ``` ``` ## [1] "name" "known_for" ``` ] .pull-right[ ``` r nrow(professions) ``` ``` ## [1] 10 ``` ``` r nrow(dates) ``` ``` ## [1] 8 ``` ``` r nrow(works) ``` ``` ## [1] 9 ``` ] --- class: middle # Joining data frames --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Mutating joins (add variables in) - `left_join()`: keeps all rows in x - `right_join()`: keeps all rows in y - `full_join()`: keeps all rows in x or y - `inner_join()`: only keeps rows from x that have a matching key in y --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Filtering joins (remove rows based on matches, don't bring in new variables) - `semi_join()`: only keeps rows from x that have a match in y; does not bring in y's columns - `anti_join()`: only keeps rows from x that do NOT have a match in y; does not bring in y's columns --- ## Setup For the next few slides... .pull-left[ ``` r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ``` r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r left_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` What do you expect? How many rows and columns? ``` r professions |> * left_join(dates) ``` -- ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molec… 1947 NA ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r right_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` What do you expect? How many rows and columns? ``` r professions |> * right_join(dates) ``` -- ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecu… 1947 NA ## 8 Jennifer Doudna Biochemist 1964 NA ``` --- ## `full_join()` .pull-left[ <img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r full_join(x, y) ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ``` r dates |> * full_join(works) ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biod… ## 2 Chien-Shiung Wu 1912 1997 confim and refine th… ## 3 Katherine Johnson 1918 2020 calculations of orbi… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark ma… ## 6 Gladys West 1930 NA mathematical modelin… ## 7 Flossie Wong-Staal 1947 NA first scientist to c… ## 8 Jennifer Doudna 1964 NA one of the primary d… ## 9 Ada Lovelace NA NA first computer algor… ## 10 Marie Curie NA NA theory of radioactiv… ``` --- ## `inner_join()` .pull-left[ <img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r inner_join(x, y) ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ``` r dates |> * inner_join(works) ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodi… ## 2 Chien-Shiung Wu 1912 1997 confim and refine the… ## 3 Katherine Johnson 1918 2020 calculations of orbit… ## 4 Vera Rubin 1928 2016 existence of dark mat… ## 5 Gladys West 1930 NA mathematical modeling… ## 6 Flossie Wong-Staal 1947 NA first scientist to cl… ## 7 Jennifer Doudna 1964 NA one of the primary de… ``` --- ## `semi_join()` .pull-left[ <img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r semi_join(x, y) ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` What do you expect? How many rows and columns? ``` r dates |> * semi_join(works) ``` -- ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 NA ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r anti_join(x, y) ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ``` r dates |> * anti_join(works) ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it altogether ``` r professions |> left_join(dates) |> left_join(works) ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- class: middle # Case study: Student records --- ## Student records - Have: - Enrolment: official university enrolment records - Survey: Student provided info; missing students who never filled it out and including students who filled it out but dropped the class - Want: Survey info for all enrolled in class -- .pull-left[ ``` r enrolment ``` ``` ## # A tibble: 3 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ## 2 2 Hermine ## 3 3 Sura Selvarajah ``` ] .pull-right[ ``` r survey ``` ``` ## # A tibble: 4 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 2 Hermine bakealongwithhermine ## 2 3 Sura surasbakes ## 3 4 Peter peter_bakes ## 4 5 Mark thebakingbuddha ``` ] --- # Quiz Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)? A. `enrolment |> left_join(survey, by = "id")` B. `enrolment |> inner_join(survey, by = "id")` C. `enrolment |> semi_join(survey, by = "id")` D. `enrolment |> full_join(survey, by = "id")` --- # Quiz Which code will return rows for students who did not fill out the survey? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- # Quiz Which code will return rows for students who filled out the survey but dropped the course? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- ## Student records .panelset[ .panel[.panel-name[In class] ``` r enrolment |> * left_join(survey, by = "id") ``` ``` ## # A tibble: 3 × 4 ## id name.x name.y username ## <dbl> <chr> <chr> <chr> ## 1 1 Dave Friday <NA> <NA> ## 2 2 Hermine Hermine bakealongwithhermine ## 3 3 Sura Selvarajah Sura surasbakes ``` ] .panel[.panel-name[Survey missing] ``` r enrolment |> * anti_join(survey, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ``` ] .panel[.panel-name[Dropped] ``` r survey |> * anti_join(enrolment, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 4 Peter peter_bakes ## 2 5 Mark thebakingbuddha ``` ] ] --- # Your turn: week_09.qmd Recall: `flights` contains data about every flight that departed La Guardia, JFK, or Newark airports in 2013 <img src="img/nycflights13.png" width="60%" style="display: block; margin: auto;" /> Exercises 1 - 5 --- # How could we create this? .pull-left[ In groups, brainstorm the data wrangling required to create this visualizaton. + What does the data need to look like? What does one row of the data represent? What columns do you need? Sketch out a few "dummy" rows of what you expect the data to contain. + Which of the above columns already exist (in which datasets), and which ones need to be created? + What joining and/or aggregation steps need to take place? Write some psuedocode for this. ] .pull-right[  ] --- class: middle # Case study: Grocery sales --- ## Grocery sales - Have: - Purchases: One row per customer per item, listing purchases they made - Prices: One row per item in the store, listing their prices - Want: Total revenue -- .pull-left[ ``` r purchases ``` ``` ## # A tibble: 5 × 2 ## customer_id item ## <dbl> <chr> ## 1 1 bread ## 2 1 milk ## 3 1 banana ## 4 2 milk ## 5 2 toilet paper ``` ] .pull-right[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] --- ## Grocery sales .panelset[ .panel[.panel-name[Total revenue] .pull-left[ ``` r purchases |> * left_join(prices) ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> left_join(prices) |> * summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 1 × 1 ## total_revenue ## <dbl> ## 1 5.75 ``` ] ] .panel[.panel-name[Revenue per customer] .pull-left[ ``` r purchases |> left_join(prices) ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> left_join(prices) |> * group_by(customer_id) |> summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 2 × 2 ## customer_id total_revenue ## <dbl> <dbl> ## 1 1 1.95 ## 2 2 3.8 ``` ] ] ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[data organised in an unideal way for our analysis] .huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis] --- ## Data: Sales <br> .pull-left[ ### .green[We have...] ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[We want...] ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- class: middle # Pivoting data --- ## Not this... <img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" /> --- ## but this! .center[ <img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] --- ## Wider vs. longer .pull-left[ ### .green[wider] more columns ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[longer] more rows ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) ] .pull-right[ ``` r pivot_longer( * data, cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format ] .pull-right[ ``` r pivot_longer( data, * cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) ] .pull-right[ ``` r pivot_longer( data, cols, * names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ``` r pivot_longer( data, cols, names_to = "name", * values_to = "value" ) ``` ] --- ## Customers `\(\rightarrow\)` purchases ``` r purchases <- customers |> * pivot_longer( * cols = item_1:item_3, # variables item_1 to item_3 * names_to = "item_no", # column names -> new column called item_no * values_to = "item" # values in columns -> new column called item * ) purchases ``` ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` --- ## Why pivot? Most likely, because the next step of your analysis needs it -- .pull-left[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> * left_join(prices) ``` ``` ## # A tibble: 6 × 4 ## customer_id item_no item price ## <dbl> <chr> <chr> <dbl> ## 1 1 item_1 bread 1 ## 2 1 item_2 milk 0.8 ## 3 1 item_3 banana 0.15 ## 4 2 item_1 milk 0.8 ## 5 2 item_2 toilet paper 3 ## 6 2 item_3 <NA> NA ``` ] --- ## Purchases `\(\rightarrow\)` customers .pull-left-narrow[ - `data` (as usual) - `names_from`: which column in the long format contains the what should be column names in the wide format - `values_from`: which column in the long format contains the what should be values in the new columns in the wide format ] .pull-right-wide[ ``` r purchases |> * pivot_wider( * names_from = item_no, * values_from = item * ) ``` ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] --- class: middle # Case study: Approval rating of Donald Trump --- <img src="img/trump-approval.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/) ] --- ## Goal Write psuedocode required to create this visualization <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-71-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Data .pull-left-wide[ ``` r trump ``` ``` ## # A tibble: 2,702 × 4 ## subgroup date approval disapproval ## <chr> <date> <dbl> <dbl> ## 1 Voters 2020-10-04 44.7 52.2 ## 2 Adults 2020-10-04 43.2 52.6 ## 3 Adults 2020-10-03 43.2 52.6 ## 4 Voters 2020-10-03 45.0 51.7 ## 5 Adults 2020-10-02 43.3 52.4 ## 6 Voters 2020-10-02 44.5 52.1 ## 7 Voters 2020-10-01 44.1 52.8 ## 8 Adults 2020-10-01 42.7 53.3 ## 9 Adults 2020-09-30 42.2 53.7 ## 10 Voters 2020-09-30 44.2 52.7 ## # ℹ 2,692 more rows ``` ] -- .pull-right-narrow[ **Aesthetic mappings:** ✅ x = `date` ❌ y = `rating_value` ❌ color = `rating_type` **Facet:** ✅ `subgroup` (Adults and Voters) ] --- ## Goal <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-73-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Pivot ``` r trump_longer <- trump |> pivot_longer( cols = c(approval, disapproval), names_to = "rating_type", values_to = "rating_value" ) trump_longer ``` ``` ## # A tibble: 5,404 × 4 ## subgroup date rating_type rating_value ## <chr> <date> <chr> <dbl> ## 1 Voters 2020-10-04 approval 44.7 ## 2 Voters 2020-10-04 disapproval 52.2 ## 3 Adults 2020-10-04 approval 43.2 ## 4 Adults 2020-10-04 disapproval 52.6 ## 5 Adults 2020-10-03 approval 43.2 ## 6 Adults 2020-10-03 disapproval 52.6 ## 7 Voters 2020-10-03 approval 45.0 ## 8 Voters 2020-10-03 disapproval 51.7 ... ``` --- ## Plot ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) ``` <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-75-1.png" width="60%" style="display: block; margin: auto;" /> --- .panelset[ .panel[.panel-name[Code] ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) + * scale_color_manual(values = c("darkgreen", "orange")) + * labs( * x = "Date", y = "Rating", * color = NULL, * title = "How (un)popular is Donald Trump?", * subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", * caption = "Source: FiveThirtyEight modeling estimates" * ) ``` ] .panel[.panel-name[Plot] <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-76-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- .panelset[ .panel[.panel-name[Code] ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) + scale_color_manual(values = c("darkgreen", "orange")) + labs( x = "Date", y = "Rating", color = NULL, title = "How (un)popular is Donald Trump?", subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", caption = "Source: FiveThirtyEight modeling estimates" ) + * theme_minimal() + * theme(legend.position = "bottom") ``` ] .panel[.panel-name[Plot] <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-77-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- --- class: middle # Extra slides: Case study: Berkeley admission data --- ## Berkeley admission data - Study carried out by the Graduate Division of the University of California, Berkeley in the early 70’s to evaluate whether there was a gender bias in graduate admissions. - The data come from six departments. For confidentiality we'll call them A-F. - We have information on whether the applicant was male or female and whether they were admitted or rejected. - First, we will evaluate whether the percentage of males admitted is indeed higher than females, overall. Next, we will calculate the same percentage for each department. --- ## Data .pull-left[ ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## 11 Admitted Male A ## 12 Admitted Male A ## 13 Admitted Male A ## 14 Admitted Male A ## 15 Admitted Male A ## # ℹ 4,511 more rows ``` ] .pull-right[ ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ``` ## # A tibble: 6 × 2 ## dept n ## <ord> <int> ## 1 A 933 ## 2 B 585 ## 3 C 918 ## 4 D 792 ## 5 E 584 ## 6 F 714 ``` ``` ## # A tibble: 2 × 2 ## admit n ## <fct> <int> ## 1 Rejected 2771 ## 2 Admitted 1755 ``` ] --- .question[ What can you say about the overall gender distribution? Hint: Calculate the following probabilities: `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`. ] ``` r ucbadmit |> count(gender, admit) ``` ``` ## # A tibble: 4 × 3 ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` --- ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) ``` ``` ## # A tibble: 4 × 4 ## # Groups: gender [2] ## gender admit n prop_admit ## <fct> <fct> <int> <dbl> ## 1 Female Rejected 1278 0.696 ## 2 Female Admitted 557 0.304 ## 3 Male Rejected 1493 0.555 ## 4 Male Admitted 1198 0.445 ``` - `\(P(Admit | Female)\)` = 0.304 - `\(P(Admit | Male)\)` = 0.445 --- ## Overall gender distribution .panelset[ .panel[.panel-name[Plot] <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-81-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(ucbadmit, aes(y = gender, fill = admit)) + geom_bar(position = "fill") + scale_fill_viridis_d() + labs(title = "Admit by gender", y = NULL, x = NULL) ``` ] ] --- .question[ What can you say about the gender distribution by department ? ] ``` r ucbadmit |> count(dept, gender, admit) ``` ``` ## # A tibble: 24 × 4 ## dept gender admit n ## <ord> <fct> <fct> <int> ## 1 A Female Rejected 19 ## 2 A Female Admitted 89 ## 3 A Male Rejected 313 ## 4 A Male Admitted 512 ## 5 B Female Rejected 8 ## 6 B Female Admitted 17 ## 7 B Male Rejected 207 ## 8 B Male Admitted 353 ## 9 C Female Rejected 391 ## 10 C Female Admitted 202 ## # ℹ 14 more rows ``` --- .question[ Let's try again... What can you say about the gender distribution by department? ] ``` r ucbadmit |> count(dept, gender, admit) |> pivot_wider(names_from = dept, values_from = n) ``` ``` ## # A tibble: 4 × 8 ## gender admit A B C D E F ## <fct> <fct> <int> <int> <int> <int> <int> <int> ## 1 Female Rejected 19 8 391 244 299 317 ## 2 Female Admitted 89 17 202 131 94 24 ## 3 Male Rejected 313 207 205 279 138 351 ## 4 Male Admitted 512 353 120 138 53 22 ``` --- ## Gender distribution, by department .panelset[ .panel[.panel-name[Plot] <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-84-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(ucbadmit, aes(y = gender, fill = admit)) + geom_bar(position = "fill") + facet_wrap(. ~ dept) + scale_x_continuous(labels = label_percent()) + scale_fill_viridis_d() + labs(title = "Admissions by gender and department", x = NULL, y = NULL, fill = NULL) + theme(legend.position = "bottom") ``` ] ] --- ## Case for gender discrimination? .pull-left[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-85-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-86-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Closer look at departments .panelset[ .panel[.panel-name[Output] ``` ## # A tibble: 12 × 5 ## # Groups: dept, gender [12] ## dept gender n_admitted n_applied prop_admit ## <ord> <fct> <int> <int> <dbl> ## 1 A Female 89 108 0.824 ## 2 A Male 512 825 0.621 ## 3 B Female 17 25 0.68 ## 4 B Male 353 560 0.630 ## 5 C Female 202 593 0.341 ## 6 C Male 120 325 0.369 ## 7 D Female 131 375 0.349 ## 8 D Male 138 417 0.331 ## 9 E Female 94 393 0.239 ## 10 E Male 53 191 0.277 ## 11 F Female 24 341 0.0704 ## 12 F Male 22 373 0.0590 ``` ] .panel[.panel-name[Code] ``` r ucbadmit |> count(dept, gender, admit) |> group_by(dept, gender) |> mutate( n_applied = sum(n), prop_admit = n / n_applied ) |> filter(admit == "Admitted") |> rename(n_admitted = n) |> select(-admit) |> print(n = 12) ``` ] ] --- class: middle # Simpson's paradox --- ## Relationship between two variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-89-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Relationship between two variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-91-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Considering a third variable .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-93-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Relationship between three variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-95-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Simpson's paradox - Not considering an important variable when studying a relationship can result in **Simpson's paradox** - Simpson's paradox illustrates the effect that omission of an explanatory variable can have on the measure of association between another explanatory variable and a response variable - The inclusion of a third variable in the analysis can change the apparent relationship between the other two variables --- class: middle # Aside: `group_by()` and `count()` --- ## What does group_by() do? `group_by()` takes an existing data frame and converts it into a grouped data frame where subsequent operations are performed "once per group" .pull-left[ ``` r ucbadmit ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? `group_by()` does not sort the data, `arrange()` does .pull-left[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> arrange(gender) ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Female A ## 2 Admitted Female A ## 3 Admitted Female A ## 4 Admitted Female A ## 5 Admitted Female A ## 6 Admitted Female A ## 7 Admitted Female A ## 8 Admitted Female A ## 9 Admitted Female A ## 10 Admitted Female A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? `group_by()` does not create frequency tables, `count()` does .pull-left[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## Undo grouping with ungroup() .pull-left[ ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) |> select(gender, prop_admit) ``` ``` ## # A tibble: 4 × 2 ## # Groups: gender [2] ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] .pull-right[ ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) |> select(gender, prop_admit) |> ungroup() ``` ``` ## # A tibble: 4 × 2 ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] --- ## count() is a short-hand `count()` is a short-hand for `group_by()` and then `summarise()` to count the number of observations in each group .pull-left[ ``` r ucbadmit |> group_by(gender) |> summarise(n = n()) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] .pull-right[ ``` r ucbadmit |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## count can take multiple arguments .pull-left[ ``` r ucbadmit |> group_by(gender, admit) |> summarise(n = n()) ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] .pull-right[ ``` r ucbadmit |> count(gender, admit) ``` ``` ## # A tibble: 4 × 3 ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] --- ## `summarise()` after `group_by()` - `count()` ungroups after itself - `summarise()` peels off one layer of grouping by default, or you can specify a different behaviour ``` r ucbadmit |> group_by(gender, admit) |> summarise(n = n()) ``` ``` ## `summarise()` has grouped output by 'gender'. You can override ## using the `.groups` argument. ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ```