class: center, middle, inverse, title-slide .title[ # Data Wrangling II ] .subtitle[ ## STAT 7500 ] .author[ ### Katie Fitzgerald, adpated from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a> </span> </div> --- # Goals Today you will practice: + data wrangling with `dplyr` verbs + joining multiple data frames + pivoting data frames to wide or long format + planning a data wrangling pipeline to accomplished a desired task/visualization --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep rows that meet a certain criteria? A. `keep()` B. `where()` C. `filter()` D. `select()` E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep rows that meet a certain criteria? A. `keep()` B. `where()` <span style="background-color: #FFD580; padding: 2px 4px;">C. `filter()` ✅</span> D. `select()` E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep specified columns? A. `keep()` B. `colname()` C. `filter()` D. `select()` E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Which of the following allows you to keep specified columns? A. `keep()` B. `colname()` C. `filter()` <span style="background-color: #FFD580; padding: 2px 4px;">D. `select()` ✅</span> E. `mutate()` --- # A quick refresh of `dplyr` verbs in R Recall the `midwest` dataset you used for homework (dimensions: 437 × 28), where each row is a county, and there is data on counties from 5 Midwestern states. How many rows and columns will the following code produce? ``` r midwest |> count(state) ``` A. 437 × 3 B. 5 × 2 C. 1 × 1 D. 5 × 28 --- # A quick refresh of `dplyr` verbs in R Recall the `midwest` dataset you used for homework (dimensions: 437 × 28), where each row is a county, and there is data on counties from 5 Midwestern states. How many rows and columns will the following code produce? ``` r midwest |> count(state) ``` A. 437 × 3 <span style="background-color: #FFD580; padding: 2px 4px;">B. 5 × 2 ✅</span> C. 1 × 1 D. 5 × 28 --- # Still thinking about the `midwest` data... You want to calculate the average percentage of college-educated adults (`percollege`) by state. Which verbs do you need? A. `filter()` and `mutate()` B. `group_by()` and `summarize()` C. `mutate()` and `arrange()` D. `count()` and `summarize()` --- # Still thinking about the `midwest` data... You want to calculate the average percentage of college-educated adults (`percollege`) by state. Which verbs do you need? A. `filter()` and `mutate()` <span style="background-color: #FFD580; padding: 2px 4px;">B. `group_by()` and `summarize()` ✅</span> C. `mutate()` and `arrange()` D. `count()` and `summarize()` --- # Still thinking about the `midwest` data... You want to keep only counties where less than 10% of the population lives in poverty (`percbelowpoverty`). Which verb do you use? A. `mutate()` B. `filter()` C. `group_by()` D. `summarize()` --- # Still thinking about the `midwest` data... You want to keep only counties where less than 10% of the population lives in poverty (`percbelowpoverty`). Which verb do you use? A. `mutate()` <span style="background-color: #FFD580; padding: 2px 4px;">B. `filter()` ✅</span> C. `group_by()` D. `summarize()` --- # Still thinking about the `midwest` data... You want to create a new variable for the proportion of children in the population, using `perc_child = popchild / poptotal`. Which verb should you use? A. `summarize()` B. `filter()` C. `mutate()` D. `count()` --- # Still thinking about the `midwest` data... You want to create a new variable for the proportion of children in the population, using `perc_child = popchild / poptotal`. Which verb should you use? A. `summarize()` B. `filter()` <span style="background-color: #FFD580; padding: 2px 4px;">C. `mutate()` ✅</span> D. `count()` --- # Still thinking about the `midwest` data... How many rows will the resulting dataset have? ``` r midwest |> group_by(state) |> mutate(mean_poverty = mean(percbelowpoverty)) ``` A. One per state B. One per value of `percbelowpoverty` C. Same number as the original dataset D. Depends on missing values --- # Still thinking about the `midwest` data... How many rows will the resulting dataset have? ``` r midwest |> group_by(state) |> mutate(mean_poverty = mean(percbelowpoverty)) ``` A. One per state B. One per value of `percbelowpoverty` <span style="background-color: #FFD580; padding: 2px 4px;">C. Same number as the original dataset ✅</span> D. Depends on missing values --- # Still thinking about the `midwest` data... You want to find out which state has the highest average percentage of people in poverty. Which pipeline would give you that result? ``` r # A midwest |> summarize(max(perpoverty)) # B midwest |> group_by(state) |> summarize(avg_pov = mean(percbelowpoverty)) |> arrange(desc(avg_pov)) # C midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max() # D midwest |> count(percbelowpoverty) |> arrange() ``` --- # Still thinking about the `midwest` data... You want to find out which state has the highest average percentage of people in poverty. Which pipeline would give you that result? ``` r # A midwest |> summarize(max(perpoverty)) # B - correct answer midwest |> group_by(state) |> summarize(avg_pov = mean(percbelowpoverty)) |> arrange(desc(avg_pov)) ✅< # C midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max() # D midwest |> count(percbelowpoverty) |> arrange() ``` --- class: middle # Data Wrangling: Joining multiple datasets --- class: middle # .hand[We...] .huge[.green[have]] .hand[multiple data frames] .huge[.pink[want]] .hand[to bring them together] --- ## Data: Women in science Information on "10 women in science who changed the world" .small[ |name | |:------------------| |Ada Lovelace | |Marie Curie | |Janaki Ammal | |Chien-Shiung Wu | |Katherine Johnson | |Rosalind Franklin | |Vera Rubin | |Gladys West | |Flossie Wong-Staal | |Jennifer Doudna | ] .footnote[ Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world) ] --- ## Inputs .panelset[ .panel[.panel-name[professions] ``` r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] .panel[.panel-name[dates] ``` r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 NA ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ``` r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, discovery of elem… ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu confim and refine theory of radioactive bet… ## 5 Katherine Johnson calculations of orbital mechanics critical … ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the E… ## 8 Flossie Wong-Staal first scientist to clone HIV and create a m… ## 9 Jennifer Doudna one of the primary developers of CRISPR, a … ``` ] ] --- ## Desired output ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- ## Inputs, reminder .pull-left[ ``` r names(professions) ``` ``` ## [1] "name" "profession" ``` ``` r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ``` r names(works) ``` ``` ## [1] "name" "known_for" ``` ] .pull-right[ ``` r nrow(professions) ``` ``` ## [1] 10 ``` ``` r nrow(dates) ``` ``` ## [1] 8 ``` ``` r nrow(works) ``` ``` ## [1] 9 ``` ] --- class: middle # Joining data frames --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Mutating joins (add variables in) - `left_join()`: keeps all rows in x - `right_join()`: keeps all rows in y - `full_join()`: keeps all rows in x or y - `inner_join()`: only keeps rows from x that have a matching key in y --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Filtering joins (remove rows based on matches, don't bring in new variables) - `semi_join()`: only keeps rows from x that have a match in y; does not bring in y's columns - `anti_join()`: only keeps rows from x that do NOT have a match in y; does not bring in y's columns --- ## Setup For the next few slides... .pull-left[ ``` r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ``` r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r left_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` What do you expect? How many rows and columns? ``` r professions |> * left_join(dates) ``` -- ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molec… 1947 NA ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r right_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` What do you expect? How many rows and columns? ``` r professions |> * right_join(dates) ``` -- ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecu… 1947 NA ## 8 Jennifer Doudna Biochemist 1964 NA ``` --- ## `full_join()` .pull-left[ <img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r full_join(x, y) ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ``` r dates |> * full_join(works) ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biod… ## 2 Chien-Shiung Wu 1912 1997 confim and refine th… ## 3 Katherine Johnson 1918 2020 calculations of orbi… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark ma… ## 6 Gladys West 1930 NA mathematical modelin… ## 7 Flossie Wong-Staal 1947 NA first scientist to c… ## 8 Jennifer Doudna 1964 NA one of the primary d… ## 9 Ada Lovelace NA NA first computer algor… ## 10 Marie Curie NA NA theory of radioactiv… ``` --- ## `inner_join()` .pull-left[ <img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r inner_join(x, y) ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ``` r dates |> * inner_join(works) ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodi… ## 2 Chien-Shiung Wu 1912 1997 confim and refine the… ## 3 Katherine Johnson 1918 2020 calculations of orbit… ## 4 Vera Rubin 1928 2016 existence of dark mat… ## 5 Gladys West 1930 NA mathematical modeling… ## 6 Flossie Wong-Staal 1947 NA first scientist to cl… ## 7 Jennifer Doudna 1964 NA one of the primary de… ``` --- ## `semi_join()` .pull-left[ <img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r semi_join(x, y) ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` What do you expect? How many rows and columns? ``` r dates |> * semi_join(works) ``` -- ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 NA ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r anti_join(x, y) ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ``` r dates |> * anti_join(works) ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it altogether ``` r professions |> left_join(dates) |> left_join(works) ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- class: middle # Case study: Student records --- ## Student records - Have: - Enrolment: official university enrolment records - Survey: Student provided info; missing students who never filled it out and including students who filled it out but dropped the class - Want: Survey info for all enrolled in class -- .pull-left[ ``` r enrolment ``` ``` ## # A tibble: 3 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ## 2 2 Hermine ## 3 3 Sura Selvarajah ``` ] .pull-right[ ``` r survey ``` ``` ## # A tibble: 4 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 2 Hermine bakealongwithhermine ## 2 3 Sura surasbakes ## 3 4 Peter peter_bakes ## 4 5 Mark thebakingbuddha ``` ] --- # Quiz Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)? A. `enrolment |> left_join(survey, by = "id")` B. `enrolment |> inner_join(survey, by = "id")` C. `enrolment |> semi_join(survey, by = "id")` D. `enrolment |> full_join(survey, by = "id")` --- # Quiz Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)? <span style="color:orange; font-weight:bold;">A. `enrolment |> left_join(survey, by = "id")` ✅</span> B. `enrolment |> inner_join(survey, by = "id")` C. `enrolment |> semi_join(survey, by = "id")` D. `enrolment |> full_join(survey, by = "id")` --- # Quiz Which code will return rows for students who did not fill out the survey? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- # Quiz Which code will return rows for students who did not fill out the survey? A. `enrolment |> semi_join(survey, by = "id")` <span style="color:orange; font-weight:bold;">B. `enrolment |> anti_join(survey, by = "id")` ✅</span> C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- # Quiz Which code will return rows for students who filled out the survey but dropped the course? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- # Quiz Which code will return rows for students who filled out the survey but dropped the course? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` <span style="color:orange; font-weight:bold;">D. `survey |> anti_join(enrolment, by = "id")` ✅</span> --- ## Student records .panelset[ .panel[.panel-name[In class] ``` r enrolment |> * left_join(survey, by = "id") ``` ``` ## # A tibble: 3 × 4 ## id name.x name.y username ## <dbl> <chr> <chr> <chr> ## 1 1 Dave Friday <NA> <NA> ## 2 2 Hermine Hermine bakealongwithhermine ## 3 3 Sura Selvarajah Sura surasbakes ``` ] .panel[.panel-name[Survey missing] ``` r enrolment |> * anti_join(survey, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ``` ] .panel[.panel-name[Dropped] ``` r survey |> * anti_join(enrolment, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 4 Peter peter_bakes ## 2 5 Mark thebakingbuddha ``` ] ] --- # Your turn: week_09.qmd Recall: `flights` contains data about every flight that departed La Guardia, JFK, or Newark airports in 2013 <img src="img/nycflights13.png" width="60%" style="display: block; margin: auto;" /> Exercises 1 - 5 --- # How could we create this? .pull-left[ In groups, brainstorm the data wrangling required to create this visualizaton. + What does the data need to look like? What does one row of the data represent? What columns do you need? Sketch out a few "dummy" rows of what you expect the data to contain. + Which of the above columns already exist (in which datasets), and which ones need to be created? + What joining and/or aggregation steps need to take place? Write some psuedocode for this. ] .pull-right[  ]