class: center, middle, inverse, title-slide .title[ # Data Wrangling II ] .subtitle[ ## STAT 4380 ] .author[ ### Katie Fitzgerald, adpated from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://nova-stat-4380.netlify.app" target="_blank">nova-stat-4380.netlify.app</a> </span> </div> --- class: middle # Data Wrangling: Joining multiple datasets --- class: middle # .hand[We...] .huge[.green[have]] .hand[multiple data frames] .huge[.pink[want]] .hand[to bring them together] --- ## Data: Women in science Information on "10 women in science who changed the world" .small[ |name | |:------------------| |Ada Lovelace | |Marie Curie | |Janaki Ammal | |Chien-Shiung Wu | |Katherine Johnson | |Rosalind Franklin | |Vera Rubin | |Gladys West | |Flossie Wong-Staal | |Jennifer Doudna | ] .footnote[ Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world) ] --- ## Inputs .panelset[ .panel[.panel-name[professions] ``` r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] .panel[.panel-name[dates] ``` r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 NA ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ``` r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, discovery of elem… ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu confim and refine theory of radioactive bet… ## 5 Katherine Johnson calculations of orbital mechanics critical … ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the E… ## 8 Flossie Wong-Staal first scientist to clone HIV and create a m… ## 9 Jennifer Doudna one of the primary developers of CRISPR, a … ``` ] ] --- ## Desired output ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- ## Inputs, reminder .pull-left[ ``` r names(professions) ``` ``` ## [1] "name" "profession" ``` ``` r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ``` r names(works) ``` ``` ## [1] "name" "known_for" ``` ] .pull-right[ ``` r nrow(professions) ``` ``` ## [1] 10 ``` ``` r nrow(dates) ``` ``` ## [1] 8 ``` ``` r nrow(works) ``` ``` ## [1] 9 ``` ] --- class: middle # Joining data frames --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Mutating joins (add variables in) - `left_join()`: keeps all rows in x - `right_join()`: keeps all rows in y - `full_join()`: keeps all rows in x or y - `inner_join()`: only keeps rows from x that have a matching key in y --- ## Joining data frames ``` r something_join(x, y) x |> something_join(y) ``` #### Filtering joins (remove rows based on matches, don't bring in new variables) - `semi_join()`: only keeps rows from x that have a match in y; does not bring in y's columns - `anti_join()`: only keeps rows from x that do NOT have a match in y; does not bring in y's columns --- ## Setup For the next few slides... .pull-left[ ``` r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ``` r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="img/left-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r left_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` What do you expect? How many rows and columns? ``` r professions |> * left_join(dates) ``` -- ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molec… 1947 NA ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="img/right-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r right_join(x, y) ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` What do you expect? How many rows and columns? ``` r professions |> * right_join(dates) ``` -- ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecu… 1947 NA ## 8 Jennifer Doudna Biochemist 1964 NA ``` --- ## `full_join()` .pull-left[ <img src="img/full-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r full_join(x, y) ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ``` r dates |> * full_join(works) ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biod… ## 2 Chien-Shiung Wu 1912 1997 confim and refine th… ## 3 Katherine Johnson 1918 2020 calculations of orbi… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark ma… ## 6 Gladys West 1930 NA mathematical modelin… ## 7 Flossie Wong-Staal 1947 NA first scientist to c… ## 8 Jennifer Doudna 1964 NA one of the primary d… ## 9 Ada Lovelace NA NA first computer algor… ## 10 Marie Curie NA NA theory of radioactiv… ``` --- ## `inner_join()` .pull-left[ <img src="img/inner-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r inner_join(x, y) ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ``` r dates |> * inner_join(works) ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodi… ## 2 Chien-Shiung Wu 1912 1997 confim and refine the… ## 3 Katherine Johnson 1918 2020 calculations of orbit… ## 4 Vera Rubin 1928 2016 existence of dark mat… ## 5 Gladys West 1930 NA mathematical modeling… ## 6 Flossie Wong-Staal 1947 NA first scientist to cl… ## 7 Jennifer Doudna 1964 NA one of the primary de… ``` --- ## `semi_join()` .pull-left[ <img src="img/semi-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r semi_join(x, y) ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` What do you expect? How many rows and columns? ``` r dates |> * semi_join(works) ``` -- ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 NA ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="img/anti-join.gif" alt="" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] .pull-right[ ``` r anti_join(x, y) ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ``` r dates |> * anti_join(works) ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it altogether ``` r professions |> left_join(dates) |> left_join(works) ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematic… NA NA first co… ## 2 Marie Curie Physicist … NA NA theory o… ## 3 Janaki Ammal Botanist 1897 1984 hybrid s… ## 4 Chien-Shiung Wu Physicist 1912 1997 confim a… ## 5 Katherine Johnson Mathematic… 1918 2020 calculat… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existenc… ## 8 Gladys West Mathematic… 1930 NA mathemat… ## 9 Flossie Wong-Staal Virologist… 1947 NA first sc… ## 10 Jennifer Doudna Biochemist 1964 NA one of t… ``` --- class: middle # Case study: Student records --- ## Student records - Have: - Enrolment: official university enrolment records - Survey: Student provided info; missing students who never filled it out and including students who filled it out but dropped the class - Want: Survey info for all enrolled in class -- .pull-left[ ``` r enrolment ``` ``` ## # A tibble: 3 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ## 2 2 Hermine ## 3 3 Sura Selvarajah ``` ] .pull-right[ ``` r survey ``` ``` ## # A tibble: 4 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 2 Hermine bakealongwithhermine ## 2 3 Sura surasbakes ## 3 4 Peter peter_bakes ## 4 5 Mark thebakingbuddha ``` ] --- # Quiz Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)? A. `enrolment |> left_join(survey, by = "id")` B. `enrolment |> inner_join(survey, by = "id")` C. `enrolment |> semi_join(survey, by = "id")` D. `enrolment |> full_join(survey, by = "id")` --- # Quiz Which code will return rows for students who did not fill out the survey? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- # Quiz Which code will return rows for students who filled out the survey but dropped the course? A. `enrolment |> semi_join(survey, by = "id")` B. `enrolment |> anti_join(survey, by = "id")` C. `survey |> semi_join(enrolment, by = "id")` D. `survey |> anti_join(enrolment, by = "id")` --- ## Student records .panelset[ .panel[.panel-name[In class] ``` r enrolment |> * left_join(survey, by = "id") ``` ``` ## # A tibble: 3 × 4 ## id name.x name.y username ## <dbl> <chr> <chr> <chr> ## 1 1 Dave Friday <NA> <NA> ## 2 2 Hermine Hermine bakealongwithhermine ## 3 3 Sura Selvarajah Sura surasbakes ``` ] .panel[.panel-name[Survey missing] ``` r enrolment |> * anti_join(survey, by = "id") ``` ``` ## # A tibble: 1 × 2 ## id name ## <dbl> <chr> ## 1 1 Dave Friday ``` ] .panel[.panel-name[Dropped] ``` r survey |> * anti_join(enrolment, by = "id") ``` ``` ## # A tibble: 2 × 3 ## id name username ## <dbl> <chr> <chr> ## 1 4 Peter peter_bakes ## 2 5 Mark thebakingbuddha ``` ] ]