class: center, middle, inverse, title-slide .title[ # Data Wrangling III ] .subtitle[ ## STAT 7500 ] .author[ ### Katie Fitzgerald, adapted from datasciencebox.org ] --- layout: true <div class="my-footer"> <span> <a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a> </span> </div> --- # Goals Today you will practice: + pivoting data frames to wide or long format + planning a data wrangling pipeline to accomplish a desired task/visualization + computing (conditional) proportions using `dplyr` verbs --- class: middle # Case study: Grocery sales --- ## Grocery sales - Have: - Purchases: One row per customer per item, listing purchases they made - Prices: One row per item in the store, listing their prices - Want: Total revenue -- .pull-left[ ``` r purchases ``` ``` ## # A tibble: 5 × 2 ## customer_id item ## <dbl> <chr> ## 1 1 bread ## 2 1 milk ## 3 1 banana ## 4 2 milk ## 5 2 toilet paper ``` ] .pull-right[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] --- ## Grocery sales .panelset[ .panel[.panel-name[Total revenue] .pull-left[ ``` r purchases |> * left_join(prices) ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> left_join(prices) |> * summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 1 × 1 ## total_revenue ## <dbl> ## 1 5.75 ``` ] ] .panel[.panel-name[Revenue per customer] .pull-left[ ``` r purchases |> left_join(prices) ``` ``` ## # A tibble: 5 × 3 ## customer_id item price ## <dbl> <chr> <dbl> ## 1 1 bread 1 ## 2 1 milk 0.8 ## 3 1 banana 0.15 ## 4 2 milk 0.8 ## 5 2 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> left_join(prices) |> * group_by(customer_id) |> summarise(total_revenue = sum(price)) ``` ``` ## # A tibble: 2 × 2 ## 
customer_id total_revenue ## <dbl> <dbl> ## 1 1 1.95 ## 2 2 3.8 ``` ] ] ] --- class: middle # .hand[We...] .huge[.green[have]] .hand[data organised in an unideal way for our analysis] .huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis] --- ## Data: Sales How much revenue was brought in per item (e.g. milk)? <br> .pull-left[ ### .green[We have...] ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[We want...] ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- class: middle # Pivoting data --- ## Not this... <img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" /> --- ## but this! .center[ <img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] --- ## Wider vs. 
longer .pull-left[ ### .green[wider] more columns ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[longer] more rows ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) ] .pull-right[ ``` r pivot_longer( * data, cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format ] .pull-right[ ``` r pivot_longer( data, * cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) ] .pull-right[ ``` r pivot_longer( data, cols, * names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ``` r pivot_longer( data, cols, names_to = "name", * values_to = "value" ) ``` ] --- ## Customers `\(\rightarrow\)` purchases ``` r purchases <- customers |> * pivot_longer( * cols = item_1:item_3, # variables item_1 to item_3 * names_to = "item_no", # column names -> new column called item_no * values_to = "item" # values in columns -> new column called item * ) purchases ``` ``` ## # A tibble: 6 × 3 ## customer_id item_no item ## <dbl> <chr> <chr> ## 1 1 item_1 bread ## 2 1 item_2 milk ## 3 1 item_3 banana ## 4 2 item_1 milk ## 5 2 item_2 toilet paper ## 6 2 item_3 <NA> ``` --- ## Why pivot? 
Most likely, because the next step of your analysis needs it -- .pull-left[ ``` r prices ``` ``` ## # A tibble: 5 × 2 ## item price ## <chr> <dbl> ## 1 avocado 0.5 ## 2 banana 0.15 ## 3 bread 1 ## 4 milk 0.8 ## 5 toilet paper 3 ``` ] .pull-right[ ``` r purchases |> * left_join(prices) ``` ``` ## # A tibble: 6 × 4 ## customer_id item_no item price ## <dbl> <chr> <chr> <dbl> ## 1 1 item_1 bread 1 ## 2 1 item_2 milk 0.8 ## 3 1 item_3 banana 0.15 ## 4 2 item_1 milk 0.8 ## 5 2 item_2 toilet paper 3 ## 6 2 item_3 <NA> NA ``` ] --- ## Purchases `\(\rightarrow\)` customers .pull-left-narrow[ - `data` (as usual) - `names_from`: which column in the long format contains what should be the column names in the wide format - `values_from`: which column in the long format contains what should be the values in the new columns in the wide format ] .pull-right-wide[ ``` r purchases |> * pivot_wider( * names_from = item_no, * values_from = item * ) ``` ``` ## # A tibble: 2 × 4 ## customer_id item_1 item_2 item_3 ## <dbl> <chr> <chr> <chr> ## 1 1 bread milk banana ## 2 2 milk toilet paper <NA> ``` ] --- class: middle # Case study: Approval rating of Donald Trump --- <img src="img/trump-approval.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/) ] --- ## Goal Write the pseudocode required to create this visualization <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Data .pull-left-wide[ ``` r trump ``` ``` ## # A tibble: 2,702 × 4 ## subgroup date approval disapproval ## <chr> <date> <dbl> <dbl> ## 1 Voters 2020-10-04 44.7 52.2 ## 2 Adults 2020-10-04 43.2 52.6 ## 3 Adults 2020-10-03 43.2 52.6 ## 4 Voters 2020-10-03 45.0 51.7 ## 5 Adults 2020-10-02 43.3 52.4 ## 6 Voters 2020-10-02 44.5 52.1 ## 7 Voters 2020-10-01 44.1 52.8 ## 8 Adults 2020-10-01 42.7 53.3 ## 9 Adults 2020-09-30 42.2 53.7 ## 10 
Voters 2020-09-30 44.2 52.7 ## # ℹ 2,692 more rows ``` ] -- .pull-right-narrow[ **Aesthetic mappings:** ✅ x = `date` ❌ y = `rating_value` ❌ color = `rating_type` **Facet:** ✅ `subgroup` (Adults and Voters) ] --- ## Goal <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Pivot ``` r trump_longer <- trump |> pivot_longer( cols = c(approval, disapproval), names_to = "rating_type", values_to = "rating_value" ) trump_longer ``` ``` ## # A tibble: 5,404 × 4 ## subgroup date rating_type rating_value ## <chr> <date> <chr> <dbl> ## 1 Voters 2020-10-04 approval 44.7 ## 2 Voters 2020-10-04 disapproval 52.2 ## 3 Adults 2020-10-04 approval 43.2 ## 4 Adults 2020-10-04 disapproval 52.6 ## 5 Adults 2020-10-03 approval 43.2 ## 6 Adults 2020-10-03 disapproval 52.6 ## 7 Voters 2020-10-03 approval 45.0 ## 8 Voters 2020-10-03 disapproval 51.7 ... ``` --- ## Plot ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) ``` <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" /> --- .panelset[ .panel[.panel-name[Code] ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + geom_line() + facet_wrap(~ subgroup) + * scale_color_manual(values = c("darkgreen", "orange")) + * labs( * x = "Date", y = "Rating", * color = NULL, * title = "How (un)popular is Donald Trump?", * subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", * caption = "Source: FiveThirtyEight modeling estimates" * ) ``` ] .panel[.panel-name[Plot] <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-31-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- .panelset[ .panel[.panel-name[Code] ``` r ggplot(trump_longer, aes(x = date, y = rating_value, color = rating_type, group = 
rating_type)) + geom_line() + facet_wrap(~ subgroup) + scale_color_manual(values = c("darkgreen", "orange")) + labs( x = "Date", y = "Rating", color = NULL, title = "How (un)popular is Donald Trump?", subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", caption = "Source: FiveThirtyEight modeling estimates" ) + * theme_minimal() + * theme(legend.position = "bottom") ``` ] .panel[.panel-name[Plot] <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-32-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- class: middle # Case study: Berkeley admission data --- ## Berkeley admission data - Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a gender bias in graduate admissions. -- - The data come from six departments. For confidentiality we'll call them A-F. -- - We have information on whether the applicant was male or female and whether they were admitted or rejected. -- - First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentage for each department. 
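
These slides use a data frame called `ucbadmit` without showing how it is loaded. A minimal sketch for following along, assuming base R's built-in `UCBAdmissions` table holds the same counts (the name `ucb_sketch` is hypothetical, and factor level order and column types may differ from the `ucbadmit` object used in the slides):

``` r
library(dplyr)
library(tidyr)

# Assumption: datasets::UCBAdmissions (a 3-way contingency table)
# stands in for the `ucbadmit` data frame used in these slides.
ucb_sketch <- as.data.frame(UCBAdmissions) |>  # columns: Admit, Gender, Dept, Freq
  as_tibble() |>
  uncount(Freq) |>                             # one row per applicant
  rename(admit = Admit, gender = Gender, dept = Dept)

nrow(ucb_sketch)

# Same tallies as the summary tables, though row order may differ
ucb_sketch |> count(gender)
ucb_sketch |> count(admit)
```

`uncount()` is the inverse of `count()`: it repeats each row `Freq` times, which turns the table of counts back into one row per applicant.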
--- ## Data .pull-left[ ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## 11 Admitted Male A ## 12 Admitted Male A ## 13 Admitted Male A ## 14 Admitted Male A ## 15 Admitted Male A ## # ℹ 4,511 more rows ``` ] .pull-right[ ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ``` ## # A tibble: 6 × 2 ## dept n ## <ord> <int> ## 1 A 933 ## 2 B 585 ## 3 C 918 ## 4 D 792 ## 5 E 584 ## 6 F 714 ``` ``` ## # A tibble: 2 × 2 ## admit n ## <fct> <int> ## 1 Rejected 2771 ## 2 Admitted 1755 ``` ] --- .question[ What can you say about the overall gender distribution? Hint: Calculate the following probabilities: `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`. ] ``` r ucbadmit |> count(gender, admit) ``` ``` ## # A tibble: 4 × 3 ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` --- ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) ``` ``` ## # A tibble: 4 × 4 ## # Groups: gender [2] ## gender admit n prop_admit ## <fct> <fct> <int> <dbl> ## 1 Female Rejected 1278 0.696 ## 2 Female Admitted 557 0.304 ## 3 Male Rejected 1493 0.555 ## 4 Male Admitted 1198 0.445 ``` - `\(P(Admit | Female)\)` = 0.304 - `\(P(Admit | Male)\)` = 0.445 --- ## Overall gender distribution .panelset[ .panel[.panel-name[Plot] <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-36-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(ucbadmit, aes(y = gender, fill = admit)) + geom_bar(position = "fill") + scale_fill_viridis_d() + labs(title = "Admit by gender", y = NULL, x = NULL) ``` ] ] --- .question[ How could we investigate the gender distribution 
by department? ] ``` r ucbadmit |> count(dept, gender, admit) ``` ``` ## # A tibble: 24 × 4 ## dept gender admit n ## <ord> <fct> <fct> <int> ## 1 A Female Rejected 19 ## 2 A Female Admitted 89 ## 3 A Male Rejected 313 ## 4 A Male Admitted 512 ## 5 B Female Rejected 8 ## 6 B Female Admitted 17 ## 7 B Male Rejected 207 ## 8 B Male Admitted 353 ## 9 C Female Rejected 391 ## 10 C Female Admitted 202 ## # ℹ 14 more rows ``` --- ``` r ucbadmit |> count(dept, gender, admit) |> pivot_wider(names_from = dept, values_from = n) ``` ``` ## # A tibble: 4 × 8 ## gender admit A B C D E F ## <fct> <fct> <int> <int> <int> <int> <int> <int> ## 1 Female Rejected 19 8 391 244 299 317 ## 2 Female Admitted 89 17 202 131 94 24 ## 3 Male Rejected 313 207 205 279 138 351 ## 4 Male Admitted 512 353 120 138 53 22 ``` --- ## Gender distribution, by department .panelset[ .panel[.panel-name[Plot] <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(ucbadmit, aes(y = gender, fill = admit)) + geom_bar(position = "fill") + facet_wrap(. ~ dept) + scale_x_continuous(labels = label_percent()) + scale_fill_viridis_d() + labs(title = "Admissions by gender and department", x = NULL, y = NULL, fill = NULL) + theme(legend.position = "bottom") ``` ] ] --- ## Case for gender discrimination? 
.pull-left[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-41-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-42-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Closer look at departments .panelset[ .panel[.panel-name[Output] ``` ## # A tibble: 12 × 5 ## # Groups: dept, gender [12] ## dept gender n_admitted n_applied prop_admit ## <ord> <fct> <int> <int> <dbl> ## 1 A Female 89 108 0.824 ## 2 A Male 512 825 0.621 ## 3 B Female 17 25 0.68 ## 4 B Male 353 560 0.630 ## 5 C Female 202 593 0.341 ## 6 C Male 120 325 0.369 ## 7 D Female 131 375 0.349 ## 8 D Male 138 417 0.331 ## 9 E Female 94 393 0.239 ## 10 E Male 53 191 0.277 ## 11 F Female 24 341 0.0704 ## 12 F Male 22 373 0.0590 ``` ] .panel[.panel-name[Code] ``` r ucbadmit |> count(dept, gender, admit) |> group_by(dept, gender) |> mutate( n_applied = sum(n), prop_admit = n / n_applied ) |> filter(admit == "Admitted") |> rename(n_admitted = n) |> select(-admit) |> print(n = 12) ``` ] ] --- class: middle # Simpson's paradox --- ## Relationship between two variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-45-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Relationship between two variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-47-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Considering a third variable .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` 
] .pull-right[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-49-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Relationship between three variables .pull-left[ ``` ## # A tibble: 8 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 2 4 A ## 2 3 3 A ## 3 4 2 A ## 4 5 1 A ## 5 6 11 B ## 6 7 10 B ## 7 8 9 B ## 8 9 8 B ``` ] .pull-right[ <img src="05-data-wrangling-3_files/figure-html/unnamed-chunk-51-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Simpson's paradox - Not considering an important variable when studying a relationship can result in **Simpson's paradox** - Simpson's paradox illustrates the effect that omission of an explanatory variable can have on the measure of association between another explanatory variable and a response variable - The inclusion of a third variable in the analysis can change the apparent relationship between the other two variables [TED-Ed: How statistics can be misleading](https://www.youtube.com/watch?v=sxYrzzy3cq8) --- class: middle # Aside: `group_by()` and `count()` --- ## What does group_by() do? 
`group_by()` takes an existing data frame and converts it into a grouped data frame where subsequent operations are performed "once per group" .pull-left[ ``` r ucbadmit ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? `group_by()` does not sort the data, `arrange()` does .pull-left[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> arrange(gender) ``` ``` ## # A tibble: 4,526 × 3 ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Female A ## 2 Admitted Female A ## 3 Admitted Female A ## 4 Admitted Female A ## 5 Admitted Female A ## 6 Admitted Female A ## 7 Admitted Female A ## 8 Admitted Female A ## 9 Admitted Female A ## 10 Admitted Female A ## # ℹ 4,516 more rows ``` ] --- ## What does group_by() not do? 
`group_by()` does not create frequency tables, `count()` does .pull-left[ ``` r ucbadmit |> group_by(gender) ``` ``` ## # A tibble: 4,526 × 3 ## # Groups: gender [2] ## admit gender dept ## <fct> <fct> <ord> ## 1 Admitted Male A ## 2 Admitted Male A ## 3 Admitted Male A ## 4 Admitted Male A ## 5 Admitted Male A ## 6 Admitted Male A ## 7 Admitted Male A ## 8 Admitted Male A ## 9 Admitted Male A ## 10 Admitted Male A ## # ℹ 4,516 more rows ``` ] .pull-right[ ``` r ucbadmit |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## Undo grouping with ungroup() .pull-left[ ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) |> select(gender, prop_admit) ``` ``` ## # A tibble: 4 × 2 ## # Groups: gender [2] ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] .pull-right[ ``` r ucbadmit |> count(gender, admit) |> group_by(gender) |> mutate(prop_admit = n / sum(n)) |> select(gender, prop_admit) |> ungroup() ``` ``` ## # A tibble: 4 × 2 ## gender prop_admit ## <fct> <dbl> ## 1 Female 0.696 ## 2 Female 0.304 ## 3 Male 0.555 ## 4 Male 0.445 ``` ] --- ## count() is a short-hand `count()` is a short-hand for `group_by()` and then `summarise()` to count the number of observations in each group .pull-left[ ``` r ucbadmit |> group_by(gender) |> summarise(n = n()) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] .pull-right[ ``` r ucbadmit |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <fct> <int> ## 1 Female 1835 ## 2 Male 2691 ``` ] --- ## count can take multiple arguments .pull-left[ ``` r ucbadmit |> group_by(gender, admit) |> summarise(n = n()) ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] .pull-right[ ``` r 
ucbadmit |> count(gender, admit) ``` ``` ## # A tibble: 4 × 3 ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ``` ] --- ## `summarise()` after `group_by()` - `count()` ungroups after itself - `summarise()` peels off one layer of grouping by default, or you can specify a different behaviour ``` r ucbadmit |> group_by(gender, admit) |> summarise(n = n()) ``` ``` ## `summarise()` has grouped output by 'gender'. You can override ## using the `.groups` argument. ``` ``` ## # A tibble: 4 × 3 ## # Groups: gender [2] ## gender admit n ## <fct> <fct> <int> ## 1 Female Rejected 1278 ## 2 Female Admitted 557 ## 3 Male Rejected 1493 ## 4 Male Admitted 1198 ```
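
The "different behaviour" mentioned above is set with the `.groups` argument of `summarise()`. A minimal sketch, using a small made-up tibble (`toy`, hypothetical) in place of `ucbadmit` so that it runs on its own:

``` r
library(dplyr)

# Hypothetical toy data standing in for `ucbadmit`
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  admit  = c("Admitted", "Rejected", "Admitted", "Admitted", "Rejected")
)

# .groups = "drop" removes all grouping, like count() does
dropped <- toy |>
  group_by(gender, admit) |>
  summarise(n = n(), .groups = "drop")

# .groups = "keep" retains both grouping variables
kept <- toy |>
  group_by(gender, admit) |>
  summarise(n = n(), .groups = "keep")

group_vars(dropped)  # no grouping variables remain
group_vars(kept)     # gender and admit both remain
```

Setting `.groups` explicitly also silences the "has grouped output by" message, and `.groups = "drop"` saves a separate `ungroup()` call at the end of the pipeline.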