Data Wrangling II

.title[
# Data Wrangling II
]
.subtitle[
## STAT 7500
]
.author[
### Katie Fitzgerald, adpated from datasciencebox.org
]

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a>
</span>
</div>

---

# Goals

Today you will practice:

+ data wrangling with `dplyr` verbs
+ joining multiple data frames
+ pivoting data frames to wide or long format
+ planning a data wrangling pipeline to accomplished a desired task/visualization

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep rows that meet a certain criteria?

A. `keep()`
  
  B. `where()`
  
  C. `filter()`
  
  D. `select()`
  
  E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep specified columns?

A. `keep()`
  
  B. `colname()`
  
  C. `filter()`
  
  D. `select()`
  
  E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Recall the `midwest` dataset you used for homework (dimensions: 437 x 2), where each row is a `county`, and there is data on counties from 5 Midwestern states.

How many rows and columns will the following code produce?

``` r
midwest |> count(state)
```

A. 437 x 3

B. 5 x 2

C. 1 x 1

D. 5 x 28

---

# Still thinking about the `midwest` data...

You want to calculate the average percentage of college-educated adults (percollege) by state.
Which verbs do you need?

A. `filter()` and mutate()

B. `group_by()` and `summarize()`

C. `mutate()` and `arrange()`

D. `count()` and `summarize()`

---

# Still thinking about the `midwest` data...

You want to keep only counties where less than 10% of the population lives in poverty `(percbelowpoverty)` .
Which verb do you use?

A. `mutate()`

B. `filter()`

C. `group_by()`

D. `summarize()`

---

# Still thinking about the `midwest` data...

You want to create a new variable for the proportion of children in the population, using
`perc_child = popchild / poptotal`.
 Which verb should you use?

A. `summarize()`

B. `filter()`

C. `mutate()`

D. `count()`

---

# Still thinking about the `midwest` data...

How many rows will the resulting dataset have?

``` r
midwest |> 
  group_by(state) |> 
  mutate(mean_poverty = mean(percbelowpoverty))
```

A. One per state

B. One per value of `percbelowpoverty`

C. Same number as the original dataset

D. Depends on missing values

---

# Still thinking about the `midwest` data...

You want to find out which state has the highest average percentage of people in poverty. Which pipeline would give you that result?

``` r
#A
midwest |> summarize(max(perpoverty))

#B
midwest |> 
      group_by(state) |> 
      summarize(avg_pov = mean(percbelowpoverty)) |> 
      arrange(desc(avg_pov))

#C
midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max()

#D 
midwest |> count(percbelowpoverty) |> arrange()
```

---

# Data Wrangling: Joining multiple datasets

---

# .hand[We...]

---

## Data: Women in science

Information on "10 women in science who changed the world"

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Inputs

``` r
professions
```

```
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```
]

``` r
dates
```

```
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA
```
]

``` r
works
```

```
## # A tibble: 9 × 2
##   name               known_for                                   
##   <chr>              <chr>                                       
## 1 Ada Lovelace       first computer algorithm                    
## 2 Marie Curie        theory of radioactivity,  discovery of elem…
## 3 Janaki Ammal       hybrid species, biodiversity protection     
## 4 Chien-Shiung Wu    confim and refine theory of radioactive bet…
## 5 Katherine Johnson  calculations of orbital mechanics critical …
## 6 Vera Rubin         existence of dark matter                    
## 7 Gladys West        mathematical modeling of the shape of the E…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a m…
## 9 Jennifer Doudna    one of the primary developers of CRISPR, a …
```
]

]

---

## Desired output

```
## # A tibble: 10 × 5
##    name               profession  birth_year death_year known_for
##    <chr>              <chr>            <dbl>      <dbl> <chr>    
##  1 Ada Lovelace       Mathematic…         NA         NA first co…
##  2 Marie Curie        Physicist …         NA         NA theory o…
##  3 Janaki Ammal       Botanist          1897       1984 hybrid s…
##  4 Chien-Shiung Wu    Physicist         1912       1997 confim a…
##  5 Katherine Johnson  Mathematic…       1918       2020 calculat…
##  6 Rosalind Franklin  Chemist           1920       1958 <NA>     
##  7 Vera Rubin         Astronomer        1928       2016 existenc…
##  8 Gladys West        Mathematic…       1930         NA mathemat…
##  9 Flossie Wong-Staal Virologist…       1947         NA first sc…
## 10 Jennifer Doudna    Biochemist        1964         NA one of t…
```

---

## Inputs, reminder

``` r
names(professions)
```

```
## [1] "name"       "profession"
```

``` r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

``` r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

``` r
nrow(professions)
```

```
## [1] 10
```

``` r
nrow(dates)
```

```
## [1] 8
```

``` r
nrow(works)
```

```
## [1] 9
```
]

---

# Joining data frames

---

## Joining data frames

``` r
something_join(x, y)

x |> 
  something_join(y)
```

#### Mutating joins (add variables in)

- `left_join()`: keeps all rows in x
- `right_join()`: keeps all rows in y
- `full_join()`: keeps all rows in x or y
- `inner_join()`: only keeps rows from x that have a matching key in y

---

## Joining data frames

``` r
something_join(x, y)

x |> 
  something_join(y)
```

#### Filtering joins (remove rows based on matches, don't bring in new variables)
- `semi_join()`: only keeps rows from x that have a match in y; does not bring in y's columns
- `anti_join()`: only keeps rows from x that do NOT have a match in y; does not bring in y's columns

---

## Setup

For the next few slides...

``` r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

``` r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
left_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

What do you expect? How many rows and columns?

``` r
professions |>
* left_join(dates)
```

```
## # A tibble: 10 × 4
##    name               profession            birth_year death_year
##    <chr>              <chr>                      <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                 NA         NA
##  2 Marie Curie        Physicist and Chemist         NA         NA
##  3 Janaki Ammal       Botanist                    1897       1984
##  4 Chien-Shiung Wu    Physicist                   1912       1997
##  5 Katherine Johnson  Mathematician               1918       2020
##  6 Rosalind Franklin  Chemist                     1920       1958
##  7 Vera Rubin         Astronomer                  1928       2016
##  8 Gladys West        Mathematician               1930         NA
##  9 Flossie Wong-Staal Virologist and Molec…       1947         NA
## 10 Jennifer Doudna    Biochemist                  1964         NA
```

---

## `right_join()`

.pull-left[
<img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
right_join(x, y)
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```
]

---

## `right_join()`

What do you expect? How many rows and columns?

``` r
professions |>
* right_join(dates)
```

```
## # A tibble: 8 × 4
##   name               profession             birth_year death_year
##   <chr>              <chr>                       <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                     1897       1984
## 2 Chien-Shiung Wu    Physicist                    1912       1997
## 3 Katherine Johnson  Mathematician                1918       2020
## 4 Rosalind Franklin  Chemist                      1920       1958
## 5 Vera Rubin         Astronomer                   1928       2016
## 6 Gladys West        Mathematician                1930         NA
## 7 Flossie Wong-Staal Virologist and Molecu…       1947         NA
## 8 Jennifer Doudna    Biochemist                   1964         NA
```

---

## `full_join()`

.pull-left[
<img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
full_join(x, y)
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```
]

---

## `full_join()`

``` r
dates |>
* full_join(works)
```

```
## # A tibble: 10 × 4
##    name               birth_year death_year known_for            
##    <chr>                   <dbl>      <dbl> <chr>                
##  1 Janaki Ammal             1897       1984 hybrid species, biod…
##  2 Chien-Shiung Wu          1912       1997 confim and refine th…
##  3 Katherine Johnson        1918       2020 calculations of orbi…
##  4 Rosalind Franklin        1920       1958 <NA>                 
##  5 Vera Rubin               1928       2016 existence of dark ma…
##  6 Gladys West              1930         NA mathematical modelin…
##  7 Flossie Wong-Staal       1947         NA first scientist to c…
##  8 Jennifer Doudna          1964         NA one of the primary d…
##  9 Ada Lovelace               NA         NA first computer algor…
## 10 Marie Curie                NA         NA theory of radioactiv…
```

---

## `inner_join()`

.pull-left[
<img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
inner_join(x, y)
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

``` r
dates |>
* inner_join(works)
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for             
##   <chr>                   <dbl>      <dbl> <chr>                 
## 1 Janaki Ammal             1897       1984 hybrid species, biodi…
## 2 Chien-Shiung Wu          1912       1997 confim and refine the…
## 3 Katherine Johnson        1918       2020 calculations of orbit…
## 4 Vera Rubin               1928       2016 existence of dark mat…
## 5 Gladys West              1930         NA mathematical modeling…
## 6 Flossie Wong-Staal       1947         NA first scientist to cl…
## 7 Jennifer Doudna          1964         NA one of the primary de…
```

---

## `semi_join()`

.pull-left[
<img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
semi_join(x, y)
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

What do you expect? How many rows and columns?

``` r
dates |>
* semi_join(works)
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## 7 Jennifer Doudna          1964         NA
```

---

## `anti_join()`

.pull-left[
<img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
anti_join(x, y)
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

``` r
dates |>
* anti_join(works)
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Putting it altogether

``` r
professions |>
  left_join(dates) |>
  left_join(works)
```

---

# Case study: Student records

---

## Student records

- Have:
  - Enrolment: official university enrolment records
  - Survey: Student provided info; missing students who never filled it out and including students who filled it out but dropped the class
- Want: Survey info for all enrolled in class

``` r
enrolment
```

```
## # A tibble: 3 × 2
##      id name           
##   <dbl> <chr>          
## 1     1 Dave Friday    
## 2     2 Hermine        
## 3     3 Sura Selvarajah
```
]
.pull-right[

``` r
survey
```

```
## # A tibble: 4 × 3
##      id name    username            
##   <dbl> <chr>   <chr>               
## 1     2 Hermine bakealongwithhermine
## 2     3 Sura    surasbakes          
## 3     4 Peter   peter_bakes         
## 4     5 Mark    thebakingbuddha
```
]

---

# Quiz

Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)?

A. `enrolment |> left_join(survey, by = "id")`

B. `enrolment |> inner_join(survey, by = "id")`

C. `enrolment |> semi_join(survey, by = "id")`

D. `enrolment |> full_join(survey, by = "id")`

---

# Quiz

Which code will return rows for students who did not fill out the survey?

A. `enrolment |> semi_join(survey, by = "id")`

B. `enrolment |> anti_join(survey, by = "id")`

C. `survey |> semi_join(enrolment, by = "id")`

D. `survey |> anti_join(enrolment, by = "id")`

---

# Quiz

Which code will return rows for students who filled out the survey but dropped the course?

A. `enrolment |> semi_join(survey, by = "id")`

B. `enrolment |> anti_join(survey, by = "id")`

C. `survey |> semi_join(enrolment, by = "id")`

D. `survey |> anti_join(enrolment, by = "id")`

---

## Student records

``` r
enrolment |> 
* left_join(survey, by = "id")
```

```
## # A tibble: 3 × 4
##      id name.x          name.y  username            
##   <dbl> <chr>           <chr>   <chr>               
## 1     1 Dave Friday     <NA>    <NA>                
## 2     2 Hermine         Hermine bakealongwithhermine
## 3     3 Sura Selvarajah Sura    surasbakes
```
]

``` r
enrolment |> 
* anti_join(survey, by = "id")
```

```
## # A tibble: 1 × 2
##      id name       
##   <dbl> <chr>      
## 1     1 Dave Friday
```
]

``` r
survey |> 
* anti_join(enrolment, by = "id")
```

```
## # A tibble: 2 × 3
##      id name  username       
##   <dbl> <chr> <chr>          
## 1     4 Peter peter_bakes    
## 2     5 Mark  thebakingbuddha
```
]

]

---

# Your turn: week_09.qmd

Recall: `flights` contains data about every flight that departed La Guardia, JFK, or Newark airports in 2013

Exercises 1 - 5

---

# How could we create this?

In groups, brainstorm the data wrangling required to create this visualizaton.

+ What does the data need to look like? What does one row of the data represent? What columns do you need? Sketch out a few "dummy" rows of what you expect the data to contain.
+ Which of the above columns already exist (in which datasets), and which ones need to be created? 
+ What joining and/or aggregation steps need to take place? Write some psuedocode for this.
]

![](img/flights_map.png)
]

---

# Case study: Grocery sales

---

## Grocery sales

- Have:
  - Purchases: One row per customer per item, listing purchases they made
  - Prices: One row per item in the store, listing their prices
- Want: Total revenue

``` r
purchases
```

```
## # A tibble: 5 × 2
##   customer_id item        
##         <dbl> <chr>       
## 1           1 bread       
## 2           1 milk        
## 3           1 banana      
## 4           2 milk        
## 5           2 toilet paper
```
]
.pull-right[

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]

---

## Grocery sales

``` r
purchases |> 
* left_join(prices)
```

```
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
```
]
.pull-right[

``` r
purchases |> 
  left_join(prices) |>
* summarise(total_revenue = sum(price))
```

```
## # A tibble: 1 × 1
##   total_revenue
##           <dbl>
## 1          5.75
```
]
]

``` r
purchases |> 
  left_join(prices)
```

``` r
purchases |> 
  left_join(prices) |>
* group_by(customer_id) |>
  summarise(total_revenue = sum(price))
```

```
## # A tibble: 2 × 2
##   customer_id total_revenue
##         <dbl>         <dbl>
## 1           1          1.95
## 2           2          3.8
```
]

]

]
---

# .hand[We...]

---

## Data: Sales

<br>

```
## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--
.pull-right[
### .pink[We want...]

```
## # A tibble: 6 × 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

---

## Wider vs. longer

--
.pull-right[
### .pink[longer]
more rows

---

## `pivot_longer()`

``` r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

``` r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

``` r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

``` r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Customers `\(\rightarrow\)` purchases

``` r
purchases <- customers |>
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```

---

## Why pivot?

Most likely, because the next step of your analysis needs it

``` r
prices
```

```
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]
.pull-right[

``` r
purchases |>
* left_join(prices)
```

```
## # A tibble: 6 × 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1   
## 2           1 item_2  milk          0.8 
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8 
## 5           2 item_2  toilet paper  3   
## 6           2 item_3  <NA>         NA
```
]

---

## Purchases `\(\rightarrow\)` customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`: which column in the long format contains the what should be column names in the wide format
- `values_from`: which column in the long format contains the what should be values in the new columns in the wide format
]
.pull-right-wide[

``` r
purchases |>
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```

---

# Case study: Approval rating of Donald Trump

---

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/)
]

---

## Goal

Write psuedocode required to create this visualization

---

## Data

``` r
trump
```

```
## # A tibble: 2,702 × 4
##    subgroup date       approval disapproval
##    <chr>    <date>        <dbl>       <dbl>
##  1 Voters   2020-10-04     44.7        52.2
##  2 Adults   2020-10-04     43.2        52.6
##  3 Adults   2020-10-03     43.2        52.6
##  4 Voters   2020-10-03     45.0        51.7
##  5 Adults   2020-10-02     43.3        52.4
##  6 Voters   2020-10-02     44.5        52.1
##  7 Voters   2020-10-01     44.1        52.8
##  8 Adults   2020-10-01     42.7        53.3
##  9 Adults   2020-09-30     42.2        53.7
## 10 Voters   2020-09-30     44.2        52.7
## # ℹ 2,692 more rows
```

]

.pull-right-narrow[
**Aesthetic mappings:**  
✅  x = `date`  
❌      y = `rating_value`  
❌      color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Goal

---

## Pivot

``` r
trump_longer <- trump |>
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

trump_longer
```

```
## # A tibble: 5,404 × 4
##    subgroup date       rating_type rating_value
##    <chr>    <date>     <chr>              <dbl>
##  1 Voters   2020-10-04 approval            44.7
##  2 Voters   2020-10-04 disapproval         52.2
##  3 Adults   2020-10-04 approval            43.2
##  4 Adults   2020-10-04 disapproval         52.6
##  5 Adults   2020-10-03 approval            43.2
##  6 Adults   2020-10-03 disapproval         52.6
##  7 Voters   2020-10-03 approval            45.0
##  8 Voters   2020-10-03 disapproval         51.7
...
```

---

## Plot

``` r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

---

``` r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Donald Trump?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-76-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

``` r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is Donald Trump?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-77-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

# Extra slides: Case study: Berkeley admission data

---

## Berkeley admission data

- Study carried out by the Graduate Division of the University of California, Berkeley in the early 70’s to evaluate whether there was a gender bias in graduate admissions.
- The data come from six departments. For confidentiality we'll call them A-F. 
- We have information on whether the applicant was male or female and whether they were admitted or rejected. 
- First, we will evaluate whether the percentage of males admitted is indeed higher than females, overall. Next, we will calculate the same percentage for each department.

---

## Data

```
## # A tibble: 4,526 × 3
##    admit    gender dept 
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A    
##  2 Admitted Male   A    
##  3 Admitted Male   A    
##  4 Admitted Male   A    
##  5 Admitted Male   A    
##  6 Admitted Male   A    
##  7 Admitted Male   A    
##  8 Admitted Male   A    
##  9 Admitted Male   A    
## 10 Admitted Male   A    
## 11 Admitted Male   A    
## 12 Admitted Male   A    
## 13 Admitted Male   A    
## 14 Admitted Male   A    
## 15 Admitted Male   A    
## # ℹ 4,511 more rows
```
]
.pull-right[

```
## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```

```
## # A tibble: 6 × 2
##   dept      n
##   <ord> <int>
## 1 A       933
## 2 B       585
## 3 C       918
## 4 D       792
## 5 E       584
## 6 F       714
```

```
## # A tibble: 2 × 2
##   admit        n
##   <fct>    <int>
## 1 Rejected  2771
## 2 Admitted  1755
```
]

---

.question[
What can you say about the overall gender distribution? Hint: Calculate the following probabilities: `\(P(Admit | Male)\)` and `\(P(Admit | Female)\)`.
]

``` r
ucbadmit |>
  count(gender, admit)
```

```
## # A tibble: 4 × 3
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```

---

``` r
ucbadmit |>
  count(gender, admit) |>
  group_by(gender) |>
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 4 × 4
## # Groups:   gender [2]
##   gender admit        n prop_admit
##   <fct>  <fct>    <int>      <dbl>
## 1 Female Rejected  1278      0.696
## 2 Female Admitted   557      0.304
## 3 Male   Rejected  1493      0.555
## 4 Male   Admitted  1198      0.445
```

- `\(P(Admit | Female)\)` = 0.304
- `\(P(Admit | Male)\)` = 0.445

---

## Overall gender distribution

.panel[.panel-name[Plot]
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-81-1.png" width="60%" style="display: block; margin: auto;" />
]

``` r
ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") + 
  scale_fill_viridis_d() +
  labs(title = "Admit by gender",
       y = NULL, x = NULL)
```
]

]

---

``` r
ucbadmit |>
  count(dept, gender, admit)
```

```
## # A tibble: 24 × 4
##    dept  gender admit        n
##    <ord> <fct>  <fct>    <int>
##  1 A     Female Rejected    19
##  2 A     Female Admitted    89
##  3 A     Male   Rejected   313
##  4 A     Male   Admitted   512
##  5 B     Female Rejected     8
##  6 B     Female Admitted    17
##  7 B     Male   Rejected   207
##  8 B     Male   Admitted   353
##  9 C     Female Rejected   391
## 10 C     Female Admitted   202
## # ℹ 14 more rows
```

---

``` r
ucbadmit |>
  count(dept, gender, admit) |>
  pivot_wider(names_from = dept, values_from = n)
```

```
## # A tibble: 4 × 8
##   gender admit        A     B     C     D     E     F
##   <fct>  <fct>    <int> <int> <int> <int> <int> <int>
## 1 Female Rejected    19     8   391   244   299   317
## 2 Female Admitted    89    17   202   131    94    24
## 3 Male   Rejected   313   207   205   279   138   351
## 4 Male   Admitted   512   353   120   138    53    22
```

---

## Gender distribution, by department

.panel[.panel-name[Plot]
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-84-1.png" width="60%" style="display: block; margin: auto;" />
]

``` r
ggplot(ucbadmit, aes(y = gender, fill = admit)) +
  geom_bar(position = "fill") +
  facet_wrap(. ~ dept) +
  scale_x_continuous(labels = label_percent()) +
  scale_fill_viridis_d() +
  labs(title = "Admissions by gender and department",
       x = NULL, y = NULL, fill = NULL) +
  theme(legend.position = "bottom")
```
]

]

---

## Case for gender discrimination?

.pull-left[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-85-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-86-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Closer look at departments

```
## # A tibble: 12 × 5
## # Groups:   dept, gender [12]
##    dept  gender n_admitted n_applied prop_admit
##    <ord> <fct>       <int>     <int>      <dbl>
##  1 A     Female         89       108     0.824 
##  2 A     Male          512       825     0.621 
##  3 B     Female         17        25     0.68  
##  4 B     Male          353       560     0.630 
##  5 C     Female        202       593     0.341 
##  6 C     Male          120       325     0.369 
##  7 D     Female        131       375     0.349 
##  8 D     Male          138       417     0.331 
##  9 E     Female         94       393     0.239 
## 10 E     Male           53       191     0.277 
## 11 F     Female         24       341     0.0704
## 12 F     Male           22       373     0.0590
```
]

]

---

# Simpson's paradox

---

## Relationship between two variables

```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B
```
]
.pull-right[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-89-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Relationship between two variables

```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B
```
]
.pull-right[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-91-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Considering a third variable

```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B
```
]
.pull-right[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-93-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Relationship between three variables

```
## # A tibble: 8 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     2     4 A    
## 2     3     3 A    
## 3     4     2 A    
## 4     5     1 A    
## 5     6    11 B    
## 6     7    10 B    
## 7     8     9 B    
## 8     9     8 B
```
]
.pull-right[
<img src="04-data-wrangling-2_files/figure-html/unnamed-chunk-95-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result in **Simpson's paradox**
- Simpson's paradox illustrates the effect that omission of an explanatory variable can have on the measure of association between another explanatory variable and a response variable
- The inclusion of a third variable in the analysis can change the apparent relationship between the other two variables

---

# Aside: `group_by()` and `count()`

---

## What does group_by() do?

`group_by()` takes an existing data frame and converts it into a grouped data frame where subsequent operations are performed "once per group"

``` r
ucbadmit
```

``` r
ucbadmit |> 
  group_by(gender)
```

```
## # A tibble: 4,526 × 3
## # Groups:   gender [2]
##    admit    gender dept 
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A    
##  2 Admitted Male   A    
##  3 Admitted Male   A    
##  4 Admitted Male   A    
##  5 Admitted Male   A    
##  6 Admitted Male   A    
##  7 Admitted Male   A    
##  8 Admitted Male   A    
##  9 Admitted Male   A    
## 10 Admitted Male   A    
## # ℹ 4,516 more rows
```
]

---

## What does group_by() not do?

`group_by()` does not sort the data, `arrange()` does

``` r
ucbadmit |> 
  group_by(gender)
```

``` r
ucbadmit |> 
  arrange(gender)
```

```
## # A tibble: 4,526 × 3
##    admit    gender dept 
##    <fct>    <fct>  <ord>
##  1 Admitted Female A    
##  2 Admitted Female A    
##  3 Admitted Female A    
##  4 Admitted Female A    
##  5 Admitted Female A    
##  6 Admitted Female A    
##  7 Admitted Female A    
##  8 Admitted Female A    
##  9 Admitted Female A    
## 10 Admitted Female A    
## # ℹ 4,516 more rows
```
]

---

## What does group_by() not do?

`group_by()` does not create frequency tables, `count()` does

``` r
ucbadmit |> 
  group_by(gender)
```

``` r
ucbadmit |> 
  count(gender)
```

```
## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]

---

## Undo grouping with ungroup()

``` r
ucbadmit |>
  count(gender, admit) |>
  group_by(gender) |>
  mutate(prop_admit = n / sum(n)) |>
  select(gender, prop_admit)
```

```
## # A tibble: 4 × 2
## # Groups:   gender [2]
##   gender prop_admit
##   <fct>       <dbl>
## 1 Female      0.696
## 2 Female      0.304
## 3 Male        0.555
## 4 Male        0.445
```
]
.pull-right[

```
## # A tibble: 4 × 2
##   gender prop_admit
##   <fct>       <dbl>
## 1 Female      0.696
## 2 Female      0.304
## 3 Male        0.555
## 4 Male        0.445
```
]

---

## count() is a short-hand

`count()` is a short-hand for `group_by()` and then `summarise()` to count the number of observations in each group

``` r
ucbadmit |>
  group_by(gender) |>
  summarise(n = n()) 
```

```
## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]
.pull-right[

``` r
ucbadmit |>
  count(gender)
```

```
## # A tibble: 2 × 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]

---

## count can take multiple arguments

``` r
ucbadmit |>
  group_by(gender, admit) |>
  summarise(n = n()) 
```

```
## # A tibble: 4 × 3
## # Groups:   gender [2]
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```
]
.pull-right[

``` r
ucbadmit |>
  count(gender, admit)
```

```
## # A tibble: 4 × 3
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```
]

---

## `summarise()` after `group_by()`

- `count()` ungroups after itself
- `summarise()` peels off one layer of grouping by default, or you can specify a different behaviour

``` r
ucbadmit |>
  group_by(gender, admit) |>
  summarise(n = n()) 
```

```
## `summarise()` has grouped output by 'gender'. You can override
## using the `.groups` argument.
```

```
## # A tibble: 4 × 3
## # Groups:   gender [2]
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```