Data Wrangling II

.title[
# Data Wrangling II
]
.subtitle[
## STAT 7500
]
.author[
### Katie Fitzgerald, adpated from datasciencebox.org
]

---

layout: true
 
<div class="my-footer">

<a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a>

</div>

---

# Goals

Today you will practice:

+ data wrangling with `dplyr` verbs
+ joining multiple data frames
+ pivoting data frames to wide or long format
+ planning a data wrangling pipeline to accomplished a desired task/visualization

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep rows that meet a certain criteria?

A. `keep()`  
B. `where()`  
C. `filter()`  
D. `select()`  
E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep rows that meet a certain criteria?

A. `keep()` 
B. `where()` 
C. `filter()` ✅ 
D. `select()` 
E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep specified columns?

A. `keep()`  
B. `colname()`  
C. `filter()`  
D. `select()`  
E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Which of the following allows you to keep specified columns?

A. `keep()` 
B. `colname()` 
C. `filter()` 
D. `select()` ✅ 
E. `mutate()`

---

# A quick refresh of `dplyr` verbs in R

Recall the `midwest` dataset you used for homework (dimensions: 437 × 28),  
where each row is a county, and there is data on counties from 5 Midwestern states.

How many rows and columns will the following code produce?

``` r
midwest |> count(state)
```

A. 437 × 3  
B. 5 × 2  
C. 1 × 1  
D. 5 × 28

---

# A quick refresh of `dplyr` verbs in R

Recall the `midwest` dataset you used for homework (dimensions: 437 × 28),  
where each row is a county, and there is data on counties from 5 Midwestern states.

How many rows and columns will the following code produce?

``` r
midwest |> count(state)
```

A. 437 × 3 
B. 5 × 2 ✅ 
C. 1 × 1 
D. 5 × 28

---

# Still thinking about the `midwest` data...

You want to calculate the average percentage of college-educated adults (`percollege`) by state.  
Which verbs do you need?

A. `filter()` and `mutate()`  
B. `group_by()` and `summarize()`  
C. `mutate()` and `arrange()`  
D. `count()` and `summarize()`

---

# Still thinking about the `midwest` data...

You want to calculate the average percentage of college-educated adults (`percollege`) by state.  
Which verbs do you need?

A. `filter()` and `mutate()` 
B. `group_by()` and `summarize()` ✅ 
C. `mutate()` and `arrange()` 
D. `count()` and `summarize()`

---

# Still thinking about the `midwest` data...

You want to keep only counties where less than 10% of the population lives in poverty (`percbelowpoverty`).  
Which verb do you use?

A. `mutate()`  
B. `filter()`  
C. `group_by()`  
D. `summarize()`

---

# Still thinking about the `midwest` data...

You want to keep only counties where less than 10% of the population lives in poverty (`percbelowpoverty`).  
Which verb do you use?

A. `mutate()` 
B. `filter()` ✅ 
C. `group_by()` 
D. `summarize()`

---

# Still thinking about the `midwest` data...

You want to create a new variable for the proportion of children in the population, using  
`perc_child = popchild / poptotal`.  
Which verb should you use?

A. `summarize()`  
B. `filter()`  
C. `mutate()`  
D. `count()`

---

# Still thinking about the `midwest` data...

You want to create a new variable for the proportion of children in the population, using  
`perc_child = popchild / poptotal`.  
Which verb should you use?

A. `summarize()` 
B. `filter()` 
C. `mutate()` ✅ 
D. `count()`

---

# Still thinking about the `midwest` data...

How many rows will the resulting dataset have?

``` r
midwest |> 
  group_by(state) |> 
  mutate(mean_poverty = mean(percbelowpoverty))
```

A. One per state  
B. One per value of `percbelowpoverty`  
C. Same number as the original dataset  
D. Depends on missing values

---

# Still thinking about the `midwest` data...

How many rows will the resulting dataset have?

``` r
midwest |> 
  group_by(state) |> 
  mutate(mean_poverty = mean(percbelowpoverty))
```

A. One per state 
B. One per value of `percbelowpoverty` 
C. Same number as the original dataset ✅ 
D. Depends on missing values

---

# Still thinking about the `midwest` data...

You want to find out which state has the highest average percentage of people in poverty.  
Which pipeline would give you that result?

``` r
# A
midwest |> summarize(max(perpoverty))

# B
midwest |> 
  group_by(state) |> 
  summarize(avg_pov = mean(percbelowpoverty)) |> 
  arrange(desc(avg_pov))

# C
midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max()

# D 
midwest |> count(percbelowpoverty) |> arrange()
```

---

# Still thinking about the `midwest` data...

You want to find out which state has the highest average percentage of people in poverty.  
Which pipeline would give you that result?

``` r
# A
midwest |> summarize(max(perpoverty))

# B - correct answer
midwest |> 
 group_by(state) |> 
 summarize(avg_pov = mean(percbelowpoverty)) |> 
 arrange(desc(avg_pov)) ✅<

# C
midwest |> mutate(avg_pov = mean(percbelowpoverty)) |> max()

# D 
midwest |> count(percbelowpoverty) |> arrange()
```
---

# Data Wrangling: Joining multiple datasets

---

# .hand[We...]

---

## Data: Women in science

Information on "10 women in science who changed the world"

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Inputs

``` r
professions
```

```
## # A tibble: 10 × 2
## name profession 
## <chr> <chr> 
## 1 Ada Lovelace Mathematician 
## 2 Marie Curie Physicist and Chemist 
## 3 Janaki Ammal Botanist 
## 4 Chien-Shiung Wu Physicist 
## 5 Katherine Johnson Mathematician 
## 6 Rosalind Franklin Chemist 
## 7 Vera Rubin Astronomer 
## 8 Gladys West Mathematician 
## 9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna Biochemist
```
]

``` r
dates
```

```
## # A tibble: 8 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Janaki Ammal 1897 1984
## 2 Chien-Shiung Wu 1912 1997
## 3 Katherine Johnson 1918 2020
## 4 Rosalind Franklin 1920 1958
## 5 Vera Rubin 1928 2016
## 6 Gladys West 1930 NA
## 7 Flossie Wong-Staal 1947 NA
## 8 Jennifer Doudna 1964 NA
```
]

``` r
works
```

```
## # A tibble: 9 × 2
## name known_for 
## <chr> <chr> 
## 1 Ada Lovelace first computer algorithm 
## 2 Marie Curie theory of radioactivity, discovery of elem…
## 3 Janaki Ammal hybrid species, biodiversity protection 
## 4 Chien-Shiung Wu confim and refine theory of radioactive bet…
## 5 Katherine Johnson calculations of orbital mechanics critical …
## 6 Vera Rubin existence of dark matter 
## 7 Gladys West mathematical modeling of the shape of the E…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a m…
## 9 Jennifer Doudna one of the primary developers of CRISPR, a …
```
]

]

---

## Desired output

```
## # A tibble: 10 × 5
## name profession birth_year death_year known_for
## <chr> <chr> <dbl> <dbl> <chr> 
## 1 Ada Lovelace Mathematic… NA NA first co…
## 2 Marie Curie Physicist … NA NA theory o…
## 3 Janaki Ammal Botanist 1897 1984 hybrid s…
## 4 Chien-Shiung Wu Physicist 1912 1997 confim a…
## 5 Katherine Johnson Mathematic… 1918 2020 calculat…
## 6 Rosalind Franklin Chemist 1920 1958 <NA> 
## 7 Vera Rubin Astronomer 1928 2016 existenc…
## 8 Gladys West Mathematic… 1930 NA mathemat…
## 9 Flossie Wong-Staal Virologist… 1947 NA first sc…
## 10 Jennifer Doudna Biochemist 1964 NA one of t…
```

---

## Inputs, reminder

``` r
names(professions)
```

```
## [1] "name"       "profession"
```

``` r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

``` r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

``` r
nrow(professions)
```

```
## [1] 10
```

``` r
nrow(dates)
```

```
## [1] 8
```

``` r
nrow(works)
```

```
## [1] 9
```
]

---

# Joining data frames

---

## Joining data frames

``` r
something_join(x, y)

x |> 
  something_join(y)
```

#### Mutating joins (add variables in)

- `left_join()`: keeps all rows in x
- `right_join()`: keeps all rows in y
- `full_join()`: keeps all rows in x or y
- `inner_join()`: only keeps rows from x that have a matching key in y

---

## Joining data frames

``` r
something_join(x, y)

x |> 
  something_join(y)
```

#### Filtering joins (remove rows based on matches, don't bring in new variables)
- `semi_join()`: only keeps rows from x that have a match in y; does not bring in y's columns
- `anti_join()`: only keeps rows from x that do NOT have a match in y; does not bring in y's columns

---

## Setup

For the next few slides...

``` r
x
```

```
## # A tibble: 3 × 2
## id value_x
## <dbl> <chr> 
## 1 1 x1 
## 2 2 x2 
## 3 3 x3
```
]
.pull-right[

``` r
y
```

```
## # A tibble: 3 × 2
## id value_y
## <dbl> <chr> 
## 1 1 y1 
## 2 2 y2 
## 3 4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
left_join(x, y)
```

```
## # A tibble: 3 × 3
## id value_x value_y
## <dbl> <chr> <chr> 
## 1 1 x1 y1 
## 2 2 x2 y2 
## 3 3 x3 <NA>
```
]

---

## `left_join()`

What do you expect? How many rows and columns?

``` r
professions |>
* left_join(dates)
```

```
## # A tibble: 10 × 4
## name profession birth_year death_year
## <chr> <chr> <dbl> <dbl>
## 1 Ada Lovelace Mathematician NA NA
## 2 Marie Curie Physicist and Chemist NA NA
## 3 Janaki Ammal Botanist 1897 1984
## 4 Chien-Shiung Wu Physicist 1912 1997
## 5 Katherine Johnson Mathematician 1918 2020
## 6 Rosalind Franklin Chemist 1920 1958
## 7 Vera Rubin Astronomer 1928 2016
## 8 Gladys West Mathematician 1930 NA
## 9 Flossie Wong-Staal Virologist and Molec… 1947 NA
## 10 Jennifer Doudna Biochemist 1964 NA
```

---

## `right_join()`

.pull-left[
<img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
right_join(x, y)
```

```
## # A tibble: 3 × 3
## id value_x value_y
## <dbl> <chr> <chr> 
## 1 1 x1 y1 
## 2 2 x2 y2 
## 3 4 <NA> y4
```
]

---

## `right_join()`

What do you expect? How many rows and columns?

``` r
professions |>
* right_join(dates)
```

```
## # A tibble: 8 × 4
## name profession birth_year death_year
## <chr> <chr> <dbl> <dbl>
## 1 Janaki Ammal Botanist 1897 1984
## 2 Chien-Shiung Wu Physicist 1912 1997
## 3 Katherine Johnson Mathematician 1918 2020
## 4 Rosalind Franklin Chemist 1920 1958
## 5 Vera Rubin Astronomer 1928 2016
## 6 Gladys West Mathematician 1930 NA
## 7 Flossie Wong-Staal Virologist and Molecu… 1947 NA
## 8 Jennifer Doudna Biochemist 1964 NA
```

---

## `full_join()`

.pull-left[
<img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
full_join(x, y)
```

```
## # A tibble: 4 × 3
## id value_x value_y
## <dbl> <chr> <chr> 
## 1 1 x1 y1 
## 2 2 x2 y2 
## 3 3 x3 <NA> 
## 4 4 <NA> y4
```
]

---

## `full_join()`

``` r
dates |>
* full_join(works)
```

```
## # A tibble: 10 × 4
## name birth_year death_year known_for 
## <chr> <dbl> <dbl> <chr> 
## 1 Janaki Ammal 1897 1984 hybrid species, biod…
## 2 Chien-Shiung Wu 1912 1997 confim and refine th…
## 3 Katherine Johnson 1918 2020 calculations of orbi…
## 4 Rosalind Franklin 1920 1958 <NA> 
## 5 Vera Rubin 1928 2016 existence of dark ma…
## 6 Gladys West 1930 NA mathematical modelin…
## 7 Flossie Wong-Staal 1947 NA first scientist to c…
## 8 Jennifer Doudna 1964 NA one of the primary d…
## 9 Ada Lovelace NA NA first computer algor…
## 10 Marie Curie NA NA theory of radioactiv…
```

---

## `inner_join()`

.pull-left[
<img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
inner_join(x, y)
```

```
## # A tibble: 2 × 3
## id value_x value_y
## <dbl> <chr> <chr> 
## 1 1 x1 y1 
## 2 2 x2 y2
```
]

---

## `inner_join()`

``` r
dates |>
* inner_join(works)
```

```
## # A tibble: 7 × 4
## name birth_year death_year known_for 
## <chr> <dbl> <dbl> <chr> 
## 1 Janaki Ammal 1897 1984 hybrid species, biodi…
## 2 Chien-Shiung Wu 1912 1997 confim and refine the…
## 3 Katherine Johnson 1918 2020 calculations of orbit…
## 4 Vera Rubin 1928 2016 existence of dark mat…
## 5 Gladys West 1930 NA mathematical modeling…
## 6 Flossie Wong-Staal 1947 NA first scientist to cl…
## 7 Jennifer Doudna 1964 NA one of the primary de…
```

---

## `semi_join()`

.pull-left[
<img src="img/semi-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
semi_join(x, y)
```

```
## # A tibble: 2 × 2
## id value_x
## <dbl> <chr> 
## 1 1 x1 
## 2 2 x2
```
]

---

## `semi_join()`

What do you expect? How many rows and columns?

``` r
dates |>
* semi_join(works)
```

```
## # A tibble: 7 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Janaki Ammal 1897 1984
## 2 Chien-Shiung Wu 1912 1997
## 3 Katherine Johnson 1918 2020
## 4 Vera Rubin 1928 2016
## 5 Gladys West 1930 NA
## 6 Flossie Wong-Staal 1947 NA
## 7 Jennifer Doudna 1964 NA
```

---

## `anti_join()`

.pull-left[
<img src="img/anti-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]
.pull-right[

``` r
anti_join(x, y)
```

```
## # A tibble: 1 × 2
## id value_x
## <dbl> <chr> 
## 1 3 x3
```
]

---

## `anti_join()`

``` r
dates |>
* anti_join(works)
```

```
## # A tibble: 1 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Rosalind Franklin 1920 1958
```

---

## Putting it altogether

``` r
professions |>
  left_join(dates) |>
  left_join(works)
```

---

# Case study: Student records

---

## Student records

- Have:
  - Enrolment: official university enrolment records
  - Survey: Student provided info; missing students who never filled it out and including students who filled it out but dropped the class
- Want: Survey info for all enrolled in class

``` r
enrolment
```

```
## # A tibble: 3 × 2
## id name 
## <dbl> <chr> 
## 1 1 Dave Friday 
## 2 2 Hermine 
## 3 3 Sura Selvarajah
```
]
.pull-right[

``` r
survey
```

```
## # A tibble: 4 × 3
## id name username 
## <dbl> <chr> <chr> 
## 1 2 Hermine bakealongwithhermine
## 2 3 Sura surasbakes 
## 3 4 Peter peter_bakes 
## 4 5 Mark thebakingbuddha
```
]

---

# Quiz

Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)?

A. `enrolment |> left_join(survey, by = "id")`

B. `enrolment |> inner_join(survey, by = "id")`

C. `enrolment |> semi_join(survey, by = "id")`

D. `enrolment |> full_join(survey, by = "id")`

---

# Quiz

Which code is guaranteed to retain rows for all students enrolled in the course, but not students who dropped (i.e. are not currently enrolled)?

A. `enrolment |> left_join(survey, by = "id")` ✅ 
B. `enrolment |> inner_join(survey, by = "id")` 
C. `enrolment |> semi_join(survey, by = "id")` 
D. `enrolment |> full_join(survey, by = "id")`

---

# Quiz

Which code will return rows for students who did not fill out the survey?

A. `enrolment |> semi_join(survey, by = "id")`

B. `enrolment |> anti_join(survey, by = "id")`

C. `survey |> semi_join(enrolment, by = "id")`

D. `survey |> anti_join(enrolment, by = "id")`

---

# Quiz

Which code will return rows for students who did not fill out the survey?

A. `enrolment |> semi_join(survey, by = "id")` 
B. `enrolment |> anti_join(survey, by = "id")` ✅ 
C. `survey |> semi_join(enrolment, by = "id")` 
D. `survey |> anti_join(enrolment, by = "id")`

---

# Quiz

Which code will return rows for students who filled out the survey but dropped the course?

A. `enrolment |> semi_join(survey, by = "id")`

B. `enrolment |> anti_join(survey, by = "id")`

C. `survey |> semi_join(enrolment, by = "id")`

D. `survey |> anti_join(enrolment, by = "id")`

---

# Quiz

Which code will return rows for students who filled out the survey but dropped the course?

A. `enrolment |> semi_join(survey, by = "id")` 
B. `enrolment |> anti_join(survey, by = "id")` 
C. `survey |> semi_join(enrolment, by = "id")` 
D. `survey |> anti_join(enrolment, by = "id")` ✅

---

## Student records

``` r
enrolment |> 
* left_join(survey, by = "id")
```

```
## # A tibble: 3 × 4
## id name.x name.y username 
## <dbl> <chr> <chr> <chr> 
## 1 1 Dave Friday <NA> <NA> 
## 2 2 Hermine Hermine bakealongwithhermine
## 3 3 Sura Selvarajah Sura surasbakes
```
]

``` r
enrolment |> 
* anti_join(survey, by = "id")
```

```
## # A tibble: 1 × 2
## id name 
## <dbl> <chr> 
## 1 1 Dave Friday
```
]

``` r
survey |> 
* anti_join(enrolment, by = "id")
```

```
## # A tibble: 2 × 3
## id name username 
## <dbl> <chr> <chr> 
## 1 4 Peter peter_bakes 
## 2 5 Mark thebakingbuddha
```
]

]

---

# Your turn: week_09.qmd

Recall: `flights` contains data about every flight that departed La Guardia, JFK, or Newark airports in 2013

Exercises 1 - 5

---

# How could we create this?

In groups, brainstorm the data wrangling required to create this visualizaton.

+ What does the data need to look like? What does one row of the data represent? What columns do you need? Sketch out a few "dummy" rows of what you expect the data to contain.
+ Which of the above columns already exist (in which datasets), and which ones need to be created? 
+ What joining and/or aggregation steps need to take place? Write some psuedocode for this.
]

![](img/flights_map.png)
]