class: center, middle, inverse, title-slide .title[ # Web Scraping (brief intro) ] .subtitle[ ## STAT 7500 ] .author[ ### Katie Fitzgerald, adapted from datasciencebox.org & Michael Posner ] --- layout: true <div class="my-footer"> <span> <a href="https://kgfitzgerald.github.io/stat-7500" target="_blank">kgfitzgerald.github.io/stat-7500</a> </span> </div> --- # Data Wrangling Overview - Isolating data – `filter`, `select`, `arrange` - Piping operator – `|>` - Deriving information – `summarize`, `group_by`, `mutate` - Combining datasets – `bind_rows`, joins - Tidy data – `pivot_longer`, `pivot_wider` - Working with strings – `stringr` - Scraping data – `rvest` --- # Data Scraping (from the internet) + Reference a file on the internet -- + Download a file from the internet -- + Scrape a table from an HTML file -- + The SelectorGadget tool -- + Using an Application Programming Interface (API) -- --- # Example 1: FiveThirtyEight Nate Silver --> ABC News + Sports, Politics, More + The Signal & The Noise (Book) [fivethirtyeight.com](https://fivethirtyeight.com) now redirects to ABC News, but [https://github.com/fivethirtyeight](https://github.com/fivethirtyeight) has a treasure trove of data.
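---

## Aside: GitHub "blob" pages vs raw files

The browser URL for a file on GitHub points to an HTML page (note the `/blob/` segment in the path); `read_csv()` needs the raw file, which GitHub serves from `raw.githubusercontent.com`. A minimal sketch of the conversion, assuming GitHub's standard raw-file URL pattern:

``` r
# The HTML page you see in the browser
blob_url <- "https://github.com/fivethirtyeight/data/blob/master/college-majors/all-ages.csv"

# Swap the host and drop the /blob/ path segment to get the raw file
raw_url <- sub("github.com", "raw.githubusercontent.com",
               sub("/blob/", "/", blob_url),
               fixed = TRUE)
raw_url
```

```
## [1] "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
```

This raw URL is what we pass to `read_csv()` on the next slide.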
--- # Example 1: FiveThirtyEight College major data: https://github.com/fivethirtyeight/data/blob/master/college-majors/all-ages.csv ``` r collmaj538_file <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv" collmaj <- read_csv(collmaj538_file) head(collmaj) ``` ``` ## # A tibble: 6 × 11 ## Major_code Major Major_category Total Employed ## <dbl> <chr> <chr> <dbl> <dbl> ## 1 1100 GENERAL AGRICULTURE Agriculture &… 128148 90245 ## 2 1101 AGRICULTURE PRODUCTI… Agriculture &… 95326 76865 ## 3 1102 AGRICULTURAL ECONOMI… Agriculture &… 33955 26321 ## 4 1103 ANIMAL SCIENCES Agriculture &… 103549 81177 ## 5 1104 FOOD SCIENCE Agriculture &… 24280 17281 ## 6 1105 PLANT SCIENCE AND AG… Agriculture &… 79409 63043 ## # ℹ 6 more variables: Employed_full_time_year_round <dbl>, ## # Unemployed <dbl>, Unemployment_rate <dbl>, Median <dbl>, ## # P25th <dbl>, P75th <dbl> ``` --- # Explore the Data — Median Salary ``` r collmaj |> select(Major, Median) |> arrange(desc(Median)) |> slice(c(1:5, (nrow(collmaj)-4):nrow(collmaj))) ``` -- ``` ## # A tibble: 10 × 2 ## Major Median ## <chr> <dbl> ## 1 PETROLEUM ENGINEERING 125000 ## 2 PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION 106000 ## 3 NAVAL ARCHITECTURE AND MARINE ENGINEERING 97000 ## 4 METALLURGICAL ENGINEERING 96000 ## 5 NUCLEAR ENGINEERING 95000 ## 6 COUNSELING PSYCHOLOGY 39000 ## 7 HUMAN SERVICES AND COMMUNITY ORGANIZATION 38000 ## 8 STUDIO ARTS 37600 ## 9 EARLY CHILDHOOD EDUCATION 35300 ## 10 NEUROSCIENCE 35000 ``` --- # Explore the Data — Unemployment ``` r collmaj |> select(Major, Unemployment_rate) |> arrange(desc(Unemployment_rate)) |> slice(c(1:5, (nrow(collmaj)-4):nrow(collmaj))) |> print(n = 10) ``` ``` ## # A tibble: 10 × 2 ## Major Unemployment_rate ## <chr> <dbl> ## 1 MISCELLANEOUS FINE ARTS 0.156 ## 2 CLINICAL PSYCHOLOGY 0.103 ## 3 MILITARY TECHNOLOGIES 0.102 ## 4 SCHOOL STUDENT COUNSELING 0.102 ## 5 LIBRARY SCIENCE 0.0948 ## 6 MATHEMATICS AND COMPUTER SCIENCE 0.0249 ## 7 
MATERIALS SCIENCE 0.0223 ## 8 PHARMACOLOGY 0.0161 ## 9 EDUCATIONAL ADMINISTRATION AND SUPERVISION 0 ## 10 GEOLOGICAL AND GEOPHYSICAL ENGINEERING 0 ``` --- ## Explore the Data — Unemployment vs Median ``` r ggplot(collmaj, aes(x = Unemployment_rate, y = Median)) + geom_point() + geom_smooth() ``` <img src="08-scraping_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Explore — Stat/CS Majors ``` r collmaj |> filter(Major_code %in% c(2101, 2102, 3700, 3701, 3702, 4005)) |> arrange(desc(Median)) |> select(Major_code, Major, Total, Unemployment_rate, Median) ``` ``` ## # A tibble: 6 × 5 ## Major_code Major Total Unemployment_rate Median ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 4005 MATHEMATICS AND COM… 7184 0.0249 92000 ## 2 2102 COMPUTER SCIENCE 783292 0.0495 78000 ## 3 3701 APPLIED MATHEMATICS 19112 0.0557 70000 ## 4 3702 STATISTICS AND DECI… 24806 0.0571 70000 ## 5 3700 MATHEMATICS 432806 0.0529 66000 ## 6 2101 COMPUTER PROGRAMMIN… 29317 0.0903 60000 ``` --- # Is This Good Coding? Problems: - Website or structure changes → code breaks - Data updates → not reproducible Solutions: - Record the date/time (`Sys.time()`) - Save a local copy --- # Download a File ``` r collmaj538_file <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv" download.file(collmaj538_file, "collmaj538.csv") ``` *Try this code - best to do in an R script (not a qmd) when retrieving data from the internet* + See also the `RCurl` package --- # Example 2: Social Security data https://www.ssa.gov/oact/babynames/numberUSbirths.html + View the source code – right-click (in most browsers) + How could we extract this? --- class: middle # Web Scraping with rvest --- ## Hypertext Markup Language - Most of the data on the web is still largely available as HTML - It is structured (hierarchical / tree-based), but it's often not available in a form useful for analysis (flat / tidy).
```html <html> <head> <title>This is a title</title> </head> <body> <p align="center">Hello world!</p> </body> </html> ``` + To learn more about the anatomy of HTML files, see w3schools (or DesignShack or the Mozilla Developer Network) --- ## rvest .pull-left[ - The **rvest** package makes basic processing and manipulation of HTML data straightforward - It's designed to work with pipelines built with `|>` ] .pull-right[ <img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Core rvest functions - `read_html` - Read HTML data from a URL or character string - `html_node` - Select a specified node from an HTML document - `html_nodes` - Select specified nodes from an HTML document - `html_table` - Parse an HTML table into a data frame - `html_text` - Extract tag pairs' content - `html_name` - Extract tags' names - `html_attrs` - Extract all of each tag's attributes - `html_attr` - Extract tags' attribute value by name --- # Scrape an HTML Table ``` r library(rvest) library(purrr) birth_file <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html" birth_file |> read_html() |> html_nodes("table") |> pluck(1) |> html_table() ``` *Try this code - in an R script.
Add comments!* *What should we explore?* --- # Example 3: State populations (Wikipedia) Source: https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population ``` r statepop_file <- "https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population" ``` Let's try: + Reading in the HTML table + Selecting & renaming columns of interest + Creating a population heat map --- # Example 3: State populations .panelset[ .panel[.panel-name[Plot] <img src="08-scraping_files/figure-html/map-plot-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r library(maps) states <- map_data("state") statepop |> mutate(state = str_to_lower(state)) |> right_join(states, by = c("state" = "region")) |> mutate( pop = as.numeric(str_remove_all(pop_2020, ",")), pop_mil = round(pop / 1000000, 1) ) |> ggplot() + geom_polygon(aes(long, lat, fill = pop_mil, group = group)) ``` ] ] --- # Example 4: PSSA Results + Pennsylvania Department of Education: [www.education.pa.gov](https://www.education.pa.gov) + PSSA is the Pennsylvania System of School Assessment -- + Measures student achievement “according to Pennsylvania's world-class academic standards” -- > “By using these standards, educators, parents and administrators can evaluate their students' strengths and weaknesses to increase students' achievement scores” -- + Subjects - reading & mathematics -- + Scored as “Below Basic”, “Basic”, “Proficient” or “Advanced” (sometimes the latter two categories are combined) -- + Replaced by the Keystone exams for high school, which are required for graduation…or maybe not!
Data page: https://www.education.pa.gov/DataAndReporting/Assessments/Pages/PSSA-Results.aspx --- # Example 5: Villanova on Reddit ``` r library(httr) library(jsonlite) library(dplyr) library(tibble) # Reddit API search for "Villanova" url <- "https://www.reddit.com/search.json?q=Villanova&limit=100" res <- GET(url, user_agent("vu-class-app")) txt <- content(res, as = "text", encoding = "UTF-8") dat <- fromJSON(txt) # Extract post data VU_posts <- as_tibble(dat$data$children$data) ``` --- # One more example: Zipcodes (SAS MP) ``` r library(rvest) library(purrr) library(dplyr) zip15003 <- "https://www.zip-codes.com/zip-code/15003/zip-code-15003.asp" thisURL <- zip15003 # Function to read one ZIP code page read_zip_url <- function(thisURL) { this_zip <- thisURL |> read_html() |> html_nodes("table") |> html_table(fill = TRUE) |> keep(~ ncol(.x) == 2) |> bind_rows() zip_row <- data.frame(t(this_zip |> select(2))) names(zip_row) <- make.names(t(this_zip |> select(1)), unique = TRUE) return(zip_row) } ``` --- # One more example: Zipcodes (SAS MP) ``` r # Map over a vector of ZIP codes, returning one row of data per code read_zip_vector <- function(zip_vector) { map_dfr(zip_vector, function(z) { # z is the current ZIP code url <- paste0("https://www.zip-codes.com/zip-code/", z, "/zip-code-", z, ".asp") df <- read_zip_url(url) # scrape the page, then tag the row with its ZIP code df <- df |> mutate(ZipCode = z) return(df) }) } # Example usage zip_vector <- c("75116", "60201", "91101") zip_df <- read_zip_vector(zip_vector) ``` --- ## SelectorGadget .pull-left-narrow[ - Open-source tool that eases CSS selector generation and discovery - Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) - Find out more in the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html) ] .pull-right-wide[ <img src="img/selector-gadget/selector-gadget.png" width="75%" style="display: block; margin: auto;" /> ] --- ## Using the
SelectorGadget <img src="img/selector-gadget/selector-gadget.gif" width="80%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-1.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-2.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-3.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-4.png" width="95%" style="display: block; margin: auto;" /> --- <img src="img/selector-gadget/selector-gadget-5.png" width="95%" style="display: block; margin: auto;" /> --- ## Using the SelectorGadget Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs <img src="img/selector-gadget/selector-gadget.gif" width="65%" style="display: block; margin: auto;" /> --- # Your Turn Scrape the 100m Olympic record progression tables and create the following plot https://en.wikipedia.org/wiki/100_metres_at_the_Olympics <img src="./img/olympics.png" width="60%" style="display: block; margin: auto;" /> --- <!-- --- --> <!-- # PSSA — ELA Scores --> <!-- ```{r} --> <!-- pssa_2019_file |> --> <!-- read_html() |> --> <!-- html_nodes("table") |> --> <!-- pluck(1) |> --> <!-- html_table() -> pssa_ela --> <!-- ``` --> <!-- --- --> <!-- # PSSA — Combine All Subjects --> <!-- (Example of repeated scraping for ELA, math, science.) 
--> <!-- --- --> <!-- # Graphing PSSA Results --> <!-- ```{r, eval = FALSE} --> <!-- pssa |> --> <!-- ggplot(aes(grade_num, ap_num, color = subject)) + --> <!-- geom_point() + --> <!-- geom_line(size = 1) + --> <!-- labs(title = "2019 PSSA", --> <!-- y = "% Proficient/Advanced", --> <!-- x = "Grade") --> <!-- ``` --> <!-- --- --> <!-- # Baseball example --> <!-- ```{r} --> <!-- statcast_url <- "https://baseballsavant.mlb.com/savant-player/justin-verlander-434378?stats=statcast-r-pitching-mlb" --> <!-- statcast_url %>% --> <!-- read_html() %>% --> <!-- html_node("#statcast_stats_pitching") |> --> <!-- html_table() |> --> <!-- head() --> <!-- ``` --> <!-- --- -->
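---

# Appendix: Record the Date, Save a Copy

A minimal sketch of the "record the date/time and save a local copy" advice from the *Is This Good Coding?* slide; the file names here are placeholders:

``` r
library(readr)

collmaj538_file <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
local_copy <- "collmaj538.csv"

# Only hit the network when no local copy exists,
# and record when the data were retrieved
if (!file.exists(local_copy)) {
  download.file(collmaj538_file, local_copy)
  writeLines(format(Sys.time(), tz = "UTC"), "collmaj538_retrieved.txt")
}

collmaj <- read_csv(local_copy)
```

Later runs read the cached file, so the analysis doesn't change if FiveThirtyEight's repository does.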