NYC Flights - data wrangling II

Author

SOLUTIONS

Published

October 30, 2025

We will again work with data from the nycflights13 package.

library(nycflights13)
library(tidyverse)

Exercise 1

Examine the documentation for the datasets airports, flights, and planes. What are the dimensions of each? How are these datasets related?

There are 1458 rows and 8 columns in the airports dataset.

There are 336776 rows and 19 columns in the flights dataset.

There are 3322 rows and 9 columns in the planes dataset.

EXAMPLE

Suppose you wanted to make a map of the route of every flight. What variables would you need from which datasets? You need the geographic location of the airports (from airports) and the path of flights (i.e., which airports were involved) from flights.

We want to join flights to airports. Note these two datasets have no variables in common so we will have to specify the variable to join by using by =. Check out the documentation for more information.

flights |> 
  left_join(airports, by = c("dest" = "faa"))
# A tibble: 336,776 × 26
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 18 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>,
#   lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

Exercise 2

Which airports are in flights but not in airports? Google to find out what these airports are.

flights |> 
  count(dest) |> 
  anti_join(airports, by = c("dest" = "faa"))
# A tibble: 4 × 2
  dest      n
  <chr> <int>
1 BQN     896
2 PSE     365
3 SJU    5819
4 STT     522
#alternatively
flights |> 
  distinct(dest) |> 
  anti_join(airports, by = c("dest" = "faa"))
# A tibble: 4 × 1
  dest 
  <chr>
1 BQN  
2 SJU  
3 STT  
4 PSE  

Puerto Rico & Virgin Islands

Which airports are in airports but not in flights? What does this tell us about these airports (at least in 2013)?

airports |> 
  anti_join(flights, by = c("faa" = "dest"))
# A tibble: 1,357 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,347 more rows

There were no flights that went from NYC to these locations - likely all small airports

Exercise 3

Starting with the flights dataset, create a new dataset dest_delays with the median arrival delay for each destination. Note, this question does not require you to use joins. Make sure to add na.rm = TRUE when computing the median. Check: dest_delays should have dimensions 105 x 2.

dest_delays <- flights |> 
  group_by(dest) |> 
  summarize(med_arr_delay = median(arr_delay,
                                   na.rm = TRUE))

Exercise 4

Join the columns in airports to dest_delays (preserving all rows in dest_delays. Check: delays_by_airport should have dimensions 105 x 9.

delays_by_airport <- dest_delays |> 
  left_join(airports, by = c("dest" = "faa"))

Based on your answer to Exercise 2, how many rows in delays_by_airport do you expect to be missing latitude and longitude information?

4, due to the Puero Rico & Virgin Islands not being in airports dataset.

Exercise 5

Is there a relationship between the age of a plane and its delays? The plane tail number is given in the tailnum variable in the flights dataset. The year the plane was manufactured is given in the year variable in the planes dataset.

Start by finding the median arrival delay for each plane and store the resulting dataset in plane_delays. Check: plane_delays should have dimensions 4044 x 2

plane_delays <- flights |> 
  group_by(tailnum) |> 
  summarize(med_arr_delay = median(arr_delay,
                                   na.rm = TRUE))

Join plane_delays to the planes data using an appropriate join and then use mutate to create an age variable. Note this data is from 2013.

planes <- planes |> 
  left_join(plane_delays) |> 
  mutate(age = 2013 - year)
Joining with `by = join_by(tailnum)`

Finally, create an effective visualization of the data to investigate if there a relationship between the age of a plane and its delays. Comment on your conclusions.

ggplot(planes, aes(x = age, y = med_arr_delay)) +
  geom_jitter() + #same as geom_point but adds a small amount of random noise to points - helps when there are a lot of overlapping points
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 76 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(planes, aes(x = age, y = med_arr_delay)) +
  geom_hex() + #another alternative that helps see "density" of overlapping points
  geom_smooth()
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_binhex()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_smooth()`).

(FOR LATER) Extra mapping fun

Try re-creating the visualization below and/or exploring other ways to visualized the flights data spatially. Note, to view the image in your html, download the “flights_map.png” file and place it in the images subfolder of your STAT_7500 folder.

Some starter code is provided below for creating a basic US map.

library(maps)
library(mapdata)

state <- map_data("state")

ggplot() + 
  geom_polygon(data = state, 
               aes(x = long, y = lat, group = group),
               fill = 'lightblue', color = "white") +
  theme_void() +
  coord_fixed(1.3)

Adding on the flight trajectories will require some creative data wrangling. You can use a geom_curve() layer to create the flight curves; the above plot has curvature = 0.2. Note, you can use different datasets and aesthetic mappings in different geom layers (see geom_polygon() layer above).

dest_counts <- flights |> 
  count(dest) |> 
  left_join(airports, 
            by = c("dest" = "faa")) |> 
  select(dest, n, lat, lon) |> 
  filter(!(dest %in% c("HNL", "ANC")))

path_counts <- flights |> 
  count(dest, origin) |> 
  left_join(airports, by = c("dest" = "faa")) |> 
  select(dest, origin, n, lat_dest = lat, 
         lon_dest = lon)  |> 
  left_join(airports |> select(faa, lat_origin = lat, 
                               lon_origin = lon), 
            by = c("origin" = "faa"))  |> 
  filter(!(dest %in% c("HNL", "ANC")))
  

ggplot() + 
  geom_polygon(data = state, 
               aes(x = long, y = lat, group = group),
               fill = 'lightblue', color = "white") +
  theme_void() +
  coord_fixed(1.3) +
  geom_point(data = dest_counts, 
             aes(x = lon, y = lat, 
                 size = n)) +
  geom_curve(data = path_counts, aes(x = lon_origin,
                                     xend = lon_dest,
                                     y = lat_origin,
                                     yend = lat_dest,
                                     alpha = n),
             curvature = 0.2) +
  theme(legend.position = "bottom") +
  labs(title = "Flights from NYC to contiguous 48 states",
       subtitle = "from nycflights data",
       size = "Number of flights to destination") +
  guides(alpha = "none")
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_curve()`).