library(nycflights13)
library(tidyverse)NYC Flights - data wrangling II
We will again work with data from the nycflights13 package.
Exercise 1
Examine the documentation for the datasets airports, flights, and planes. What are the dimensions of each? How are these datasets related?
There are 1458 rows and 8 columns in the
airportsdataset.
There are 336776 rows and 19 columns in the
flightsdataset.
There are 3322 rows and 9 columns in the
planesdataset.
EXAMPLE
Suppose you wanted to make a map of the route of every flight. What variables would you need from which datasets? You need the geographic location of the airports (from airports) and the path of flights (i.e., which airports were involved) from flights.
We want to join flights to airports. Note these two datasets have no variables in common so we will have to specify the variable to join by using by =. Check out the documentation for more information.
flights |>
left_join(airports, by = c("dest" = "faa"))# A tibble: 336,776 × 26
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 18 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>,
# lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
Exercise 2
Which airports are in flights but not in airports? Google to find out what these airports are.
flights |>
count(dest) |>
anti_join(airports, by = c("dest" = "faa"))# A tibble: 4 × 2
dest n
<chr> <int>
1 BQN 896
2 PSE 365
3 SJU 5819
4 STT 522
#alternatively
flights |>
distinct(dest) |>
anti_join(airports, by = c("dest" = "faa"))# A tibble: 4 × 1
dest
<chr>
1 BQN
2 SJU
3 STT
4 PSE
Puerto Rico & Virgin Islands
Which airports are in airports but not in flights? What does this tell us about these airports (at least in 2013)?
airports |>
anti_join(flights, by = c("faa" = "dest"))# A tibble: 1,357 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,347 more rows
There were no flights that went from NYC to these locations - likely all small airports
Exercise 3
Starting with the flights dataset, create a new dataset dest_delays with the median arrival delay for each destination. Note, this question does not require you to use joins. Make sure to add na.rm = TRUE when computing the median. Check: dest_delays should have dimensions 105 x 2.
dest_delays <- flights |>
group_by(dest) |>
summarize(med_arr_delay = median(arr_delay,
na.rm = TRUE))Exercise 4
Join the columns in airports to dest_delays (preserving all rows in dest_delays. Check: delays_by_airport should have dimensions 105 x 9.
delays_by_airport <- dest_delays |>
left_join(airports, by = c("dest" = "faa"))Based on your answer to Exercise 2, how many rows in delays_by_airport do you expect to be missing latitude and longitude information?
4, due to the Puero Rico & Virgin Islands not being in
airportsdataset.
Exercise 5
Is there a relationship between the age of a plane and its delays? The plane tail number is given in the tailnum variable in the flights dataset. The year the plane was manufactured is given in the year variable in the planes dataset.
Start by finding the median arrival delay for each plane and store the resulting dataset in plane_delays. Check: plane_delays should have dimensions 4044 x 2
plane_delays <- flights |>
group_by(tailnum) |>
summarize(med_arr_delay = median(arr_delay,
na.rm = TRUE))Join plane_delays to the planes data using an appropriate join and then use mutate to create an age variable. Note this data is from 2013.
planes <- planes |>
left_join(plane_delays) |>
mutate(age = 2013 - year)Joining with `by = join_by(tailnum)`
Finally, create an effective visualization of the data to investigate if there a relationship between the age of a plane and its delays. Comment on your conclusions.
ggplot(planes, aes(x = age, y = med_arr_delay)) +
geom_jitter() + #same as geom_point but adds a small amount of random noise to points - helps when there are a lot of overlapping points
geom_smooth()`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 76 rows containing missing values or values outside the scale range
(`geom_point()`).
ggplot(planes, aes(x = age, y = med_arr_delay)) +
geom_hex() + #another alternative that helps see "density" of overlapping points
geom_smooth()Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_binhex()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_smooth()`).
(FOR LATER) Extra mapping fun
Try re-creating the visualization below and/or exploring other ways to visualized the flights data spatially. Note, to view the image in your html, download the “flights_map.png” file and place it in the images subfolder of your STAT_7500 folder.

Some starter code is provided below for creating a basic US map.
library(maps)
library(mapdata)
state <- map_data("state")
ggplot() +
geom_polygon(data = state,
aes(x = long, y = lat, group = group),
fill = 'lightblue', color = "white") +
theme_void() +
coord_fixed(1.3)Adding on the flight trajectories will require some creative data wrangling. You can use a geom_curve() layer to create the flight curves; the above plot has curvature = 0.2. Note, you can use different datasets and aesthetic mappings in different geom layers (see geom_polygon() layer above).
dest_counts <- flights |>
count(dest) |>
left_join(airports,
by = c("dest" = "faa")) |>
select(dest, n, lat, lon) |>
filter(!(dest %in% c("HNL", "ANC")))
path_counts <- flights |>
count(dest, origin) |>
left_join(airports, by = c("dest" = "faa")) |>
select(dest, origin, n, lat_dest = lat,
lon_dest = lon) |>
left_join(airports |> select(faa, lat_origin = lat,
lon_origin = lon),
by = c("origin" = "faa")) |>
filter(!(dest %in% c("HNL", "ANC")))
ggplot() +
geom_polygon(data = state,
aes(x = long, y = lat, group = group),
fill = 'lightblue', color = "white") +
theme_void() +
coord_fixed(1.3) +
geom_point(data = dest_counts,
aes(x = lon, y = lat,
size = n)) +
geom_curve(data = path_counts, aes(x = lon_origin,
xend = lon_dest,
y = lat_origin,
yend = lat_dest,
alpha = n),
curvature = 0.2) +
theme(legend.position = "bottom") +
labs(title = "Flights from NYC to contiguous 48 states",
subtitle = "from nycflights data",
size = "Number of flights to destination") +
guides(alpha = "none")Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_curve()`).