AE 05: Wrangling flights

Application exercise
Important

This AE is due Friday, Sep 15 at 11:59pm.

To demonstrate data wrangling we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.

Note: As we go through the AE, practicing thinking in steps, and reading your code as sentences

The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data. The number of columns in the data set is 19

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here for the dplyr cheatsheet.

Quick summary of key dplyr functions1:

Rows:

  • filter():chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows
  • sample_n(): take a random subset of the rows

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarise(): collapses a group into a single row.
  • count(): count unique values of one or more variables.
  • group_by(): perform calculations separately for each value of a variable

Pick Up Here

  • Your turn (5 minutes): Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?
flights |>
  select(tailnum , carrier , dep_delay) |>
  arrange(desc(dep_delay)) |>
  slice(1)
# A tibble: 1 × 3
  tailnum carrier dep_delay
  <chr>   <chr>       <dbl>
1 N384HA  HA           1301

filter()

  • Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |> # I have my data and then
  filter(dest == "RDU") #subset for values that equal RDU
# A tibble: 8,163 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1240           1235         5     1415           1415
 9  2013     1     1     1317           1325        -8     1454           1505
10  2013     1     1     1449           1450        -1     1651           1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
#note: == represents equality

Now, run the following code with one equals sign instead of two. Does it still work?

Explanation:

(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.

  • Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0. As we’ve learned, conditions within functions are separated by a ,.
flights |>
  filter(dest == "RDU" , arr_delay < 0) |>
  select(dest, arr_delay)
# A tibble: 4,232 × 2
   dest  arr_delay
   <chr>     <dbl>
 1 RDU          -6
 2 RDU         -24
 3 RDU          -4
 4 RDU         -16
 5 RDU         -11
 6 RDU          -6
 7 RDU          -3
 8 RDU         -11
 9 RDU          -1
10 RDU          -6
# ℹ 4,222 more rows

We can do more complex tasks using logical operators:

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?
x & y is x AND y?
x | y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

  • Your turn (4 minutes): Describe what the code is doing in words.
flights |> #and then
  filter(dest %in% "RDU", #matches between dest and RDU (similar to == when only have one condition)
         arr_delay < 0 | dep_delay < 0 # OR
         ) |>
  select(dest, arr_delay, dep_delay) #select 3 columns
# A tibble: 5,308 × 3
   dest  arr_delay dep_delay
   <chr>     <dbl>     <dbl>
 1 RDU          -6       -10
 2 RDU         -24        -8
 3 RDU          -4         0
 4 RDU         -16        -3
 5 RDU         -11        -6
 6 RDU          -6        -2
 7 RDU          -3        -8
 8 RDU         -11        -8
 9 RDU          11        -1
10 RDU          -1        -5
# ℹ 5,298 more rows
  • What if we want to like at destinations of RDU and GSO? How does the below code change?

We now have a list of more than one character string to match for.

flights |>
  filter(dest %in% c("RDU","GSO"),
         arr_delay < 0 | dep_delay < 0
         ) 
# A tibble: 6,203 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

This data set has 6,203 rows.

Why c?

c stands for combine and we use this with a list larger than 1

What happens if we change %in% to == ?

My new data set is 3,146 rows. It looks for equality of RDU in row 1; GSO in row 2; RDU in row 3 and throws out anything else

count()

  • Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
  count(dest)
# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# ℹ 95 more rows
  • Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type ?min into the console.
flights |>
  count(month) |>
  filter(n == min(n))
# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951
  • On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.

There were 1014 flights on this day.

flights |> 
  count(month,day) |> #make a freq table based on the combination of month and dat
  filter(n == max(n)) #take the max value from the column n
# A tibble: 1 × 3
  month   day     n
  <int> <int> <int>
1    11    27  1014

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight. Comment each line of code below.
flights |> # and then I want to create new variables
  mutate(hours = air_time / 60, #var name = conditions
         mph = distance / hours) |>
select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
   air_time distance hours   mph
      <dbl>    <dbl> <dbl> <dbl>
 1      227     1400 3.78   370.
 2      227     1416 3.78   374.
 3      160     1089 2.67   408.
 4      183     1576 3.05   517.
 5      116      762 1.93   394.
 6      150      719 2.5    288.
 7      158     1065 2.63   404.
 8       53      229 0.883  259.
 9      140      944 2.33   405.
10      138      733 2.3    319.
# ℹ 336,766 more rows
  • Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
flights |>
  count(month) |>
  mutate(perc = n / sum(n) * 100 )
# A tibble: 12 × 3
   month     n  perc
   <int> <int> <dbl>
 1     1 27004  8.02
 2     2 24951  7.41
 3     3 28834  8.56
 4     4 28330  8.41
 5     5 28796  8.55
 6     6 28243  8.39
 7     7 29425  8.74
 8     8 29327  8.71
 9     9 27574  8.19
10    10 28889  8.58
11    11 27268  8.10
12    12 28135  8.35

changing variable type

Suppose we wanted to make year a factor…

flights |> 
  mutate(year = as.factor(year)) |>
  select(year)
# A tibble: 336,776 × 1
   year 
   <fct>
 1 2013 
 2 2013 
 3 2013 
 4 2013 
 5 2013 
 6 2013 
 7 2013 
 8 2013 
 9 2013 
10 2013 
# ℹ 336,766 more rows

If you want to go back from a fct to int/dbl you can use as.numeric. But be carful! If R can’t figure out how to turn something into a number, it will auto generate NAs.

summarize()

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

flights |> 
  summarize(mean_dep_delay = mean(dep_delay))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1             NA

Question: Why did this code return NA?

There are NA’s in the dep_delay variable

Let’s fix it! We can use na.rm to remove NAs.

flights |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = T))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1           12.6

group_by()

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Here we find the mean and standard deviation of departure delay for each month. Comment each line of code below.

flights |> 
  group_by(month) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm=T),
            sd_dep_delay = sd(dep_delay, na.rm=T)
            )
# A tibble: 12 × 3
   month mean_dep_delay sd_dep_delay
   <int>          <dbl>        <dbl>
 1     1          10.0          36.4
 2     2          10.8          36.3
 3     3          13.2          40.1
 4     4          13.9          43.0
 5     5          13.0          39.4
 6     6          20.8          51.5
 7     7          21.7          51.6
 8     8          12.6          37.7
 9     9           6.72         35.6
10    10           6.24         29.7
11    11           5.44         27.6
12    12          16.6          41.9
  • Your turn (4 minutes): What is the median departure delay for each airports around NYC (origin)? Which airport has the shortest median departure delay?
flights |>
  group_by(origin) |>
  summarize(med_dep_delay = median(dep_delay, na.rm = T))
# A tibble: 3 × 2
  origin med_dep_delay
  <chr>          <dbl>
1 EWR               -1
2 JFK               -1
3 LGA               -3

Optional

  1. Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay in hours). Hint: Note you may need to use mutate() to make one or more of these variables.
new.data <- flights |>
  filter(!is.na(dep_time)) |>
  select(dep_time, dep_delay) |>
  mutate(dep_delay_hours = dep_delay / 60)
  1. For each airplane (uniquely identified by tailnum), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
flights |>
  group_by(tailnum) |>
  summarize(n = n(),
            mean = mean(distance),
            sd = sd(distance)) |>
  arrange(desc(mean)) |>
  slice(c(1:5, (n()-4):n()))
# A tibble: 10 × 4
   tailnum     n  mean    sd
   <chr>   <int> <dbl> <dbl>
 1 N380HA     40 4983    0  
 2 N381HA     25 4983    0  
 3 N382HA     26 4983    0  
 4 N383HA     26 4983    0  
 5 N384HA     33 4983    0  
 6 N945UW    285  176.  31.2
 7 N956UW    222  174.  31.4
 8 N959UW    213  174.  34.3
 9 N948UW    232  174.  32.7
10 N955UW    225  173.  32.9

Footnotes

  1. From dplyr vignette↩︎