AE 05: Wrangling flights
This AE is due Friday, Sep 15 at 11:59pm.
To demonstrate data wrangling we will use flights
, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
Note: As we go through the AE, practicing thinking in steps, and reading your code as sentences
The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data. The number of columns in the data set is 19
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here for the dplyr cheatsheet.
Quick summary of key dplyr functions1:
Rows:
-
filter()
:chooses rows based on column values. -
slice()
: chooses rows based on location. -
arrange()
: changes the order of the rows -
sample_n()
: take a random subset of the rows
Columns:
-
select()
: changes whether or not a column is included. -
rename()
: changes the name of columns. -
mutate()
: changes the values of columns and creates new columns.
Groups of rows:
-
summarise()
: collapses a group into a single row. -
count()
: count unique values of one or more variables. -
group_by()
: perform calculations separately for each value of a variable
Pick Up Here
-
Your turn (5 minutes): Create a data frame that only includes the plane tail number (
tailnum
), carrier (carrier
), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum
) for this flight?
filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |> # I have my data and then
filter(dest == "RDU") #subset for values that equal RDU
# A tibble: 8,163 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1240 1235 5 1415 1415
9 2013 1 1 1317 1325 -8 1454 1505
10 2013 1 1 1449 1450 -1 1651 1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
#note: == represents equality
Now, run the following code with one equals sign instead of two. Does it still work?
Explanation:
(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.
- Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0. As we’ve learned, conditions within functions are separated by a
,
.
# A tibble: 4,232 × 2
dest arr_delay
<chr> <dbl>
1 RDU -6
2 RDU -24
3 RDU -4
4 RDU -16
5 RDU -11
6 RDU -6
7 RDU -3
8 RDU -11
9 RDU -1
10 RDU -6
# ℹ 4,222 more rows
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x
is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words.
flights |> #and then
filter(dest %in% "RDU", #matches between dest and RDU (similar to == when only have one condition)
arr_delay < 0 | dep_delay < 0 # OR
) |>
select(dest, arr_delay, dep_delay) #select 3 columns
# A tibble: 5,308 × 3
dest arr_delay dep_delay
<chr> <dbl> <dbl>
1 RDU -6 -10
2 RDU -24 -8
3 RDU -4 0
4 RDU -16 -3
5 RDU -11 -6
6 RDU -6 -2
7 RDU -3 -8
8 RDU -11 -8
9 RDU 11 -1
10 RDU -1 -5
# ℹ 5,298 more rows
What if we want to like at destinations of RDU and GSO? How does the below code change?
We now have a list of more than one character string to match for.
# A tibble: 6,203 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1317 1325 -8 1454 1505
9 2013 1 1 1449 1450 -1 1651 1640
10 2013 1 1 1505 1510 -5 1654 1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
This data set has 6,203 rows.
Why c
?
c stands for combine and we use this with a list larger than 1
What happens if we change %in% to == ?
My new data set is 3,146 rows. It looks for equality of RDU in row 1; GSO in row 2; RDU in row 3 and throws out anything else
count()
- Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
count(dest)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# ℹ 95 more rows
- Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type
?min
into the console.
- On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.
There were 1014 flights on this day.
mutate()
Use mutate()
to create a new variable.
- Demo: In the code chunk below,
air_time
(minutes in the air) is converted to hours, and then new variablemph
is created, corresponding to the miles per hour of the flight. Comment each line of code below.
flights |> # and then I want to create new variables
mutate(hours = air_time / 60, #var name = conditions
mph = distance / hours) |>
select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
air_time distance hours mph
<dbl> <dbl> <dbl> <dbl>
1 227 1400 3.78 370.
2 227 1416 3.78 374.
3 160 1089 2.67 408.
4 183 1576 3.05 517.
5 116 762 1.93 394.
6 150 719 2.5 288.
7 158 1065 2.63 404.
8 53 229 0.883 259.
9 140 944 2.33 405.
10 138 733 2.3 319.
# ℹ 336,766 more rows
- Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
changing variable type
Suppose we wanted to make year a factor…
# A tibble: 336,776 × 1
year
<fct>
1 2013
2 2013
3 2013
4 2013
5 2013
6 2013
7 2013
8 2013
9 2013
10 2013
# ℹ 336,766 more rows
If you want to go back from a fct to int/dbl you can use as.numeric. But be carful! If R can’t figure out how to turn something into a number, it will auto generate NAs
.
summarize()
summarize()
collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
Question: Why did this code return NA
?
There are NA’s in the dep_delay variable
Let’s fix it! We can use na.rm
to remove NA
s.
group_by()
group_by()
is used for grouped operations. It’s very powerful when paired with summarise()
to calculate summary statistics by group.
Here we find the mean and standard deviation of departure delay for each month. Comment each line of code below.
flights |>
group_by(month) |>
summarize(mean_dep_delay = mean(dep_delay, na.rm=T),
sd_dep_delay = sd(dep_delay, na.rm=T)
)
# A tibble: 12 × 3
month mean_dep_delay sd_dep_delay
<int> <dbl> <dbl>
1 1 10.0 36.4
2 2 10.8 36.3
3 3 13.2 40.1
4 4 13.9 43.0
5 5 13.0 39.4
6 6 20.8 51.5
7 7 21.7 51.6
8 8 12.6 37.7
9 9 6.72 35.6
10 10 6.24 29.7
11 11 5.44 27.6
12 12 16.6 41.9
-
Your turn (4 minutes): What is the median departure delay for each airports around NYC (
origin
)? Which airport has the shortest median departure delay?
Optional
- Create a new dataset that only contains flights that do not have a missing departure time. Include the columns
year
,month
,day
,dep_time
,dep_delay
, anddep_delay_hours
(the departure delay in hours). Hint: Note you may need to usemutate()
to make one or more of these variables.
- For each airplane (uniquely identified by
tailnum
), use agroup_by()
paired withsummarize()
to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
flights |>
group_by(tailnum) |>
summarize(n = n(),
mean = mean(distance),
sd = sd(distance)) |>
arrange(desc(mean)) |>
slice(c(1:5, (n()-4):n()))
# A tibble: 10 × 4
tailnum n mean sd
<chr> <int> <dbl> <dbl>
1 N380HA 40 4983 0
2 N381HA 25 4983 0
3 N382HA 26 4983 0
4 N383HA 26 4983 0
5 N384HA 33 4983 0
6 N945UW 285 176. 31.2
7 N956UW 222 174. 31.4
8 N959UW 213 174. 34.3
9 N948UW 232 174. 32.7
10 N955UW 225 173. 32.9
Footnotes
From dplyr vignette↩︎