Wrangling flights
To demonstrate data wrangling we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
Note: As we go through the AE, practicing thinking in steps, and reading your code as sentences
Now, let’s take a glimpse of the data set! Also, pull up the help file for the flights dataset to see the data index.
glimpse(flights)Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
As a reminder, we can use the names() function to get a quick reminder on the names of the variables in our data set. We typically do not include this type of code in our professional document. However, we are this time so you can reference this function!
names(flights) [1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
We can also get a more traditional view of the data set using functions like head()and tail(). We can also View() our data set, but will NOT include View() in our rendered document. Let’s show why.
head(flights) # top few (6) rows# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
tail(flights) # last few (6) rows# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1842 NA NA 2019
2 2013 9 30 NA 1455 NA NA 1634
3 2013 9 30 NA 2200 NA NA 2312
4 2013 9 30 NA 1210 NA NA 1330
5 2013 9 30 NA 1159 NA NA 1344
6 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click hereto download the dplyr cheat sheet.
Quick summary of key dplyr functions1:
Rows:
-
filter():chooses rows based on column values. -
slice(): chooses rows based on location. -
arrange(): changes the order of the rows -
sample_n(): take a random subset of the rows
Columns:
-
select(): changes whether or not a column is included. -
rename(): changes the name of columns. -
mutate(): changes the values of columns and creates new columns.
Groups of rows:
-
summarise(): collapses a group into a single row. -
count(): count unique values of one or more variables. -
group_by(): perform calculations separately for each value of a variable
Activities
select()
- Demo: Make a data frame that only contains the variables
dep_delayandarr_delay.
flights |>
select(dep_delay, arr_delay)# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# ℹ 336,766 more rows
- Demo: Make a data frame that keeps every variable except
dep_delay. Call the new data framenew.data
new.data <- flights |>
select(-dep_delay)Think about how we used
:before. We can use this idea as a tool to subset data! Instead of numbers, let’s do this with names.Demo: Make a data frame that includes all variables between
yearthroughdep_delay(inclusive). These are all variables that provide information about the departure of each flight.
flights |>
select(year:dep_delay)# A tibble: 336,776 × 6
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# ℹ 336,766 more rows
- Demo: Use
selectandcontains()to make a data frame that includes the variables associated with the arrival, i.e., contains the string"arr_"in the name. Reminder: Thinking about code as sentences can help make nesting functions more intuitive.
Note: There should not be a backslash after arr. Quarto puts a backslash (sometimes) to indicate that the underscore is just text.
# A tibble: 336,776 × 3
arr_time sched_arr_time arr_delay
<int> <int> <dbl>
1 830 819 11
2 850 830 20
3 923 850 33
4 1004 1022 -18
5 812 837 -25
6 740 728 12
7 913 854 19
8 709 723 -14
9 838 846 -8
10 753 745 8
# ℹ 336,766 more rows
- Why is arr_ in quotes?
because it’s a character string! Not a variable name.
slice()
- Demo: Display the first five rows of the
flightsdata frame.
flights |>
slice(1:5)# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: Display the last two rows of the
flightsdata frame. Hint:n()produces the number of the last row in the data set.
Solves this problem with n(). Solve this problem with slice_tail().
# A tibble: 2 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1159 NA NA 1344
2 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
slice_tail(n = 2)# A tibble: 2 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1159 NA NA 1344
2 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
arrange()
- Demo: Arrange the
flightsdata set using thearrange()function, bydep_delay. How are the data ordered?
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 6 27 959 1900 899 1236 2226
9 2013 7 22 2257 759 898 121 1026
10 2013 12 5 756 1700 896 1058 2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top. Hint, run
?descin the console.
filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |>
filter(dest != "RDU",
dep_time > 500)# A tibble: 319,248 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 319,238 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Now, run the following code with one equals sign instead of two. Does it still work?
Explanation:
(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.
We can do more complex tasks using logical operators:
| operator | definition |
|---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words.
# A tibble: 5,308 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1317 1325 -8 1454 1505
9 2013 1 1 1449 1450 -1 1651 1640
10 2013 1 1 1505 1510 -5 1654 1655
# ℹ 5,298 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
What if we want to like at destinations of RDU or GSO? How does the below code change from above?
# A tibble: 6,203 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1317 1325 -8 1454 1505
9 2013 1 1 1449 1450 -1 1651 1640
10 2013 1 1 1505 1510 -5 1654 1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Why c?
What happens if we change %in% to == ?
== is equality between ONLY two elements. It takes the first element and compares it to row 1; the second element and compares it to row 2; etc. NOT what we want.
count()
- Demo: Create a frequency table of the destination (
dest) locations for flights from New York.
flights |>
count(dest)# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# ℹ 95 more rows
- Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type
?mininto the console and use thefilter()function!
On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.
mutate()
Use mutate() to create a new variable.
- Demo: In the code chunk below,
air_time(minutes in the air) is converted to hours, and then new variablemphis created, corresponding to the miles per hour of the flight. Comment each line of code below.
flights |> # my data "and then"
mutate(hours = air_time / 60, # creating hours!
mph = distance / hours, # creating mph
miles = mph,
index = 1:n()) |> # making an index col
select(air_time, distance, hours, mph, miles, index) # selecing cols# A tibble: 336,776 × 6
air_time distance hours mph miles index
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 227 1400 3.78 370. 370. 1
2 227 1416 3.78 374. 374. 2
3 160 1089 2.67 408. 408. 3
4 183 1576 3.05 517. 517. 4
5 116 762 1.93 394. 394. 5
6 150 719 2.5 288. 288. 6
7 158 1065 2.63 404. 404. 7
8 53 229 0.883 259. 259. 8
9 140 944 2.33 405. 405. 9
10 138 733 2.3 319. 319. 10
# ℹ 336,766 more rows
changing variable type
Suppose we wanted to make year a factor…
group_by() + summarise()
summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.
group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.
Find the mean and standard deviation of departure delay for each month. Comment each line of code below.
flights |>
group_by(month) |>
summarise(mean_dp = mean(dep_delay, na.rm = T),
sd = sd(dep_delay, na.rm = T))# A tibble: 12 × 3
month mean_dp sd
<int> <dbl> <dbl>
1 1 10.0 36.4
2 2 10.8 36.3
3 3 13.2 40.1
4 4 13.9 43.0
5 5 13.0 39.4
6 6 20.8 51.5
7 7 21.7 51.6
8 8 12.6 37.7
9 9 6.72 35.6
10 10 6.24 29.7
11 11 5.44 27.6
12 12 16.6 41.9
rename()
rename() changes the names of individual variables using new_name = old_name
Let’s practice using this function be renaming dep_time to departure time.
flights |>
rename("departure time" = dep_time)# A tibble: 336,776 × 19
year month day `departure time` sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Pivots
l <- tibble(
state = rep(c("MT", "NC" , "SC"),2),
group = c(rep("C", 3), rep("D", 3)),
obs = c(1:6)
)
l# A tibble: 6 × 3
state group obs
<chr> <chr> <int>
1 MT C 1
2 NC C 2
3 SC C 3
4 MT D 4
5 NC D 5
6 SC D 6
Pivot these data so that the data are wide. i.e. Each state should be it’s own unique observation (row). Save this new data set as w.
w <- l |>
pivot_wider(
names_from = group,
values_from = obs
)
w# A tibble: 3 × 3
state C D
<chr> <int> <int>
1 MT 1 4
2 NC 2 5
3 SC 3 6
Now, let’s change it back. Introducing pivot_longer. There are three things we need to consider with pivot_longer:
w |>
pivot_longer(
cols = !state,
names_to = "group",
values_to = "obs"
)# A tibble: 6 × 3
state group obs
<chr> <chr> <int>
1 MT C 1
2 MT D 4
3 NC C 2
4 NC D 5
5 SC C 3
6 SC D 6
Joins
Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.
There are many possible types of joins. All have the format something_join(x, y).
x <- tibble(
value = c(1, 2, 3),
xcol = c("x1", "x2", "x3")
)
y <- tibble(
value = c(1, 2, 4),
ycol = c("y1", "y2", "y4")
)
x# A tibble: 3 × 2
value xcol
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
We will demonstrate each of the joins on these small, toy datasets.
Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!
inner_join()
Joins where the values of x and y are the same (1 and 2)
inner_join(x, y)Joining with `by = join_by(value)`
# A tibble: 2 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
left_join()
Keep all the value of x, and attach y
left_join(x , y)Joining with `by = join_by(value)`
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
right_join()
Keep all the value of y, and attach x
right_join(x, y) Joining with `by = join_by(value)`
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y4
full_join()
keep both x and y
full_join(x, y)Joining with `by = join_by(value)`
# A tibble: 4 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y4
semi_join()
only keep x that matches with y
semi_join(x, y)Joining with `by = join_by(value)`
# A tibble: 2 × 2
value xcol
<dbl> <chr>
1 1 x1
2 2 x2
anti_join()
only keep x that DO NOT match with y
Joining by more than one variable
x2 <- tibble(
value = c(1, 2, 3),
value2 = c(4,5,6),
xcol = c("x1", "x2", "x3")
)
y2<- tibble(
value = c(1, 2, 4),
value2 = c(4,5,7),
ycol = c("y1", "y2", "y4")
)
x2# A tibble: 3 × 3
value value2 xcol
<dbl> <dbl> <chr>
1 1 4 x1
2 2 5 x2
3 3 6 x3
y2# A tibble: 3 × 3
value value2 ycol
<dbl> <dbl> <chr>
1 1 4 y1
2 2 5 y2
3 4 7 y4
inner_join()
inner_join(x2, y2, by = c("value"))# A tibble: 2 × 5
value value2.x xcol value2.y ycol
<dbl> <dbl> <chr> <dbl> <chr>
1 1 4 x1 4 y1
2 2 5 x2 5 y2
inner_join(x2, y2 , by = c("value" , "value2"))# A tibble: 2 × 4
value value2 xcol ycol
<dbl> <dbl> <chr> <chr>
1 1 4 x1 y1
2 2 5 x2 y2
left_join()
# A tibble: 3 × 5
value value2.x xcol value2.y ycol
<dbl> <dbl> <chr> <dbl> <chr>
1 1 4 x1 4 y1
2 2 5 x2 5 y2
3 3 6 x3 NA <NA>
# A tibble: 3 × 4
value value2 xcol ycol
<dbl> <dbl> <chr> <chr>
1 1 4 x1 y1
2 2 5 x2 y2
3 3 6 x3 <NA>
Footnotes
From dplyr vignette↩︎
