Wrangling flights

Author

solutions

To demonstrate data wrangling we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.

Note: As we go through the AE, practicing thinking in steps, and reading your code as sentences

library(tidyverse)
library(nycflights13)

Now, let’s take a glimpse of the data set! Also, pull up the help file for the flights dataset to see the data index.

glimpse(flights)

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

As a reminder, we can use the names() function to get a quick reminder on the names of the variables in our data set. We typically do not include this type of code in our professional document. However, we are this time so you can reference this function!

names(flights)

 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"

We can also get a more traditional view of the data set using functions like head()and tail(). We can also View() our data set, but will NOT include View() in our rendered document. Let’s show why.

head(flights) # top few (6) rows

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

tail(flights) # last few (6) rows

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1842        NA       NA           2019
2  2013     9    30       NA           1455        NA       NA           1634
3  2013     9    30       NA           2200        NA       NA           2312
4  2013     9    30       NA           1210        NA       NA           1330
5  2013     9    30       NA           1159        NA       NA           1344
6  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Data wrangling with `dplyr`

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click hereto download the dplyr cheat sheet.

Quick summary of key dplyr functions¹:

Rows:

filter():chooses rows based on column values.
slice(): chooses rows based on location.
arrange(): changes the order of the rows
sample_n(): take a random subset of the rows

Columns:

select(): changes whether or not a column is included.
rename(): changes the name of columns.
mutate(): changes the values of columns and creates new columns.

Groups of rows:

summarise(): collapses a group into a single row.
count(): count unique values of one or more variables.
group_by(): perform calculations separately for each value of a variable

Activities

`select()`

Demo: Make a data frame that only contains the variables dep_delay and arr_delay.

flights |>
  select(dep_delay, arr_delay)

# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# ℹ 336,766 more rows

Demo: Make a data frame that keeps every variable except dep_delay. Call the new data frame new.data

new.data <- flights |>
  select(-dep_delay)

Think about how we used : before. We can use this idea as a tool to subset data! Instead of numbers, let’s do this with names.
Demo: Make a data frame that includes all variables between year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.

flights |>
  select(year:dep_delay)

# A tibble: 336,776 × 6
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# ℹ 336,766 more rows

Demo: Use select and contains() to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr_" in the name. Reminder: Thinking about code as sentences can help make nesting functions more intuitive.

Note: There should not be a backslash after arr. Quarto puts a backslash (sometimes) to indicate that the underscore is just text.

flights |>
  select(contains("arr_"))

# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# ℹ 336,766 more rows

Why is arr_ in quotes?

because it’s a character string! Not a variable name.

`slice()`

Demo: Display the first five rows of the flights data frame.

flights |>
  slice(1:5)

# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Demo: Display the last two rows of the flights data frame. Hint: n() produces the number of the last row in the data set.

Solves this problem with n(). Solve this problem with slice_tail().

flights |>
  slice((n()-1):n())

# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1159        NA       NA           1344
2  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |>
  slice_tail(n = 2)

# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1159        NA       NA           1344
2  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

`arrange()`

Demo: Arrange the flights data set using the arrange() function, by dep_delay. How are the data ordered?

flights |>
  arrange(desc(dep_delay))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top. Hint, run ?desc in the console.

flights |>
  arrange(desc(dep_delay)) |>
  select(dep_delay, year)

# A tibble: 336,776 × 2
   dep_delay  year
       <dbl> <int>
 1      1301  2013
 2      1137  2013
 3      1126  2013
 4      1014  2013
 5      1005  2013
 6       960  2013
 7       911  2013
 8       899  2013
 9       898  2013
10       896  2013
# ℹ 336,766 more rows

`filter()`

Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.

flights |> 
  filter(dest != "RDU",
         dep_time > 500)

# A tibble: 319,248 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 319,238 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Now, run the following code with one equals sign instead of two. Does it still work?

Explanation:

(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.

We can do more complex tasks using logical operators:

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x \| y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

Your turn (4 minutes): Describe what the code is doing in words.

flights |>
  filter(dest %in% "RDU",
         arr_delay < 0 | dep_delay < 0
         )

# A tibble: 5,308 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 5,298 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

What if we want to like at destinations of RDU or GSO? How does the below code change from above?

flights |>
  filter(dest %in% c("RDU", "GSO"),
         arr_delay < 0 | dep_delay < 0
         )

# A tibble: 6,203 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      800            810       -10      949            955
 2  2013     1     1      832            840        -8     1006           1030
 3  2013     1     1      851            851         0     1032           1036
 4  2013     1     1      917            920        -3     1052           1108
 5  2013     1     1     1024           1030        -6     1204           1215
 6  2013     1     1     1127           1129        -2     1303           1309
 7  2013     1     1     1157           1205        -8     1342           1345
 8  2013     1     1     1317           1325        -8     1454           1505
 9  2013     1     1     1449           1450        -1     1651           1640
10  2013     1     1     1505           1510        -5     1654           1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Why c?

What happens if we change %in% to == ?

== is equality between ONLY two elements. It takes the first element and compares it to row 1; the second element and compares it to row 2; etc. NOT what we want.

`count()`

Demo: Create a frequency table of the destination (dest) locations for flights from New York.

flights |>
  count(dest)

# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# ℹ 95 more rows

Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type ?min into the console and use the filter() function!

flights |>
  count(month) |>
  filter(n == min(n))

# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951

On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.

flights |> 
  count(month,day) |> # contingency table between month and day
  filter(n == max(n)) # grab the row with the max n

# A tibble: 1 × 3
  month   day     n
  <int> <int> <int>
1    11    27  1014

`mutate()`

Use mutate() to create a new variable.

Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight. Comment each line of code below.

flights |> # my data "and then"
  mutate(hours = air_time / 60, # creating hours!
         mph = distance / hours, # creating mph
         miles = mph,
         index = 1:n()) |> # making an index col
  select(air_time, distance, hours, mph, miles, index) # selecing cols

# A tibble: 336,776 × 6
   air_time distance hours   mph miles index
      <dbl>    <dbl> <dbl> <dbl> <dbl> <int>
 1      227     1400 3.78   370.  370.     1
 2      227     1416 3.78   374.  374.     2
 3      160     1089 2.67   408.  408.     3
 4      183     1576 3.05   517.  517.     4
 5      116      762 1.93   394.  394.     5
 6      150      719 2.5    288.  288.     6
 7      158     1065 2.63   404.  404.     7
 8       53      229 0.883  259.  259.     8
 9      140      944 2.33   405.  405.     9
10      138      733 2.3    319.  319.    10
# ℹ 336,766 more rows

changing variable type

Suppose we wanted to make year a factor…

flights |>
  mutate(year = as.factor(year)) |>
  select(year)

# A tibble: 336,776 × 1
   year 
   <fct>
 1 2013 
 2 2013 
 3 2013 
 4 2013 
 5 2013 
 6 2013 
 7 2013 
 8 2013 
 9 2013 
10 2013 
# ℹ 336,766 more rows

`group_by()` + `summarise()`

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Find the mean and standard deviation of departure delay for each month. Comment each line of code below.

flights |>
  group_by(month) |>
  summarise(mean_dp = mean(dep_delay, na.rm = T),
            sd = sd(dep_delay, na.rm = T))

# A tibble: 12 × 3
   month mean_dp    sd
   <int>   <dbl> <dbl>
 1     1   10.0   36.4
 2     2   10.8   36.3
 3     3   13.2   40.1
 4     4   13.9   43.0
 5     5   13.0   39.4
 6     6   20.8   51.5
 7     7   21.7   51.6
 8     8   12.6   37.7
 9     9    6.72  35.6
10    10    6.24  29.7
11    11    5.44  27.6
12    12   16.6   41.9

`rename()`

rename() changes the names of individual variables using new_name = old_name

Let’s practice using this function be renaming dep_time to departure time.

flights |>
  rename("departure time" = dep_time)

# A tibble: 336,776 × 19
    year month   day `departure time` sched_dep_time dep_delay arr_time
   <int> <int> <int>            <int>          <int>     <dbl>    <int>
 1  2013     1     1              517            515         2      830
 2  2013     1     1              533            529         4      850
 3  2013     1     1              542            540         2      923
 4  2013     1     1              544            545        -1     1004
 5  2013     1     1              554            600        -6      812
 6  2013     1     1              554            558        -4      740
 7  2013     1     1              555            600        -5      913
 8  2013     1     1              557            600        -3      709
 9  2013     1     1              557            600        -3      838
10  2013     1     1              558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Pivots

l <- tibble(
  state = rep(c("MT", "NC" , "SC"),2),
  group = c(rep("C", 3), rep("D", 3)),
  obs = c(1:6)
  )

l

# A tibble: 6 × 3
  state group   obs
  <chr> <chr> <int>
1 MT    C         1
2 NC    C         2
3 SC    C         3
4 MT    D         4
5 NC    D         5
6 SC    D         6

Pivot these data so that the data are wide. i.e. Each state should be it’s own unique observation (row). Save this new data set as w.

w <- l |>
  pivot_wider(
    names_from = group, 
    values_from = obs
  )

w

# A tibble: 3 × 3
  state     C     D
  <chr> <int> <int>
1 MT        1     4
2 NC        2     5
3 SC        3     6

Now, let’s change it back. Introducing pivot_longer. There are three things we need to consider with pivot_longer:

w |>
  pivot_longer(
    cols = !state, 
    names_to = "group",
    values_to = "obs"
  )

# A tibble: 6 × 3
  state group   obs
  <chr> <chr> <int>
1 MT    C         1
2 MT    D         4
3 NC    C         2
4 NC    D         5
5 SC    C         3
6 SC    D         6

Joins

Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.

There are many possible types of joins. All have the format something_join(x, y).

x <- tibble(
  value = c(1, 2, 3),
  xcol = c("x1", "x2", "x3")
  )

y <- tibble(
  value = c(1, 2, 4),
  ycol = c("y1", "y2", "y4")
  )

x

# A tibble: 3 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3

We will demonstrate each of the joins on these small, toy datasets.

Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!

names(x)

[1] "value" "xcol"

names(y)

[1] "value" "ycol"

`inner_join()`

Joins where the values of x and y are the same (1 and 2)

inner_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2

`left_join()`

Keep all the value of x, and attach y

left_join(x , y)

Joining with `by = join_by(value)`

# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA>

`right_join()`

Keep all the value of y, and attach x

right_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4

`full_join()`

keep both x and y

full_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 4 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4

`semi_join()`

only keep x that matches with y

semi_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 2 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2

`anti_join()`

only keep x that DO NOT match with y

anti_join(x, y)

Joining with `by = join_by(value)`

# A tibble: 1 × 2
  value xcol 
  <dbl> <chr>
1     3 x3

Joining by more than one variable

x2 <- tibble(
  value = c(1, 2, 3),
  value2 = c(4,5,6),
  xcol = c("x1", "x2", "x3")
  )

y2<- tibble(
  value = c(1, 2, 4), 
  value2 = c(4,5,7),
  ycol = c("y1", "y2", "y4")
  )

x2

# A tibble: 3 × 3
  value value2 xcol 
  <dbl>  <dbl> <chr>
1     1      4 x1   
2     2      5 x2   
3     3      6 x3

y2

# A tibble: 3 × 3
  value value2 ycol 
  <dbl>  <dbl> <chr>
1     1      4 y1   
2     2      5 y2   
3     4      7 y4

`inner_join()`

inner_join(x2, y2, by = c("value"))

# A tibble: 2 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2

inner_join(x2, y2 , by = c("value" , "value2"))

# A tibble: 2 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2

`left_join()`

left_join(x2 , y2, by = c("value"))

# A tibble: 3 × 5
  value value2.x xcol  value2.y ycol 
  <dbl>    <dbl> <chr>    <dbl> <chr>
1     1        4 x1           4 y1   
2     2        5 x2           5 y2   
3     3        6 x3          NA <NA>

left_join(x2 , y2, by = c("value" , "value2"))

# A tibble: 3 × 4
  value value2 xcol  ycol 
  <dbl>  <dbl> <chr> <chr>
1     1      4 x1    y1   
2     2      5 x2    y2   
3     3      6 x3    <NA>

Footnotes

From dplyr vignette ↩︎

Data wrangling with dplyr

Activities

select()

slice()

arrange()

filter()

count()

mutate()

changing variable type

group_by() + summarise()

rename()

Pivots

Joins

inner_join()

left_join()

right_join()

full_join()

semi_join()

anti_join()

Joining by more than one variable

inner_join()

left_join()

Footnotes

Data wrangling with `dplyr`

`select()`

`slice()`

`arrange()`

`filter()`

`count()`

`mutate()`

`group_by()` + `summarise()`

`rename()`

`inner_join()`

`left_join()`

`right_join()`

`full_join()`

`semi_join()`

`anti_join()`

`inner_join()`

`left_join()`