Very often, data comes from multiple sources. To analyze it, you typically need to join (merge) two data frames using one or more key columns.
In base R, the main tool is merge(). By default,
merge(x, y) performs an inner join (keeps
only matching keys). You can control outer joins with: -
all.x = TRUE → left join (keep all rows
from x) - all.y = TRUE → right
join (keep all rows from y) -
all = TRUE (or both all.x and
all.y) → full join (keep all rows from
both)
library(tibble)
producers <- tibble(
surname = c("Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Polanski"),
nationality = c("US", "US", "UK", "US", "Poland")
)
movies <- tibble(
surname = c("Spielberg", "Scorsese", "Hitchcock", "Hitchcock", "Spielberg", "Tarantino", "Polanski"),
title = c(
"Super 8",
"Taxi Driver",
"Psycho",
"North by Northwest",
"Catch Me If You Can",
"Reservoir Dogs",
"Chinatown"
)
)
producers| surname | nationality |
|---|---|
| Spielberg | US |
| Scorsese | US |
| Hitchcock | UK |
| Tarantino | US |
| Polanski | Poland |
| surname | title |
|---|---|
| Spielberg | Super 8 |
| Scorsese | Taxi Driver |
| Hitchcock | Psycho |
| Hitchcock | North by Northwest |
| Spielberg | Catch Me If You Can |
| Tarantino | Reservoir Dogs |
| Polanski | Chinatown |
| surname | nationality | title |
|---|---|---|
| Hitchcock | UK | Psycho |
| Hitchcock | UK | North by Northwest |
| Polanski | Poland | Chinatown |
| Scorsese | US | Taxi Driver |
| Spielberg | US | Super 8 |
| Spielberg | US | Catch Me If You Can |
| Tarantino | US | Reservoir Dogs |
## [1] 7 3
This returns only producers that appear in both tables (matching
surname).
Sometimes the key columns don’t have the same name.
movies2 <- movies |>
dplyr::rename(name = surname)
m_inner_diffkey <- merge(producers, movies2, by.x = "surname", by.y = "name")
head(m_inner_diffkey)| surname | nationality | title |
|---|---|---|
| Hitchcock | UK | Psycho |
| Hitchcock | UK | North by Northwest |
| Polanski | Poland | Chinatown |
| Scorsese | US | Taxi Driver |
| Spielberg | US | Super 8 |
| Spielberg | US | Catch Me If You Can |
A left join keeps all rows from the left table (x) and
fills missing matches from the right table (y) with
NA.
library(dplyr)
producers_plus <- producers |>
add_row(surname = "Lucas", nationality = "US")
# Left join with base R:
m_left <- merge(producers_plus, movies2, by.x = "surname", by.y = "name", all.x = TRUE)
m_left| surname | nationality | title |
|---|---|---|
| Hitchcock | UK | Psycho |
| Hitchcock | UK | North by Northwest |
| Lucas | US | NA |
| Polanski | Poland | Chinatown |
| Scorsese | US | Taxi Driver |
| Spielberg | US | Super 8 |
| Spielberg | US | Catch Me If You Can |
| Tarantino | US | Reservoir Dogs |
## [1] 8 3
Because “Lucas” has no match in movies2, the
title column becomes NA for that row.
A full join keeps all rows from both tables and uses NA
where there is no match.
| surname | nationality | title |
|---|---|---|
| Hitchcock | UK | Psycho |
| Hitchcock | UK | North by Northwest |
| Lucas | US | NA |
| Polanski | Poland | Chinatown |
| Scorsese | US | Taxi Driver |
| Spielberg | US | Super 8 |
| Spielberg | US | Catch Me If You Can |
| Tarantino | US | Reservoir Dogs |
## [1] 8 3
A work by Gianluca Sottile
gianluca.sottile@unipa.it