Very often, data comes from multiple sources. To analyze it, you typically need to join (merge) two data frames using one or more key columns.

Join types (concepts)

In base R, the main tool is merge(). By default, merge(x, y) performs an inner join (keeps only matching keys). You can control outer joins with: - all.x = TRUEleft join (keep all rows from x) - all.y = TRUEright join (keep all rows from y) - all = TRUE (or both all.x and all.y) → full join (keep all rows from both)

Example data

library(tibble)

producers <- tibble(
  surname = c("Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Polanski"),
  nationality = c("US", "US", "UK", "US", "Poland")
)

movies <- tibble(
  surname = c("Spielberg", "Scorsese", "Hitchcock", "Hitchcock", "Spielberg", "Tarantino", "Polanski"),
  title = c(
    "Super 8",
    "Taxi Driver",
    "Psycho",
    "North by Northwest",
    "Catch Me If You Can",
    "Reservoir Dogs",
    "Chinatown"
  )
)

producers
surname nationality
Spielberg US
Scorsese US
Hitchcock UK
Tarantino US
Polanski Poland
movies
surname title
Spielberg Super 8
Scorsese Taxi Driver
Hitchcock Psycho
Hitchcock North by Northwest
Spielberg Catch Me If You Can
Tarantino Reservoir Dogs
Polanski Chinatown

Inner join (matches only)

Base R: merge()

m_inner <- merge(producers, movies, by = "surname")
m_inner
surname nationality title
Hitchcock UK Psycho
Hitchcock UK North by Northwest
Polanski Poland Chinatown
Scorsese US Taxi Driver
Spielberg US Super 8
Spielberg US Catch Me If You Can
Tarantino US Reservoir Dogs
dim(m_inner)
## [1] 7 3

This returns only producers that appear in both tables (matching surname).

dplyr alternative (optional)

library(dplyr)

m_inner2 <- producers |>
  inner_join(movies, by = "surname")

inner_join() keeps rows with keys present in both tables and adds columns from y to x.

Join with different key names

Sometimes the key columns don’t have the same name.

movies2 <- movies |>
  dplyr::rename(name = surname)

m_inner_diffkey <- merge(producers, movies2, by.x = "surname", by.y = "name")
head(m_inner_diffkey)
surname nationality title
Hitchcock UK Psycho
Hitchcock UK North by Northwest
Polanski Poland Chinatown
Scorsese US Taxi Driver
Spielberg US Super 8
Spielberg US Catch Me If You Can

Left join (keep all from x)

A left join keeps all rows from the left table (x) and fills missing matches from the right table (y) with NA.

library(dplyr)

producers_plus <- producers |>
  add_row(surname = "Lucas", nationality = "US")

# Left join with base R:
m_left <- merge(producers_plus, movies2, by.x = "surname", by.y = "name", all.x = TRUE)
m_left
surname nationality title
Hitchcock UK Psycho
Hitchcock UK North by Northwest
Lucas US NA
Polanski Poland Chinatown
Scorsese US Taxi Driver
Spielberg US Super 8
Spielberg US Catch Me If You Can
Tarantino US Reservoir Dogs
dim(m_left)
## [1] 8 3

Because “Lucas” has no match in movies2, the title column becomes NA for that row.

dplyr equivalent (optional)

m_left2 <- producers_plus |>
  left_join(movies2, by = c("surname" = "name"))

Full join (keep all rows)

A full join keeps all rows from both tables and uses NA where there is no match.

m_full <- merge(producers_plus, movies2, by.x = "surname", by.y = "name", all = TRUE)
m_full
surname nationality title
Hitchcock UK Psycho
Hitchcock UK North by Northwest
Lucas US NA
Polanski Poland Chinatown
Scorsese US Taxi Driver
Spielberg US Super 8
Spielberg US Catch Me If You Can
Tarantino US Reservoir Dogs
dim(m_full)
## [1] 8 3
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it