What is a data frame?

A data frame is a table-like object where:

  • each column is a vector,
  • all columns have the same length,
  • different columns can have different types (numeric, character, logical, factor, …).

A matrix can only store one type (all numeric, all character, etc.), while a data frame can mix types.
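
To make the matrix/data frame contrast concrete, here is a minimal sketch. The vectors `id_vec` and `label_vec` are hypothetical example values, not part of the data used later in this section.

```r
# Hypothetical example vectors (assumed for illustration only)
id_vec    <- c(1, 2, 3)
label_vec <- c("a", "b", "c")

# cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(id_vec, label_vec)
typeof(m)

# data.frame() keeps each column's own type
d <- data.frame(id = id_vec, label = label_vec, stringsAsFactors = FALSE)
sapply(d, class)
```

The matrix ends up with a single storage type, while the data frame reports one class per column.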

Create a data frame

You can create a data frame with data.frame() or (recommended for modern workflows) tibble::tibble().

Base R: data.frame()

data.frame(..., stringsAsFactors = FALSE)

Note: Since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay character. Setting it explicitly keeps code behaving the same on older R versions and makes the choice visible when teaching.
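
The effect of the argument can be checked directly. This is a small sketch with a hypothetical vector `grp_vec`; it sets stringsAsFactors both ways and compares the resulting column classes.

```r
grp_vec <- c("x", "y", "x")   # hypothetical example values

as_chr <- data.frame(grp = grp_vec, stringsAsFactors = FALSE)
as_fct <- data.frame(grp = grp_vec, stringsAsFactors = TRUE)

class(as_chr$grp)    # character column is kept as-is
class(as_fct$grp)    # character column is converted to a factor
levels(as_fct$grp)   # the distinct values become factor levels
```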

Example

df <- data.frame(
  ID    = c(10, 20, 30, 40),
  item  = c("book", "pen", "textbook", "pencil_case"),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)

df
ID  item         store  price
10  book         TRUE   2.5
20  pen          FALSE  8.0
30  textbook     TRUE   10.0
40  pencil_case  FALSE  7.0
str(df)
## 'data.frame':    4 obs. of  4 variables:
##  $ ID   : num  10 20 30 40
##  $ item : chr  "book" "pen" "textbook" "pencil_case"
##  $ store: logi  TRUE FALSE TRUE FALSE
##  $ price: num  2.5 8 10 7

Tidy alternative: tibble

library(tibble)

df_tbl <- tibble(
  ID    = c(10, 20, 30, 40),
  item  = c("book", "pen", "textbook", "pencil_case"),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)

df_tbl
ID  item         store  price
10  book         TRUE   2.5
20  pen          FALSE  8.0
30  textbook     TRUE   10.0
40  pencil_case  FALSE  7.0
str(df_tbl)
## tibble [4 × 4] (S3: tbl_df/tbl/data.frame)
##  $ ID   : num [1:4] 10 20 30 40
##  $ item : chr [1:4] "book" "pen" "textbook" "pencil_case"
##  $ store: logi [1:4] TRUE FALSE TRUE FALSE
##  $ price: num [1:4] 2.5 8 10 7

Slice (index) a data frame

Indexing uses df[rows, cols]:

  • leaving rows blank means “all rows”
  • leaving cols blank means “all columns”

# One cell: row 1, column 2
df[1, 2]
## [1] "book"
# Rows 1 to 2 (all columns)
df[1:2, ]
ID  item  store  price
10  book  TRUE   2.5
20  pen   FALSE  8.0
# Column 1 (all rows)
df[, 1]
## [1] 10 20 30 40

Select columns by name:

df[, c("ID", "store")]
ID  store
10  TRUE
20  FALSE
30  TRUE
40  FALSE
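
Row and column selectors can be combined in one call, and the row position also accepts negative or logical vectors. The following sketch rebuilds the example data under the hypothetical name `items` so that it runs on its own.

```r
# Same values as the example data frame, under a separate (hypothetical) name
items <- data.frame(
  ID    = c(10, 20, 30, 40),
  item  = c("book", "pen", "textbook", "pencil_case"),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7),
  stringsAsFactors = FALSE
)

# Rows 2 and 4, columns "item" and "price"
picked <- items[c(2, 4), c("item", "price")]

# A negative index drops rows: everything except row 1
without_first <- items[-1, ]

# A logical vector keeps the rows where it is TRUE
in_store <- items[items$store, ]
```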

Tip: extracting a single column can be done three ways:

df$ID         # convenient interactive use
## [1] 10 20 30 40
df[["ID"]]    # safest programmatically
## [1] 10 20 30 40
df[, "ID"]    # returns a vector by default
## [1] 10 20 30 40
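
One reason `[[` is often considered safer in programs is that `$` partially matches column names on a base data frame, while `[[` expects the exact name by default. The sketch below uses a small hypothetical data frame `items`; it also shows drop = FALSE, which keeps a single selected column as a data frame instead of a vector.

```r
# Small hypothetical data frame for illustration
items <- data.frame(
  ID   = c(10, 20),
  item = c("book", "pen"),
  stringsAsFactors = FALSE
)

# `$` partially matches: "it" resolves to the "item" column
by_dollar <- items$it

# `[[` needs the exact name by default, so the abbreviation gives NULL
by_brackets <- items[["it"]]

# drop = FALSE keeps the result as a one-column data frame
one_col <- items[, "ID", drop = FALSE]
class(one_col)
```

These behaviours apply to base data frames; tibbles are stricter and may behave differently.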

Append a column

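A new column should have as many elements as the data frame has rows. A shorter vector is recycled only when its length divides the row count evenly (so a single value fills every row); any other length is an error.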

quantity <- c(10, 35, 40, 5)
df$quantity <- quantity
df
ID  item         store  price  quantity
10  book         TRUE   2.5    10
20  pen          FALSE  8.0    35
30  textbook     TRUE   10.0   40
40  pencil_case  FALSE  7.0    5

If lengths don’t match, R errors:

bad_quantity <- c(10, 35, 40)
df$quantity <- bad_quantity
## Error in `$<-.data.frame`:
## ! replacement has 3 rows, data has 4
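
As a rough sketch of how column assignment treats different lengths, the example below adds a full-length column, a single recycled value, and a column computed from existing ones, then catches the length error instead of stopping. The data frame `items` and the `currency` and `total` columns are hypothetical names chosen for illustration.

```r
# Hypothetical data frame with four rows
items <- data.frame(price = c(2.5, 8, 10, 7))

# Full-length vector: one value per row
items$quantity <- c(10, 35, 40, 5)

# Single value: recycled to every row
items$currency <- "EUR"

# Column computed from existing columns
items$total <- items$price * items$quantity

# Length 3 does not fit 4 rows, so the assignment fails; tryCatch() records that
outcome <- tryCatch(
  {
    items$bad <- c(1, 2, 3)
    "assigned"
  },
  error = function(e) "error"
)
outcome
```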

Modern alternative with dplyr::mutate(), which fits well in pipelines (the native pipe |> requires R ≥ 4.1):

library(dplyr)

df2 <- df |>
  mutate(quantity = c(10, 35, 40, 5))

df2
ID  item         store  price  quantity
10  book         TRUE   2.5    10
20  pen          FALSE  8.0    35
30  textbook     TRUE   10.0   40
40  pencil_case  FALSE  7.0    5

Subset (filter) rows

subset()

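subset() is convenient for interactive exploration because column names can be used directly in the condition. Its own documentation recommends bracket indexing for programming, since its non-standard evaluation can behave unexpectedly inside functions.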

subset(df, price > 5)
   ID  item         store  price  quantity
2  20  pen          FALSE  8      35
3  30  textbook     TRUE   10     40
4  40  pencil_case  FALSE  7      5
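
The bracket-indexing form of the same filter can be sketched as follows. The data frame `items` is a hypothetical variant of the example data with one missing price added, because missing values are where the two approaches differ: subset() drops rows whose condition is NA, while a plain logical index returns an all-NA row for them.

```r
# Hypothetical data with one missing price (the NA is an assumption for illustration)
items <- data.frame(
  item  = c("book", "pen", "textbook", "pencil_case"),
  price = c(2.5, 8, NA, 7),
  stringsAsFactors = FALSE
)

# subset(): rows where the condition is NA are dropped
via_subset <- subset(items, price > 5)

# Plain logical index: the NA condition produces a row filled with NA
via_bracket <- items[items$price > 5, ]

# which() keeps only positions where the condition is TRUE
via_which <- items[which(items$price > 5), ]
```

Wrapping the condition in which() is one common way to get subset()-like results from bracket indexing.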

dplyr::filter() alternative

df |>
  filter(price > 5)
ID  item         store  price  quantity
20  pen          FALSE  8      35
30  textbook     TRUE   10     40
40  pencil_case  FALSE  7      5
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it