What is a factor in R?

A factor is R’s data type for categorical variables (variables that take a limited set of values, called levels). In most datasets you will see two broad types of variables:

  • Categorical: a finite set of categories (e.g., country, gender, job title).
  • Continuous: numeric measurements on a scale (e.g., income, temperature, price).

R's modeling and plotting functions treat factors differently from numeric variables (for example, by counting observations per level instead of computing a mean), which is why it matters to store categorical data as factors.
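
As a minimal sketch of that difference, the snippet below passes the same values to summary() twice, once as plain numbers and once as a factor. The vector codes is hypothetical data made up for illustration.

```r
# Hypothetical data: 0/1 codes standing in for a two-category variable
codes <- c(1, 0, 0, 1, 1)

# Stored as numeric: summary() reports min, quartiles, mean and max
summary(codes)

# Stored as a factor: summary() reports one count per level
summary(factor(codes))
```

The numeric summary includes a mean of 0.6, which is not meaningful for category codes, while the factor summary simply counts two 0s and three 1s.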

Categorical variables

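You create factors with factor(). The call below is a simplified form of its signature (see ?factor for the full list of arguments and defaults):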

factor(x, levels = NULL, labels = levels, ordered = FALSE)

Key arguments:

  • x: a vector (typically character, integer, or numeric codes).
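  • levels: the set of possible values and their order. If omitted, R uses the sorted unique values of x; values of x that are not listed in levels become NA.
  • labels: the human-readable labels for the levels, matched to levels by position.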
  • ordered: whether levels should be treated as ordered.
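
The effect of the levels argument can be sketched as follows. The vector x is hypothetical and only serves to show the default level order, an explicit level order, and what happens to values left out of levels.

```r
# Hypothetical input values
x <- c("low", "high", "medium", "high")

# Default: levels are the sorted unique values of x
levels(factor(x))
## [1] "high"   "low"    "medium"

# Explicit levels control the order in which levels are stored
levels(factor(x, levels = c("low", "medium", "high")))
## [1] "low"    "medium" "high"

# Values of x that are missing from `levels` are turned into NA
factor(x, levels = c("low", "high"))
## [1] low  high <NA> high
## Levels: low high
```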

Example: character to factor

gender <- c("Male", "Female", "Female", "Male", "Male")

class(gender)
## [1] "character"
gender_f <- factor(gender)
class(gender_f)
## [1] "factor"
gender_f
## [1] Male   Female Female Male   Male  
## Levels: Female Male
levels(gender_f)
## [1] "Female" "Male"

Example: numeric codes with labels

gender_code <- c(1, 0, 0, 1, 1)

gender_labeled <- factor(
  gender_code,
  levels = c(0, 1),
  labels = c("Female", "Male")
)

gender_labeled
## [1] Male   Female Female Male   Male  
## Levels: Female Male
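
One thing to watch for with labeled codes: converting the factor back to a number returns R's internal level positions, not the original codes. The sketch below rebuilds gender_labeled from the example above; the lookup vector code_lookup is a hypothetical helper added here for illustration.

```r
gender_code <- c(1, 0, 0, 1, 1)
gender_labeled <- factor(gender_code, levels = c(0, 1), labels = c("Female", "Male"))

# Internal codes are positions in levels(), starting at 1
as.integer(gender_labeled)
## [1] 2 1 1 2 2

# The labels themselves come back with as.character()
as.character(gender_labeled)
## [1] "Male"   "Female" "Female" "Male"   "Male"

# Hypothetical lookup to map labels back to the original 0/1 codes
code_lookup <- c(Female = 0, Male = 1)
unname(code_lookup[as.character(gender_labeled)])
## [1] 1 0 0 1 1
```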

Nominal vs ordinal categorical variables

Nominal factors

Nominal categories have no natural order (e.g., colors).

color <- c("blue", "red", "green", "white", "black", "yellow")
color_f <- factor(color)
color_f
## [1] blue   red    green  white  black  yellow
## Levels: black blue green red white yellow
levels(color_f)
## [1] "black"  "blue"   "green"  "red"    "white"  "yellow"

For a nominal factor the level order carries no meaning. By default R uses the sorted unique values (alphabetical for character vectors); you can change this by passing levels explicitly.
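
If a different level order is more convenient (for example in plots or tables), two common options are sketched below, reusing the color vector from above. The names color_seen and color_ref are introduced here for illustration only.

```r
color <- c("blue", "red", "green", "white", "black", "yellow")

# Keep levels in order of first appearance instead of sorted order
color_seen <- factor(color, levels = unique(color))
levels(color_seen)
## [1] "blue"   "red"    "green"  "white"  "black"  "yellow"

# Move a single level to the front; the other levels keep their order
color_ref <- relevel(factor(color), ref = "red")
levels(color_ref)
## [1] "red"    "black"  "blue"   "green"  "white"  "yellow"
```

Neither call changes the data values; only the order in which the levels are stored changes.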

Ordinal factors

Ordinal categories have a natural order (e.g., low < medium < high).
In R, you can encode that by setting ordered = TRUE and supplying levels in the correct order.

Example 1: create an ordered factor

day <- c("evening", "morning", "afternoon", "midday", "midnight", "evening")

day_ord <- factor(
  day,
  ordered = TRUE,
  levels = c("morning", "midday", "afternoon", "evening", "midnight")
)

day_ord
## [1] evening   morning   afternoon midday    midnight  evening  
## Levels: morning < midday < afternoon < evening < midnight

Example 2: summarize levels

summary(day_ord)
##   morning    midday afternoon   evening  midnight 
##         1         1         1         2         1

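summary() counts the observations in each level and reports them in level order. Because the factor is ordered, printing it (as in Example 1) shows the order of the levels using <.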
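
An ordered factor also supports comparisons, which an unordered factor does not. A minimal sketch, rebuilding day_ord from Example 1:

```r
day <- c("evening", "morning", "afternoon", "midday", "midnight", "evening")
day_ord <- factor(
  day,
  ordered = TRUE,
  levels = c("morning", "midday", "afternoon", "evening", "midnight")
)

# Element-wise comparison against a level: which entries come before "evening"?
day_ord < "evening"
## [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE

# min() and max() follow the level order, not alphabetical order
as.character(min(day_ord))
## [1] "morning"
as.character(max(day_ord))
## [1] "midnight"
```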

Continuous variables

Continuous variables are usually stored as numeric (or sometimes integer). For example, mtcars$mpg (miles per gallon) is stored as numeric and is treated as a continuous variable.

dataset <- mtcars
class(dataset$mpg)
## [1] "numeric"

Practical tip (modeling)

In many modeling functions (for example lm() and glm()), factors are automatically expanded into indicator (dummy) variables in the design matrix, so you do not have to create those columns by hand. This is one reason factors are important for regression and other machine learning methods.
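
The expansion can be inspected with model.matrix(). The sketch below assumes R's default treatment contrasts and uses mtcars$cyl, which is stored as numeric but only takes the values 4, 6 and 8; the names cars, cyl_f, X and fit are introduced here for illustration.

```r
# Work on a copy so mtcars itself is left untouched
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)

# Design matrix: an intercept plus one indicator column per non-reference level
X <- model.matrix(mpg ~ cyl_f, data = cars)
colnames(X)
## [1] "(Intercept)" "cyl_f6"      "cyl_f8"

# lm() builds the same columns internally, so it returns three coefficients
fit <- lm(mpg ~ cyl_f, data = cars)
names(coef(fit))
## [1] "(Intercept)" "cyl_f6"      "cyl_f8"
```

The first level (4 cylinders) acts as the reference, so the cyl_f6 and cyl_f8 coefficients are differences in mean mpg relative to 4-cylinder cars. Had cyl been left as numeric, the model would instead fit a single slope.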

 

A work by Gianluca Sottile

gianluca.sottile@unipa.it