A factor is R’s data type for categorical variables (variables that take a limited set of values, called levels). In most datasets you will see two broad types of variables: categorical variables, whose levels may be nominal or ordered, and continuous variables.
R models and plotting functions often treat factors differently from numeric variables, which is why it matters to store categorical data as factors.
You create factors with factor():
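As a minimal sketch of such a call, the example below builds an ordered factor from integer codes. The vector `size_codes`, its values, and the labels are hypothetical and chosen only for illustration; only `factor()` and its arguments `x`, `levels`, `labels`, and `ordered` come from base R.

```r
# Hypothetical integer codes for a size variable (illustrative data only).
size_codes <- c(1, 3, 2, 3, 1)

sizes <- factor(
  x       = size_codes,
  levels  = c(1, 2, 3),                     # the allowed values, in the desired order
  labels  = c("small", "medium", "large"),  # labels are matched to levels by position
  ordered = TRUE                            # treat the levels as ordered
)

print(sizes)              # the values shown with their labels
print(levels(sizes))      # the labels, in level order
print(is.ordered(sizes))  # TRUE, because ordered = TRUE was requested
```

Values of `x` that do not appear in `levels` become `NA`, so it is worth checking the result whenever `levels` is supplied by hand.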
Key arguments:
x: a vector (typically character, integer, or numeric codes).
levels: the set of possible values (and their order, if provided).
labels: the human-readable labels for the levels.
ordered: whether levels should be treated as ordered.
## [1] "character"
## [1] "factor"
## [1] Male Female Female Male Male
## Levels: Female Male
## [1] "Female" "Male"
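The output above is consistent with converting a character vector to a factor and inspecting it. The sketch below is one way to reproduce it; the variable names `gender` and `gender_f` are assumptions, since the code that produced the output is not shown here.

```r
# Assumed input: a character vector whose values match the printed output.
gender <- c("Male", "Female", "Female", "Male", "Male")
class(gender)       # "character"

# Convert to a factor; with no `levels` argument, the levels are the sorted unique values.
gender_f <- factor(gender)
class(gender_f)     # "factor"
gender_f            # prints the values followed by "Levels: Female Male"
levels(gender_f)    # "Female" "Male"
```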
Nominal categories have no natural order (e.g., colors).
## [1] blue red green white black yellow
## Levels: black blue green red white yellow
## [1] "black" "blue" "green" "red" "white" "yellow"
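The order in which the levels are listed does not imply a ranking; it only reflects how R stores them. By default the levels are the sorted unique values (alphabetical for character data), unless you specify levels explicitly.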
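To make the effect of `levels` concrete, the sketch below builds the same nominal factor twice: once with the default ordering and once with an explicit one. The variable names and the particular explicit order are hypothetical; the color values are taken from the output above.

```r
# Values taken from the printed output; the variable name is an assumption.
color_values <- c("blue", "red", "green", "white", "black", "yellow")

# Default: the levels are the sorted unique values.
colors_default <- factor(color_values)
levels(colors_default)

# Explicit: the levels follow the order given in `levels` (an arbitrary order here).
colors_explicit <- factor(
  color_values,
  levels = c("white", "yellow", "red", "green", "blue", "black")
)
levels(colors_explicit)

# The data values are the same in both factors; only the level order differs.
identical(as.character(colors_default), as.character(colors_explicit))
```

For a nominal variable, the explicit order mainly affects presentation, such as the order of bars in a plot or of rows in a table, and which level is used as the reference in a model.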
Continuous variables are usually stored as numeric (or sometimes integer). For example, mtcars$mpg (miles per gallon) is a continuous variable, and it is stored as numeric.
## [1] "numeric"
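A short sketch of how the storage type can be checked, using the built-in `mtcars` data. The conversion of `cyl` at the end is an illustrative assumption about a column one might treat as categorical; it is not something the text prescribes.

```r
# mpg is stored as a numeric vector, not as a factor.
class(mtcars$mpg)        # "numeric"
is.numeric(mtcars$mpg)   # TRUE
is.factor(mtcars$mpg)    # FALSE

# cyl is also stored as numeric, although it takes only a few distinct values.
# Converting it to a factor is a modeling choice, shown here only as an example.
class(mtcars$cyl)        # "numeric"
cyl_f <- factor(mtcars$cyl)
levels(cyl_f)            # "4" "6" "8"
```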
In many modeling functions, factors are automatically expanded into indicator (dummy) variables in the design matrix. This is one reason factors are important for regression and other ML methods.
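The expansion can be inspected directly with `model.matrix()`. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `y` and `group`) and assumes R's default treatment contrasts for unordered factors.

```r
# Hypothetical toy data: a numeric response and a three-level factor.
toy <- data.frame(
  y     = c(1.2, 2.3, 3.1, 2.8, 1.9, 3.5),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# Build the design matrix that a call such as lm(y ~ group, data = toy) would use.
X <- model.matrix(y ~ group, data = toy)

colnames(X)   # "(Intercept)" "groupb" "groupc"
X             # one 0/1 indicator column per non-reference level
```

With these defaults the first level (`a`) serves as the reference category, so it gets no column of its own and is absorbed into the intercept.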
A work by Gianluca Sottile
gianluca.sottile@unipa.it