A data frame is a list of vectors which are of equal length. A matrix contains only one type of data, while a data frame accepts different data types (numeric, character, factor, etc.).
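The difference between the two structures can be sketched in a few lines. The object names `m` and `d` below are only for illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# A matrix holds a single type: mixing numbers and strings
# coerces every element to character
m <- cbind(c(1, 2), c("x", "y"))
typeof(m)

# A data frame keeps one type per column
d <- data.frame(num = c(1, 2), chr = c("x", "y"), stringsAsFactors = FALSE)
sapply(d, class)
```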
How to Create a Data Frame
We can create a data frame by passing the variables a, b, c, d into the data.frame() function. We can name the columns with names() by simply specifying the names of the variables.
data.frame(df, stringsAsFactors = TRUE)
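Arguments:
df: a matrix to convert to a data frame, or a collection of variables to join
stringsAsFactors: whether character vectors are converted to factors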
We can create our first data set by combining four variables of the same length.
# Create a, b, c, d variables
a <- c(10, 20, 30, 40)
b <- c('book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE, FALSE, TRUE, FALSE)
d <- c(2.5, 8, 10, 7)
# Join the variables to create a data frame
df <- data.frame(a, b, c, d)
df
a | b | c | d |
---|---|---|---|
10 | book | TRUE | 2.5 |
20 | pen | FALSE | 8.0 |
30 | textbook | TRUE | 10.0 |
40 | pencil_case | FALSE | 7.0 |
We can see that the column headers have the same names as the variables. We can change the column names with the function names(). Check the example below:
# Name the data frame
names(df) <- c('ID', 'items', 'store', 'price')
df
ID | items | store | price |
---|---|---|---|
10 | book | TRUE | 2.5 |
20 | pen | FALSE | 8.0 |
30 | textbook | TRUE | 10.0 |
40 | pencil_case | FALSE | 7.0 |
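As an alternative to renaming afterwards, the column names can be given directly in the call to data.frame(). This is a minimal sketch using the same values as above; the object name `df_named` is only for illustration.

```r
# Name the columns while creating the data frame
df_named <- data.frame(
  ID    = c(10, 20, 30, 40),
  items = c('book', 'pen', 'textbook', 'pencil_case'),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)
names(df_named)
```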
# Print the structure
str(df)
## 'data.frame': 4 obs. of 4 variables:
## $ ID : num 10 20 30 40
## $ items: chr "book" "pen" "textbook" "pencil_case"
## $ store: logi TRUE FALSE TRUE FALSE
## $ price: num 2.5 8 10 7
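Before R 4.0.0, data.frame() converted string variables to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, which is why items appears as chr in the output above.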
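How strings are stored can be checked by setting stringsAsFactors explicitly, which avoids relying on the default of the installed R version. A minimal sketch (the names `as_chr` and `as_fct` are only for illustration):

```r
items <- c('book', 'pen', 'textbook', 'pencil_case')

# Keep strings as character vectors
as_chr <- data.frame(items, stringsAsFactors = FALSE)
class(as_chr$items)

# Convert strings to factors
as_fct <- data.frame(items, stringsAsFactors = TRUE)
class(as_fct$items)
nlevels(as_fct$items)   # one level per distinct string
```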
It is possible to slice the values of a data frame. We select the rows and columns to return inside brackets preceded by the name of the data frame.
A data frame is composed of rows and columns, df[A, B]. A represents the rows and B the columns. We can slice by specifying the rows, the columns, or both.
From picture 1, the left part represents the rows, and the right part the columns. Note that the symbol : means "to". For instance, 1:3 selects the values from 1 to 3.
The diagram below shows how to access different selections of the data frame:
The yellow arrow selects row 1 in column 2.
The green arrow selects rows 1 to 2.
The red arrow selects column 1.
The blue arrow selects rows 1 to 3 and columns 3 to 4.
Note that if we leave the left part blank, R selects all the rows. Likewise, if we leave the right part blank, R selects all the columns.
## Select row 1 in column 2
df[1, 2]
## [1] "book"
## Select Rows 1 to 2
df[1:2, ]
ID | items | store | price |
---|---|---|---|
10 | book | TRUE | 2.5 |
20 | pen | FALSE | 8.0 |
## Select Columns 1
df[, 1]
## [1] 10 20 30 40
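The blue arrow selection (rows 1 to 3, columns 3 to 4) follows the same pattern. The sketch below rebuilds `df` so it runs on its own. It also shows the `drop = FALSE` argument, which is not covered in the text above: selecting a single column returns a plain vector unless `drop = FALSE` is given.

```r
df <- data.frame(
  ID    = c(10, 20, 30, 40),
  items = c('book', 'pen', 'textbook', 'pencil_case'),
  store = c(TRUE, FALSE, TRUE, FALSE),
  price = c(2.5, 8, 10, 7)
)

# Select rows 1 to 3 and columns 3 to 4
df[1:3, 3:4]

# A single column is returned as a vector by default
is.data.frame(df[, 1])

# drop = FALSE keeps the result as a one-column data frame
is.data.frame(df[, 1, drop = FALSE])
```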
It is also possible to select the columns with their names. For instance, the code below extracts two columns: ID and store.
# Slice with columns name
df[, c('ID', 'store')]
ID | store |
---|---|
10 | TRUE |
20 | FALSE |
30 | TRUE |
40 | FALSE |
You can also append a column to a Data Frame. You need to use the symbol $ to append a new variable.
# Create a new vector
quantity <- c(10, 35, 40, 5)
# Add `quantity` to the `df` data frame
df$quantity <- quantity
df
ID | items | store | price | quantity |
---|---|---|---|---|
10 | book | TRUE | 2.5 | 10 |
20 | pen | FALSE | 8.0 | 35 |
30 | textbook | TRUE | 10.0 | 40 |
40 | pencil_case | FALSE | 7.0 | 5 |
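The same $ syntax can also append a column computed from existing ones. This is a small sketch; the column name `total` is hypothetical and not part of the original example.

```r
df <- data.frame(
  ID       = c(10, 20, 30, 40),
  items    = c('book', 'pen', 'textbook', 'pencil_case'),
  price    = c(2.5, 8, 10, 7),
  quantity = c(10, 35, 40, 5)
)

# Append a column derived from two existing columns
df$total <- df$price * df$quantity
df$total
```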
Note: The number of elements in the vector has to be equal to the number of rows in the data frame. Executing the following statement raises an error:
quantity <- c(10, 35, 40)
# Add `quantity` to the `df` data frame
df$quantity <- quantity
## Error in `$<-.data.frame`(`*tmp*`, quantity, value = c(10, 35, 40)): replacement has 3 rows, data has 4
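One way to avoid this error is to compare the vector length with nrow() before assigning, or to catch the error with tryCatch(). The sketch below uses a reduced two-column `df` so it runs on its own; `result` is an illustrative name.

```r
df <- data.frame(ID = c(10, 20, 30, 40), price = c(2.5, 8, 10, 7))
quantity <- c(10, 35, 40)   # one value short

# Check the length before assigning
if (length(quantity) == nrow(df)) {
  df$quantity <- quantity
} else {
  message("quantity has ", length(quantity), " values, but df has ", nrow(df), " rows")
}

# Alternatively, catch the error and keep its message
result <- tryCatch(
  {
    df$quantity <- quantity
    "ok"
  },
  error = function(e) conditionMessage(e)
)
result
```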
Sometimes we need to store a column of a data frame for future use or perform an operation on a column. We can use the $ sign to select the column from a data frame.
# Select the column ID
df$ID
## [1] 10 20 30 40
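A column selected with $ is an ordinary vector, so it can be stored or passed to any vector function. A short sketch with the ID and price values from above (`ids` is an illustrative name):

```r
df <- data.frame(ID = c(10, 20, 30, 40), price = c(2.5, 8, 10, 7))

# Store the column for later use
ids <- df$ID

# Perform operations on a column
mean(df$price)   # 6.875
df$price * 2     # 5 16 20 14
```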
In the previous section, we selected an entire column without condition. It is also possible to subset based on whether or not a certain condition is true.
We use the subset() function.
subset(x, condition)
Arguments:
x: the data frame to subset
condition: the logical condition that rows must satisfy
To return only the items with a price above 5, we can do:
# Select price above 5
subset(df, subset = price > 5)
| | ID | items | store | price | quantity |
|---|---|---|---|---|---|
| 2 | 20 | pen | FALSE | 8 | 35 |
| 3 | 30 | textbook | TRUE | 10 | 40 |
| 4 | 40 | pencil_case | FALSE | 7 | 5 |
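The same rows can be obtained with logical indexing inside brackets, and subset() also accepts a select argument to restrict the columns. The sketch below rebuilds `df` so it runs on its own; `by_subset`, `by_index` and `in_store` are illustrative names.

```r
df <- data.frame(
  ID       = c(10, 20, 30, 40),
  items    = c('book', 'pen', 'textbook', 'pencil_case'),
  store    = c(TRUE, FALSE, TRUE, FALSE),
  price    = c(2.5, 8, 10, 7),
  quantity = c(10, 35, 40, 5)
)

# subset() and logical indexing return the same rows
by_subset <- subset(df, subset = price > 5)
by_index  <- df[df$price > 5, ]

# Combine conditions and keep only some columns
in_store <- subset(df, subset = price > 5 & store, select = c(items, price))
in_store
```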
A work by Gianluca Sottile
gianluca.sottile@unipa.it