One of the most widely data store is the .csv (comma-separated values) file formats. R loads an array of libraries during the start-up, including the utils package. This package is convenient to open csv files combined with the reading.csv() function. Here is the syntax for read.csv
read.csv(file, header = TRUE, sep = ",")
Argument:
,
.We will read the data file name mtcats. The csv file is stored online. If your .csv file is stored locally, you can replace the PATH inside the code snippet. Don’t forget to wrap it inside ’ ’. The PATH needs to be a string value.
For mac user, the path for the download folder is:
"/Users/USERNAME/Downloads/FILENAME.csv"
For windows user:
"C:\Users\USERNAME\Downloads\FILENAME.csv"
Note that, we should always specify the extension of the file name.
'https://raw.githubusercontent.com/guru99-edu/R-Programming/master/mtcars.csv'
PATH <- read.csv(PATH, header = TRUE, sep = ',')
df <-length(df)
## [1] 12
class(df$X)
## [1] "character"
R, by default, returns character values as Factor. We can turn off this setting by adding stringsAsFactors = FALSE.
'https://raw.githubusercontent.com/guru99-edu/R-Programming/master/mtcars.csv'
PATH <-read.csv(PATH, header =TRUE, sep = ',', stringsAsFactors =FALSE)
df <-class(df$X)
## [1] "character"
The class for the variable X is now a character.
Excel files are very popular among data analysts. Spreadsheets are easy to work with and flexible. R is equipped with a library readxl to import Excel spreadsheet.
Use this code
require(readxl)
## Loading required package: readxl
to check if readxl is installed in your machine. If you install r with r-conda-essential, the library is already installed. You should see in the command window:
If the package does not exit, you can install it with the conda library or in the terminal, use conda install -c mittner r-readxl.
Use the following command to load the library to import excel files.
library(readxl)
We use the examples included in the package readxl during this tutorial.
Use code
readxl_example()
## [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"
## [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"
## [9] "type-me.xls" "type-me.xlsx"
to see all the available spreadsheets in the library.
To check the location of the spreadsheet named clippy.xls, simple use
readxl_example("geometry.xls")
## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readxl/extdata/geometry.xls"
The function read_excel() is of great use when it comes to opening xls and xlsx extention.
The syntax is:
read_excel(PATH, sheet = NULL, range = NULL, col_names = TRUE)
Arguments:
-PATH: Path where the excel is located -sheet: Select the sheet to import. By default, all -range: Select the range to import. By default, all non-null cells -col_names: Select the columns to import. By default, all non-null columns
We can import the spreadsheets from the readxl library and count the number of columns in the first sheet.
# Store the path of `datasets.xlsx`
readxl_example("datasets.xlsx")
example <-# Import the spreadsheet
read_excel(example)
df <-# Count the number of columns
length(df)
## [1] 5
The file datasets.xlsx is composed of 4 sheets. We can find out which sheets are available in the workbook by using excel_sheets() function
readxl_example("datasets.xlsx")
example <-excel_sheets(example)
## [1] "iris" "mtcars" "chickwts" "quakes"
If a worksheet includes many sheets, it is easy to select a particular sheet by using the sheet arguments. We can specify the name of the sheet or the sheet index. We can verify if both function returns the same output with identical().
readxl_example("datasets.xlsx")
example <- read_excel(example, sheet = "quakes")
quake <-1 <-read_excel(example, sheet = 4)
quake_identical(quake, quake_1)
## [1] TRUE
We can control what cells to read in 2 ways
For example, we set n_max equals to 5 to import the first five rows.
# Read the first five row: with header
read_excel(example, n_max = 5, col_names = TRUE)
iris <- iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
If we change col_names to FALSE, R creates the headers automatically.
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
We can also use the argument range to select rows and columns in the spreadsheet. In the code below, we use the excel style to select the range A1 to B5.
# Read rows A1 to B5
1 <-read_excel(example, range = "A1:B5", col_names =TRUE)
example_dim(example_1)
## [1] 4 2
We can see that the example_1 returns 4 rows with 2 columns. The dataset has header, that the reason the dimension is 4x2.
In the second example, we use the function cell_rows() which controls the range of rows to return. If we want to import the rows 1 to 5, we can set cell_rows(1:5). Note that, cell_rows(1:5) returns the same output as cell_rows(5:1).
# Read rows 1 to 5
2 <-read_excel(example, range =cell_rows(1:5),col_names =TRUE)
example_dim(example_2)
## [1] 4 5
The example_2 however is a 4x5 matrix. The iris dataset has 5 columns with header. We return the first four rows with header of all columns
In case we want to import rows which do not begin at the first row, we have to include col_names = FALSE. If we use range = cell_rows(2:5), it becomes obvious our data frame does not have header anymore.
read_excel(example, range = cell_rows(2:3), col_names = TRUE)
iris_row_with_header <- read_excel(example, range = cell_rows(2:3), col_names = FALSE) iris_row_no_header <-
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
We can select the columns with the letter, like in Excel.
# Select columns A and B
read_excel(example, range =cell_cols("A:B"))
col <-dim(col)
## [1] 150 2
Note: range = cell_cols("A:B")
, returns output all cells with non-null value. The dataset contains 150 rows, therefore, read_excel() returns rows up to 150. This is verified with the dim() function.
read_excel() returns NA when a symbol without numerical value appears in the cell. We can count the number of missing values with the combination of two functions
Here is the code
read_excel(example, na = "setosa")
iris_na <-sum(is.na(iris_na))
## [1] 50
We have 50 values missing, which are the rows belonging to the setosa species.
We will import different files format with the heaven package. This package support SAS, STATA and SPSS softwares. We can use the following function to open different types of dataset, according to the extension of the file:
Only one argument is required within these function. We need to know the PATH where the file is stored. That’s it, we are ready to open all the files from SAS, STATA and SPSS. These three function accepts an URL as well.
For our example, we are going to use the admission dataset from IDRE.
'https://github.com/guru99-edu/R-Programming/blob/master/binary.sas7bdat?raw=true'
PATH_sas <- read_sas(PATH_sas)
df <-head(df)
ADMIT | GRE | GPA | RANK |
---|---|---|---|
0 | 380 | 3.61 | 3 |
1 | 660 | 3.67 | 3 |
1 | 800 | 4.00 | 1 |
1 | 640 | 3.19 | 4 |
0 | 520 | 2.93 | 4 |
1 | 760 | 3.00 | 2 |
For STATA data files you can use read_dta(). We use exactly the same dataset but store in .dta file.
'https://github.com/guru99-edu/R-Programming/blob/master/binary.dta?raw=true'
PATH_stata <- read_dta(PATH_stata)
df <-head(df)
admit | gre | gpa | rank |
---|---|---|---|
0 | 380 | 3.61 | 3 |
1 | 660 | 3.67 | 3 |
1 | 800 | 4.00 | 1 |
1 | 640 | 3.19 | 4 |
0 | 520 | 2.93 | 4 |
1 | 760 | 3.00 | 2 |
We use the read_sav()function to open a SPSS file. The file extension “.sav”
'https://github.com/guru99-edu/R-Programming/blob/master/binary.sav?raw=true'
PATH_spss <- read_sav(PATH_spss)
df <-head(df)
admit | gre | gpa | rank |
---|---|---|---|
0 | 380 | 3.61 | 3 |
1 | 660 | 3.67 | 3 |
1 | 800 | 4.00 | 1 |
1 | 640 | 3.19 | 4 |
0 | 520 | 2.93 | 4 |
1 | 760 | 3.00 | 2 |
When we want to import data into R, it is useful to implement following checklist. It will make it easy to import data correctly into R:
Following table summarizes the function to use in order to import different types of file in R. The column one states the library related to the function. The last column refers to the default argument.
Library | Objective | Function | Default.Argument |
---|---|---|---|
utils | Read CSV file | read.csv() | file, header = , TRUE, sep = “,” |
readxl | Read EXCEL file | read_excel() | path, range = NULL, col_names = TRUE |
haven | Read SAS file | read_sas() | path |
Read STATA file | read_stata() | path | |
Read SPSS file | read_sav() | path |
Following table shows the different ways to import a selection with read_excel() function.
Function | Objective | Arguments |
---|---|---|
read_excel() | Read n number of rows | n_max = 10 |
Select rows and columns like in excel | range = “A1:D10” | |
Select rows with indexes | range = cell_rows(1:3) | |
Select columns with letters | range = cell_cols(“A:C”) |
A work by Gianluca Sottile
gianluca.sottile@unipa.it