Read CSV

CSV (comma-separated values) is one of the most common data formats.

Base R: read.csv()

read.csv(file, header = TRUE, sep = ",")

Notes:

  • Since R 4.0, strings are not converted to factors by default, but older code often included stringsAsFactors = FALSE; keeping it can be useful for portability.
  • On Windows, paths in quotes need escaping (\\) or use forward slashes (/).
PATH <- "raw_data/mtcars.csv"

# Keep stringsAsFactors explicitly for compatibility across R versions
df <- read.csv(PATH, header = TRUE, sep = ",", stringsAsFactors = FALSE)

ncol(df)
## [1] 12
str(df)
## 'data.frame':    32 obs. of  12 variables:
##  $ X   : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : int  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : int  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : int  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: int  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: int  4 4 1 1 2 1 4 2 2 4 ...

Tidyverse alternative: readr::read_csv()

readr::read_csv(PATH)

Read Excel

The readxl package reads both .xls and .xlsx files.

library(readxl)

readxl_example()

readxl_example() lists example spreadsheets shipped with the package.

readxl_example()
##  [1] "clippy.xls"    "clippy.xlsx"   "datasets.xls"  "datasets.xlsx"
##  [5] "deaths.xls"    "deaths.xlsx"   "geometry.xls"  "geometry.xlsx"
##  [9] "type-me.xls"   "type-me.xlsx"
readxl_example("datasets.xlsx")
## [1] "/home/runner/work/_temp/Library/readxl/extdata/datasets.xlsx"

read_excel()

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, n_max = Inf, na = "")

Examples (sheet by name or index):

example <- readxl_example("datasets.xlsx")

df1 <- read_excel(example)                 # default first sheet
sheets <- excel_sheets(example)            # list sheets
sheets
## [1] "mtcars"   "chickwts" "quakes"
quakes_a <- read_excel(example, sheet = "quakes")
quakes_b <- read_excel(example, sheet = 3)

identical(quakes_a, quakes_b)
## [1] TRUE

Read only first rows

iris_head <- read_excel(example, n_max = 5, col_names = TRUE)
iris_head
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

Read a rectangular range (Excel notation)

range_ab <- read_excel(example, range = "A1:B5", col_names = TRUE)
dim(range_ab)
## [1] 4 2

Read specific rows/columns (cell_rows / cell_cols)

cell_rows() and cell_cols() give fine control over the imported rectangle.

rows_1_5 <- read_excel(example, range = cell_rows(1:5), col_names = TRUE)
dim(rows_1_5)
## [1]  4 11
cols_ab <- read_excel(example, range = cell_cols("A:B"))
dim(cols_ab)
## [1] 32  2

Treat a value as missing (na argument)

iris_na <- read_excel(example, na = "setosa")
sum(is.na(iris_na))
## [1] 0

Import SAS / Stata / SPSS

The haven package imports (and exports) common file formats from other statistical software, including SAS, Stata, and SPSS.

library(haven)

Read SAS

PATH_sas <- "raw_data/binary.sas7bdat"
df_sas <- read_sas(PATH_sas)
head(df_sas)
ADMIT GRE GPA RANK
0 380 3.61 3
1 660 3.67 3
1 800 4.00 1
1 640 3.19 4
0 520 2.93 4
1 760 3.00 2

Read Stata

PATH_stata <- "raw_data/binary.dta"
df_stata <- read_dta(PATH_stata)
head(df_stata)
admit gre gpa rank
0 380 3.61 3
1 660 3.67 3
1 800 4.00 1
1 640 3.19 4
0 520 2.93 4
1 760 3.00 2

Read SPSS

PATH_spss <- "raw_data/binary.sav"
df_spss <- read_sav(PATH_spss)
head(df_spss)
admit gre gpa rank
0 380 3.61 3
1 660 3.67 3
1 800 4.00 1
1 640 3.19 4
0 520 2.93 4
1 760 3.00 2

Tip: haven::read_spss() can also choose between SPSS formats based on file extension.

Best practices

  • Use the first row as column names (headers).
  • Avoid spaces and special characters in column names; prefer snake_case.
  • Use a consistent missing-value marker (e.g., blank cells or NA) and declare it via na = ... when importing.
  • Keep raw data immutable; do cleaning in separate scripts.

Summary

Library Objective Function Key_args
utils (base R) Read CSV read.csv() file, header=TRUE, sep=‘,’
readxl Read Excel read_excel() path, sheet, range, col_names, n_max, na
haven Read SAS read_sas() data_file
haven Read Stata read_dta() file
haven Read SPSS read_sav() file
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it