This 9 ECTS (approximately 72 contact hours) course is designed for students with a basic background in descriptive and inferential statistics, including familiarity with concepts such as random variables, probability distributions, confidence intervals, and hypothesis testing. A minimal level of computer literacy on Windows, macOS, or Linux is required, including the ability to manage files and folders, install software, and navigate standard graphical user interfaces.
No prior experience with R is strictly required; however, previous exposure to any programming language (e.g., Python, MATLAB, C, or Java) is beneficial and will facilitate faster progression through the programming components of the course. Participants are expected to bring their own device (laptop) with administrative rights, on which they can install R, RStudio (or an equivalent IDE), and any additional packages used during the lectures and labs.
Optional deep learning track prerequisites: a working local configuration of keras3 and tensorflow for R is recommended to run the deep learning lessons end‑to‑end (including model training), for example by using the install_keras() helper when appropriate.
This course offers a structured and comprehensive introduction to the R programming language as a modern environment for data analysis, statistical computing, and reproducible research. Starting from the fundamentals of R syntax and core data structures, the course progressively covers user‑defined functions, control structures, and vectorized programming patterns that are essential for writing clear, efficient, and reusable code.
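As a small illustration of the vectorized programming patterns mentioned above, the sketch below contrasts an explicit loop with its vectorized equivalent. The data and the function names (`to_fahrenheit_loop`, `to_fahrenheit_vec`) are hypothetical and are not taken from the course materials.

```r
# Hypothetical example data: temperatures in degrees Celsius.
celsius <- c(-5, 0, 12.5, 30)

# Loop-based version: preallocate the result, then fill it element by element.
to_fahrenheit_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i] * 9 / 5 + 32
  }
  out
}

# Vectorized version: arithmetic operators act on the whole vector at once.
to_fahrenheit_vec <- function(x) x * 9 / 5 + 32

loop_result <- to_fahrenheit_loop(celsius)
vec_result  <- to_fahrenheit_vec(celsius)

# Both approaches should give the same values.
all.equal(loop_result, vec_result)

# sapply() applies a function to each element and simplifies the result
# to a character vector.
labels <- sapply(celsius, function(t) if (t <= 0) "freezing" else "above zero")
```

The vectorized form is shorter and usually faster, which is why it tends to be preferred once the loop-based logic is understood.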
A substantial part of the course is devoted to data management and transformation, including importing and exporting data from multiple formats, handling missing values, reshaping tables, and preparing analysis‑ready data sets. Students will then learn how to perform exploratory data analysis and produce informative visualizations to summarize univariate and multivariate patterns, using R’s base graphics and, where appropriate, modern plotting tools.
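A minimal sketch of such a data preparation workflow in base R might look as follows. The data set, the column names, and the choice of mean imputation are assumptions made purely for illustration; the course may rely on other packages or strategies.

```r
# Hypothetical raw data with one missing value, written to a temporary CSV file.
raw <- data.frame(
  id     = 1:4,
  group  = c("a", "b", "a", "b"),
  score1 = c(10, NA, 14, 9),
  score2 = c(11, 12, 15, 8)
)
path <- tempfile(fileext = ".csv")
write.csv(raw, path, row.names = FALSE)

# Import: read the file back; empty fields and "NA" are treated as missing.
dat <- read.csv(path, na.strings = c("", "NA"))

# Missing values: count them per column, then impute score1 with its mean
# (a simple strategy chosen here only for illustration).
colSums(is.na(dat))
dat$score1[is.na(dat$score1)] <- mean(dat$score1, na.rm = TRUE)

# Reshape from wide (one column per score) to long (one row per measurement).
long <- reshape(
  dat,
  direction = "long",
  varying   = c("score1", "score2"),
  v.names   = "score",
  timevar   = "occasion",
  times     = 1:2,
  idvar     = "id"
)
```

The long table has one row per subject and occasion, a layout that many modelling and plotting functions expect.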
In the final portion, the course introduces the foundations of machine learning with R, with a clear distinction between supervised and unsupervised approaches. Supervised learning topics include regression and classification models, while unsupervised learning covers dimensionality reduction techniques and clustering methods for exploratory pattern discovery. Throughout, emphasis is placed on good programming practice, critical assessment of model outputs, and reproducible workflows suitable for academic and applied research.
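To make the distinction between the two families concrete, the following sketch fits two supervised models and two unsupervised ones using only base R and the built-in `mtcars` and `iris` data sets. The choice of variables, the 0.5 classification threshold, and the number of clusters are illustrative assumptions rather than the course's prescribed settings.

```r
# Supervised, regression: predict fuel consumption from weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
r2 <- summary(fit_lm)$r.squared

# Supervised, classification: logistic regression for transmission type (0/1).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
pred_class <- as.integer(predict(fit_glm, type = "response") > 0.5)
# In-sample accuracy is optimistic; a held-out set would be needed in practice.
accuracy <- mean(pred_class == mtcars$am)

# Unsupervised, dimensionality reduction: PCA on standardized measurements.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Unsupervised, clustering: k-means with k = 3 on the same standardized variables.
set.seed(1)  # k-means uses random starts, so fix the seed for reproducibility
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# Compare clusters with the known species labels (not used during clustering).
table(cluster = km$cluster, species = iris$Species)
```

The supervised models use a response variable (`mpg`, `am`), while PCA and k-means see only the predictors; the species labels are used afterwards solely to interpret the clusters.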
An additional set of lessons provides an optional deep learning track implemented in modern R using Keras and TensorFlow, covering deep neural networks for structured (tabular) data, autoencoders for representation learning, sequence models (LSTM/GRU) for forecasting and text classification, and convolutional neural networks (CNNs) for image classification.
By the end of the course, students will:
Upon successful completion, students will be able to:
The course also aims to strengthen students’ ability to:
In addition, students will:
By the end of this course, students will be able to:
The following schedule is indicative and may be adapted to the specific calendar and pace of the class, while preserving the overall balance between foundational topics, data manipulation, exploratory analysis, and machine learning.
| Hours | Topics |
|---|---|
| 5 | Introduction to R and RStudio. Installation, projects, scripts, and basic workflow. Overview of R as a language for statistical computing and data science. |
| 7 | Core R objects and data types: numeric, character, logical, factors, vectors, matrices, arrays, lists, and data frames. Indexing, subsetting, and basic operations on these structures. |
| 8 | Programming fundamentals: functions, arguments, return values, environments. Control structures: if/else, for, while, repeat, and logical operators. Introduction to vectorization and the apply‑family of functions. |
| 10 | Data import and export: reading and writing CSV, text, Excel, and common statistical formats (e.g., SPSS). Data cleaning and preparation: handling missing values, recoding variables, filtering, merging, and reshaping data. |
| 10 | Exploratory data analysis: univariate and bivariate summaries, correlation and association measures. Visualization of distributions and relationships using histograms, boxplots, scatter plots, and other standard graphics. |
| 14 | Supervised learning: linear regression, model diagnostics, variable selection, and introduction to generalized linear models. Classification methods such as penalized regression, decision trees, random forests, gradient boosting, and support vector machines, with emphasis on model evaluation and overfitting control. |
| 10 | Unsupervised learning: dimensionality reduction techniques (e.g., principal component analysis) and clustering methods (e.g., k‑means, hierarchical, density‑based, model‑based, and spectral clustering). Interpretation of components and clusters, visualization, and practical considerations in unsupervised analysis. |
| 8 | Integrated case studies and project work: end‑to‑end analysis of real‑world data sets, from data ingestion to model building and reporting using reproducible R scripts and R Markdown. Review, discussion, and preparation for assessment. |
| Optional (6–10) | Deep learning extension: DNNs for tabular data, autoencoders, LSTM/GRU for sequences and text, CNNs for images; training/validation protocols, regularization, and practical considerations for reproducible experiments in R. |
A work by Gianluca Sottile
gianluca.sottile@unipa.it