Prof. Nicola Torelli

Chair: Rosanna Verde

09:30 - 10:30, Aula 9


Classification with imbalanced data and the (eternal?) struggle between statistics and data science


Abstract

Classification with imbalanced data is recognized among data science practitioners as a problem of primary relevance, since it arises in many real-world situations and domains. When the distribution of the dichotomous (or possibly polytomous) response variable is skewed, the classifiers typically used in statistical and machine learning can perform poorly. Several remedies have been proposed, and among the most popular and successful are methods that alter the class distribution, either by non-proportionally sampling the data or by generating new data. Analysing some of these solutions offers the opportunity to discuss how different scholars have addressed the issue. First, it has received much more attention from data scientists in the machine learning community than from statisticians, who have nonetheless made important contributions. Second, the more popular solutions and methods for dealing with imbalanced data are rooted in the different methodological approaches of the two communities. To illustrate these points, the ROSE algorithm, introduced in 2014 by Menardi and Torelli, will be reviewed along with some recent proposals for refining it, which also borrow some ideas from the machine learning community.

 

A work by Gianluca Sottile

(on behalf of the local organizing committee)