C8. Advances in text mining

08:40 - 09:50, Aula 12

Chair: Mariangela Sciandra

Can Correspondence Analysis Challenge Transformers in Authorship Attribution Tasks?

Andrea Sciandra and Arjuna Tuzzi

Abstract: With reference to a large corpus of 76 Italian contemporary popular mystery novels by 16 different authors, this study aims to assess the performance of large language models in an authorship attribution test. The results obtained through both transformers and correspondence analysis vector representations are compared and contrasted in machine learning classification tasks. Although in previous works transformers have been shown to perform better than other alternatives, in this case, correspondence analysis wins the challenge. Results support the hypothesis that specialized large corpora require tailor-made representations.

A Fuzzy Topic Modeling approach to legal corpora

Antonio Calcagnì and Arjuna Tuzzi

Abstract: This study investigates the application of Fuzzy Latent Semantic Analysis (fLSA) in analyzing expenditure chapters within legal texts, using Italy’s budget law 178/2020 as a case study. Faced with challenges in legal studies, such as specialized language and heterogeneity, fLSA combines Latent Semantic Analysis (LSA) with dimensionality reduction and soft clustering. Results show comparable performance with the widely used Latent Dirichlet Allocation (LDA) in identifying coherent expenditure chapters, with fLSA showing a tendency to retrieve more distinctive and exclusive topics.

EmurStat: a digital tool for statistical analysis of emur flow

Simone Paesano, Maria Gabriella Grassia, Marina Marino, Dario Sacco and Rocco Mazza

Abstract: New Public Management (NPM) emphasizes the use of market-based techniques to improve efficiency and effectiveness in public service delivery. This approach seeks to promote accountability and performance measurement. Key performance indicators describe the performance of processes that characterize a specific workflow. One of the concepts that has emerged in the last decade is Precision Public Health, which integrates traditional determinants of health with new approaches such as data science and health economics. Moreover, visualizations help to understand social determinants of health and public health indicators. This paper aims to present a useful application for data visualization, processing, and analysis for understanding and evaluating the performance of services provided by emergency rooms, through an in-depth look at a specific case.

Graph Neural Networks for clustering medical documents

Vittorio Torri and Francesca Ieva

Abstract: Clustering is one of the most challenging tasks in the field of Natural Language Processing, due to the high dimensionality of textual data. Different types of document embeddings have been proposed in the past, often based on the transformer neural network architecture. In this work, we propose to exploit a graph-based representation, combining it with recent advances in the field of graph neural networks. While graph neural networks have achieved promising results in document classification, their potential for document clustering has not yet been explored. In particular, we propose an application in the medical domain, where document clustering is of paramount importance due to the large amount of information present in medical documents and the difficulties in labelling them.
