Seminars on active learning – Laboratoire d'Informatique de l'Université du Mans

Seminar by Meysam Shamsi (lecturer), with Natacha Miniconi and Matthieu François (PhD students)

Date: 8 December 2025
Time: 09:00
Place: IC2, Salle du Conseil
Speakers: Meysam Shamsi, Natacha Miniconi, Matthieu François

Active Machine Learning: Strategies for Efficient Data Selection, Annotation, and Model Improvement (by Meysam Shamsi):

This presentation introduces the core principles of active machine learning, where the model actively participates in data selection rather than passively consuming a fixed dataset. It highlights the limitations of traditional supervised learning, especially in large-scale, real-world applications where annotation is costly and impractical. Three key sampling strategies are covered: diversity (exploring underrepresented areas of the data space), uncertainty (exploiting the model’s confusion near decision boundaries), and error minimization (teaching the model to recognize what it does not know). They are explained through practical methods such as clustering, outlier detection, perturbation-based uncertainty, committees of models, and auxiliary familiarity or correctness predictors.

The presentation also discusses annotation quality, efficient annotation interfaces, semi-supervised labeling, and challenges in updating models with new data. Finally, it emphasizes evaluation protocols and the iterative nature of active learning, demonstrating how smarter sample selection can drastically reduce annotation costs while improving model performance across changing domains.
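As a rough illustration of the uncertainty and committee strategies named in this abstract, the sketch below ranks unlabeled samples by predictive entropy and measures committee disagreement by vote entropy. It is not taken from the talk: the function names and the toy probabilities are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Return the indices of the k pool samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


def vote_entropy(votes):
    """Committee disagreement: entropy of the distribution of predicted labels."""
    counts = Counter(votes)
    total = len(votes)
    return entropy([c / total for c in counts.values()])


# Hypothetical class probabilities for four unlabeled samples (illustrative only).
pool_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # close to the decision boundary
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain
]
queried = select_most_uncertain(pool_probs, k=2)

# Hypothetical labels predicted by a three-model committee for one sample.
disagreement = vote_entropy(["positive", "negative", "positive"])
```

In an actual loop, the queried samples would be sent to annotators, added to the training set, and the model retrained before the next selection round.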

Active Learning for Speech Synthesis Quality Prediction (by Natacha Miniconi):

This work explores reducing human effort in speech synthesis evaluation by combining active learning with automatic proxy metrics. In the first study, uncertainty-based (MC-Dropout, adversarial noise) and diversity-driven selection strategies were tested to identify the most informative samples for mean opinion score (MOS) annotation. These methods enable smarter data querying, prioritizing samples where the predictor is uncertain or where acoustic characteristics deviate from previously annotated material. Experiments demonstrate that such targeted selection accelerates model adaptation, improves MOS prediction across languages and domains, and reduces the quantity of labeled data required compared to random sampling.
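One common way to turn MC-Dropout into a selection score is to run several forward passes with dropout left active and use the spread of the predictions as the uncertainty. The sketch below assumes those predictions are already computed. The utterance IDs, the values, and the use of variance as the score are illustrative assumptions, not details from the study.

```python
from statistics import pvariance


def mc_dropout_uncertainty(passes):
    """Variance of the predictions from repeated stochastic forward passes."""
    return pvariance(passes)


def select_for_annotation(mc_predictions, k):
    """Pick the k samples whose stochastic predictions disagree the most."""
    scores = {
        sample_id: mc_dropout_uncertainty(passes)
        for sample_id, passes in mc_predictions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical MOS predictions (1 to 5 scale) from four dropout-enabled passes
# per utterance; in practice these would come from the MOS predictor itself.
mc_predictions = {
    "utt_a": [3.9, 4.0, 4.1, 4.0],
    "utt_b": [2.0, 3.5, 4.2, 2.8],
    "utt_c": [3.0, 3.2, 2.9, 3.1],
}
to_annotate = select_for_annotation(mc_predictions, k=2)
```

Random sampling, the baseline mentioned above, would instead draw the same number of utterances uniformly from the pool.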

The second study investigates alternatives to MOS, using deepfake detection scores and phonetic formant-based metrics. Results show that deepfake classifier scores correlate with MOS and enable scalable ranking of text-to-speech (TTS) quality in low-resource contexts, while vowel-space measures provide interpretable diagnostic cues on synthesis quality. Overall, the approach moves toward more automated and efficient TTS quality assessment, reducing dependency on subjective human labeling.
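Since the proxy scores are used to rank systems, one natural way to check their agreement with MOS is a rank correlation. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman and the per-system numbers are assumptions for illustration, since the abstract does not say which correlation measure was used.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-system values (illustrative only): a detector's
# "sounds genuine" score and the human MOS for four TTS systems.
detector_scores = [0.91, 0.40, 0.75, 0.62]
human_mos = [4.3, 2.1, 3.8, 3.9]
rho = spearman(detector_scores, human_mos)
```

A coefficient close to 1 would mean the detector orders the systems almost as listeners do, which is the property needed for ranking without new listening tests.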

Using Active Learning for the Study of Online Environmental Controversies (by Matthieu François):

Social networks are the main forum for online discussion, where speech is ‘free’ and spontaneous. Supervised and unsupervised learning methods have been widely used to study them. Today, large language models (LLMs) are being explored in numerous studies on automatic annotation and, in particular, on their application in the social sciences. However, several articles have highlighted their limitations and the risks associated with overly broad use. The trade-off between these different approaches is therefore still being researched. In all cases, data annotation remains necessary to adapt and evaluate a model.

Through a real-world case of social media classification carried out in collaboration with social science specialists, this presentation will demonstrate a comprehensive approach to building a corpus and a multi-label text classification model from a large set of unlabelled data. This work involved expert annotators and a comparison of different active learning strategies for data annotation. Classification is studied using BERT-type classifiers and LLMs. Our work highlights the practical challenges of carrying out complex multi-label classification projects with an interdisciplinary team.