Predictive Modeling of Subjective Disagreement in Speech Annotation/Evaluation

Encadrants: Meysam Shamsi et Anthony Larcher
Equipe d’accueil : LIUM (Laboratoire d’Informatique de l’Université du Mans). Le stage se déroulera en présentiel.
Lieu : Le Mans Université
Début du stage : Février 2024
Contact : Meysam Shamsi et Anthony Larcher, (pré

Candidature : Envoyer votre CV, une lettre de motivation adaptée au sujet proposé, vos relevés de notes des deux dernières années d’études, possibilité de joindre un avis ou des lettres de recommandations. à l’ensemble des encadrantes, avant le 10 janvier 2024


Sujet : In the context of modeling subjective tasks, where diverse opinions, perceptions, and judgments exist among individuals, such as in speech quality or speech emotion recognition, addressing the challenge of defining ground truth and annotating a training set becomes crucial.

The current practice of aggregating all annotations into a single label for modeling a subjective task is neither fair nor efficient [1]. The variability in annotations or evaluations can stem from various factors [2], broadly categorized into those associated with corpus quality and those intrinsic to the samples themselves.

In the first case, the delicate definition of a subjective task introduces sensitivity into the annotation process, potentially leading to more errors, especially where the annotation tools and platform lack precision or annotators experience fatigue. In the second case, the inherent ambiguity in defining a subjective task and different perception may result in varying annotations and disagreements. Developing a predictive model to understand annotator/evaluator disagreement is crucial for engaging in discussions related to ambiguous samples and refining the definition of subjective concepts. Furthermore, this model can serve as a valuable tool for assessing the confidence of automatic evaluations [3,4].

This modeling approach will contribute to the automatic evaluation of corpus annotations, identification of ambiguous samples for reconsideration or re-annotation, automatic assessment of subjective models, and the detection of underrepresented samples and biases in the dataset.

The proposed research involves utilizing a speech dataset such as MS-Podcast [5], SOMOS [6], VoiceMOS [7], for a subjective task with multiple annotations per sample. The primary objective is to predict the variation in assigned labels, measured through disagreement scores, entropy, or distribution.


Profil de candidature souhaité : Personne motivée par l’intelligence artificielle, inscrite en master en informatique ou dans un domaine connexe.



[1] Davani, A. M., Díaz, M., & Prabhakaran, V. (2022). Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10, 92-110.

[2] Kreiman, J., Gerratt, B. R., & Ito, M. (2007). When and why listeners disagree in voice quality assessment tasks. The Journal of the Acoustical Society of America, 122(4), 2354-2364.

[3] Wu, W., Chen, W., Zhang, C., & Woodland, P. C. (2023). It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation. arXiv preprint arXiv:2310.00486

[4] Han, J., Zhang, Z., Schmitt, M., Pantic, M., & Schuller, B. (2017, October). From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty. In Proceedings of the 25th ACM international conference on Multimedia (pp. 890-897).

[5] Lotfian, R., & Busso, C. (2017). Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4), 471-483.

[6] Maniati, G., Vioni, A., Ellinas, N., Nikitaras, K., Klapsas, K., Sung, J.S., Jho, G., Chalamandaris, A., Tsiakoulis, P. (2022) SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis. Proc. Interspeech 2022, 2388-2392.

[7] Cooper, E., Huang, W. C., Tsao, Y., Wang, H. M., Toda, T., & Yamagishi, J. (2023). The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains. arXiv preprint arXiv:2310.02640.