Evaluation of speech synthesis systems in a noisy environment


Supervisors: Aghilas Sini (LIUM), Thibault Vicente (LAUM)
Hosting labs: LIUM (Laboratoire d’Informatique de l’Université du Mans) and LAUM (Laboratoire d’Acoustique de l’Université du Mans)
Place: Le Mans Université
Beginning of internship: March 2024
Contacts: Aghilas Sini and Thibault Vicente (firstname.name@univ-lemans.fr)

Application: Send a CV, a cover letter relevant to the proposed subject, and your grades for the last two years of study to all the supervisors before February 20, 2024. Letters of recommendation may also be attached.

Context: Perceptual evaluation is crucial in many areas related to speech technology, including speech synthesis. It enables the quality of synthesis to be assessed subjectively by asking a panel [5] to rate the quality of a synthesised speech stimulus [1, 2]. Recent work has led to the development of an artificial intelligence model [3, 4] that predicts the subjective evaluation of a segment of synthesised speech, thereby removing the need for a jury test.

The major problem with this assessment is the interpretation of the word “quality”. Some listeners may base their judgement on the intrinsic characteristics of the speech (such as timbre, speech rate, punctuation, etc.), while others may base it on characteristics of the audio signal (such as the presence or absence of distortion). The subjective evaluation of speech may therefore be biased by each listener’s interpretation of the instruction. As a result, the artificial intelligence model mentioned above may be built on biased measurements.

The aim of the project is to carry out exploratory work to assess the quality of speech synthesis in a more robust way than has been proposed to date. To do this, we start from the hypothesis that the quality of speech synthesis can be estimated by means of its detection in a real environment. In other words, a synthesised signal that perfectly reproduces a human speech signal should go undetected in a real-life environment.

Based on this hypothesis, we propose to set up a speech perception experiment in a noisy environment. Some sound reproduction methods can simulate an existing environment over headphones. The advantage of these methods is that a recording of a real environment can be played over headphones while signals are added to it as if they had been present in the recorded sound scene.
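As an illustration only, adding a speech signal to a recorded scene could be sketched as follows. This is a minimal sketch, not the protocol of the project: it assumes a two-channel (binaural) recording of the scene, a mono synthesised speech signal, and a pair of binaural impulse responses measured or simulated for the desired source position. The function names (spatialise, mix_at_snr) and the use of a signal-to-noise ratio to set the speech level are assumptions made for this example.

```python
import numpy as np


def spatialise(speech, ir_left, ir_right):
    """Convolve mono speech with a left/right impulse response pair.

    Returns an array of shape (n_samples, 2). Both impulse responses are
    assumed (hypothetically) to have the same length.
    """
    left = np.convolve(speech, ir_left)
    right = np.convolve(speech, ir_right)
    return np.stack([left, right], axis=1)


def mix_at_snr(scene, speech_binaural, snr_db, offset=0):
    """Add binaural speech to a binaural scene at a target SNR (in dB).

    The SNR is computed over the portion of the scene that overlaps the
    speech, using the mean power across both channels.
    """
    n = speech_binaural.shape[0]
    if offset < 0 or offset + n > scene.shape[0]:
        raise ValueError("speech does not fit in the scene at this offset")
    segment = scene[offset:offset + n]
    noise_power = np.mean(segment ** 2)
    speech_power = np.mean(speech_binaural ** 2)
    if noise_power == 0 or speech_power == 0:
        raise ValueError("scene segment and speech must be non-silent")
    # Gain such that 10*log10(gain^2 * speech_power / noise_power) == snr_db.
    gain = np.sqrt(noise_power / speech_power * 10 ** (snr_db / 10))
    mixed = scene.astype(float)  # astype returns a copy; the input is untouched
    mixed[offset:offset + n] += gain * speech_binaural
    return mixed
```

Varying snr_db and the impulse response pair would give stimuli with different speech levels and apparent source positions for the same recorded scene.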

This involves an acoustic measurement campaign in noisy everyday environments (transport, open spaces, canteens, etc.). Next, synthesised speech will need to be generated, taking into account the context of the recordings. It will also be appropriate to vary the parameters of the synthesised speech while maintaining the same semantics. The recordings of everyday life will then be mixed with the synthesised speech signals to assess the latter’s detection. We will use the percentage of times that the synthesised speech is detected as an indicator of quality. These detection percentages will then be compared with the prediction of the artificial intelligence model mentioned above. In this way, we will be able to conclude (1) whether the methods are equivalent or complementary and (2) which parameter(s) of the synthesised speech result in its detection in a noisy environment.
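The quality indicator and the comparison described above could be computed along the following lines. This is a hedged sketch under stated assumptions: listener responses are assumed to be stored as (stimulus identifier, detected or not) pairs, the model predictions as one score per stimulus, and a Pearson correlation is used as one possible way of comparing the two measures. The function names and data layout are hypothetical, not part of the project.

```python
from collections import defaultdict

import numpy as np


def detection_rates(responses):
    """Percentage of trials in which each stimulus was detected.

    responses: iterable of (stimulus_id, detected) pairs, detected being a bool.
    Returns a dict mapping stimulus_id to a percentage between 0 and 100.
    """
    counts = defaultdict(lambda: [0, 0])  # stimulus_id -> [detections, trials]
    for stimulus_id, detected in responses:
        counts[stimulus_id][0] += int(detected)
        counts[stimulus_id][1] += 1
    return {s: 100.0 * d / n for s, (d, n) in counts.items()}


def correlation_with_predictions(rates, predicted_scores):
    """Pearson correlation between detection rates and predicted scores.

    Only stimuli present in both dicts are used. predicted_scores is assumed
    to hold one model prediction per stimulus (hypothetical data layout).
    """
    common = sorted(set(rates) & set(predicted_scores))
    if len(common) < 2:
        raise ValueError("need at least two stimuli in common")
    x = np.array([rates[s] for s in common])
    y = np.array([predicted_scores[s] for s in common])
    return float(np.corrcoef(x, y)[0, 1])
```

A strongly negative correlation would suggest that the two measures carry similar information, while a weak one would suggest that they are complementary. A rank correlation could be substituted if the relationship is not expected to be linear.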



Keywords: synthesised speech, binaural sound synthesis, jury testing



[1] Y.-Y. Chang. Evaluation of TTS systems in intelligibility and comprehension tasks. In Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011), pages 64–78, 2011.

[2] J. Chevelu, D. Lolive, S. Le Maguer, and D. Guennec. Se concentrer sur les différences: une méthode d’évaluation subjective efficace pour la comparaison de systèmes de synthèse (Focus on differences: a subjective evaluation method to efficiently compare TTS systems). In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 1: JEP, pages 137–145, 2016.

[3] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang. MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proc. Interspeech 2019, pages 1541–1545, 2019.

[4] S. Mittag and S. Möller. Deep learning based assessment of synthetic speech naturalness. arXiv preprint arXiv:2104.11673, 2021.

[5] M. Wester, C. Valentini-Botinhao, and G. E. Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In 16th Annu. Conf. Int. Speech Commun. Assoc., 2015.