Continuous learning for objective evaluation of synthetic speech

Supervisor(s): Meysam SHAMSI
Host laboratory: LIUM
Location: Le Mans
Contact: Meysam.Shamsi(at)
Application: Send CV + cover letter to Meysam Shamsi before November 18, 2022.

Context and objectives:

The main objective of a text-to-speech (TTS) system or a voice conversion system is to generate a high-quality speech signal. The quality of synthetic speech is usually evaluated subjectively by human listeners, in listening tests that assess how human-like (as opposed to machine-like) the speech sounds. One of the most popular evaluation methods is the Mean Opinion Score (MOS): each listener rates the quality of a speech signal on a scale, often from 1 to 5, and the ratings are averaged. This kind of evaluation is expensive and time-consuming. It is also inherently subjective, and its results can vary with the number of evaluators. Recently, thanks to advances in neural networks, researchers have become interested in evaluating synthetic speech with an automatic measure.
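As a small illustration of why MOS depends on the number of evaluators, the sketch below computes a MOS together with an approximate 95% confidence interval. The function name, the 1 to 5 scale check, and the normal-approximation interval are illustrative assumptions, not part of any challenge protocol.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a list of listener scores on an assumed 1-5 scale.
    The interval uses a normal approximation (1.96 * standard error),
    which is only a rough guide when there are few listeners.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie between 1 and 5")
    mos = mean(ratings)
    if len(ratings) < 2:
        # A single rating gives no estimate of the spread.
        return mos, float("nan")
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthetic utterance.
mos, ci = mean_opinion_score([4, 5, 3, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With the same spread of opinions, the interval narrows as more listeners are added, which is one reason reliable listening tests are costly.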

The VoiceMOS Challenge [1] was one step towards automating speech quality evaluation. Its organizers collected human ratings of synthetic speech from previous Blizzard Challenges [2] and Voice Conversion Challenges [3], and provided baseline models [4,5] for automatic quality evaluation of synthetic speech.

Beyond out-of-domain evaluation [1], which aims to adapt a model to other domains such as synthetic speech in a new language, the evolution of TTS systems is itself changing the problem of quality evaluation. The quality of synthetic speech has improved considerably over the last decade [6], so evaluating synthetic speech today is a different problem from what it was in the past. For example, whereas the priority of speech synthesis used to be intelligibility, the focus today is more on expressiveness. The goal of this internship is to investigate whether lifelong learning [7], or continuous learning, is appropriate for the automatic evaluation of synthetic speech. In a continuous learning approach, the model should be able to adapt to new data by processing samples in chronological order. The internship will focus on developing a model that can be trained in this way, using the date of each system in the provided dataset.

The main application of this system is to reduce the cost of human listening tests when evaluating the quality of speech synthesis systems, and to provide an evaluation metric that can adapt over time. Moreover, the results of this work could ultimately be used to improve the quality of TTS or voice conversion systems.


Applicant profile: Candidate motivated by artificial intelligence, enrolled in a Master’s degree in Computer Science or a related field.

[1]. Huang, W.-C., Cooper, E., Tsao, Y., Wang, H.-M., Toda, T., Yamagishi, J. “The VoiceMOS Challenge 2022.” Proc. Interspeech 2022, 2022, pp. 4536-4540.
[2]. Wu, Z., Xie, Z., King, S. “The Blizzard Challenge 2019.” 2019.
[3]. Yi, Z., Huang, W.-C., Tian, X., Yamagishi, J., Das, R. K., Kinnunen, T., Ling, Z., Toda, T. “Voice Conversion Challenge 2020: intra-lingual semi-parallel and cross-lingual voice conversion.” Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80-98.
[4]. Cooper, E., Huang, W.-C., Toda, T., Yamagishi, J. “Generalization ability of MOS prediction networks.” Proc. ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8442-8446.
[6]. Cooper, E., Yamagishi, J. “How do voices from past speech synthesis challenges compare today?” arXiv preprint arXiv:2105.02373, 2021.
[7]. Chen, Z., Liu, B. “Lifelong machine learning.” Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3), 2018, pp. 1-207.