Neural representations disentangled by prosody and application to speech synthesis

Level: Master 1
Supervisors: Marie Tahon & Théo Mariotte (LIUM)
Host Laboratory: Laboratoire d’Informatique de l’Université du Mans (LIUM)
Location: Le Mans
Start of internship: April 2026 onwards
Contact: Marie Tahon or Théo Mariotte (firstname.name@univ-lemans.fr)
Application: Send a CV and a cover letter relevant to the proposed subject to Marie Tahon and Théo Mariotte before February 28, 2026.

Context and objectives

Text-to-speech synthesis involves converting a sequence of characters provided by the user into an intelligible audio signal corresponding to a voice. Most current synthesis systems are based on neural networks, such as KNN-TTS [1]. These models are capable of encoding linguistic and prosodic information as well as the speaker’s timbre.

The KNN-TTS approach has the advantage of decoupling linguistic characteristics (generated from the text) from the speaker’s timbre (generated from an audio sample of a voice), using a control parameter 𝜆. Prosody is defined as a set of acoustic properties of vocal expressions [2]. The acoustic properties generally used are intonation (or fundamental frequency curve – F0), sound intensity, and rhythm. They appear both in characteristics related to the speaker (speech rate, voice pitch, etc.) and in those related to the text itself (pauses, syntactic structure, emphasis, etc.).
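As an illustration of how a control parameter such as 𝜆 could act, here is a minimal sketch. It assumes that 𝜆 linearly interpolates between frame-level features predicted from the text and the mean of their k nearest neighbours in a bank of features extracted from the target speaker's audio, with cosine similarity as the matching criterion. The function name `knn_interpolate`, the feature dimensions and the random data are hypothetical and are not taken from the KNN-TTS code.

```python
import numpy as np

def knn_interpolate(source, target_bank, k=4, lam=1.0):
    """Blend each source frame with the mean of its k nearest target frames.

    source:      (T_src, D) frame-level features predicted from text (assumed)
    target_bank: (T_bank, D) features extracted from the target speaker (assumed)
    lam:         0.0 keeps the source features, 1.0 keeps the retrieved ones
    """
    # Cosine similarity between every source frame and every bank frame.
    src_n = source / np.linalg.norm(source, axis=1, keepdims=True)
    bank_n = target_bank / np.linalg.norm(target_bank, axis=1, keepdims=True)
    sim = src_n @ bank_n.T                       # (T_src, T_bank)

    # Indices of the k most similar bank frames for each source frame.
    idx = np.argsort(-sim, axis=1)[:, :k]        # (T_src, k)
    selected = target_bank[idx].mean(axis=1)     # (T_src, D)

    # Linear interpolation controlled by lam.
    return lam * selected + (1.0 - lam) * source

# Toy data standing in for real features.
rng = np.random.default_rng(0)
source = rng.normal(size=(5, 8))
target_bank = rng.normal(size=(20, 8))

out_source_only = knn_interpolate(source, target_bank, lam=0.0)
out_target_only = knn_interpolate(source, target_bank, k=1, lam=1.0)
```

With `lam=0.0` the output equals the source features, and with `lam=1.0` and `k=1` every output frame is copied from the target bank, which makes the role of the parameter easy to check on toy data.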

The objective of the internship will therefore be to learn linguistic and speaker representations that are disentangled with respect to prosody. More specifically, the intern will train autoencoders in which certain parts of the latent space are disentangled using the predefined prototype method [3]. Recent work has shown that sparsity allows for better disentanglement of features, so the autoencoder will be a simple SAE (Sparse Autoencoder), trained with a top-k loss [4, 5] and a prototype loss [3].
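The sketch below shows one possible forward pass for such an autoencoder, under stated assumptions. Sparsity is enforced by keeping only the k largest latent activations per frame, and the prototype term is assumed to be a squared distance between a few reserved latent dimensions and a fixed prototype vector chosen by a prosody class label (for example an F0 bin). The exact formulation in [3], [4] and [5] may differ. All names, shapes and the random weights are hypothetical.

```python
import numpy as np

def topk_activation(z, k):
    """Keep the k largest entries of each row of z and zero the others."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k, axis=1)[:, -k:]   # column indices of the top k
    rows = np.arange(z.shape[0])[:, None]
    out[rows, idx] = z[rows, idx]
    return out

def sae_losses(x, W_enc, b_enc, W_dec, b_dec, k, prototypes, labels, n_proto_dims):
    """Return the sparse code, reconstruction loss and prototype loss."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)         # dense non-negative code
    h = topk_activation(z, k)                      # sparse code (at most k active units)
    x_hat = h @ W_dec + b_dec                      # reconstruction
    recon = np.mean((x - x_hat) ** 2)
    # Assumed prototype term: pull the reserved dimensions of the dense code
    # towards the predefined prototype of each frame's class.
    proto = np.mean((z[:, :n_proto_dims] - prototypes[labels]) ** 2)
    return h, recon, proto

# Toy setup: 10 frames, 16-dim inputs, 64 latent units, 8 active units.
rng = np.random.default_rng(0)
d_in, d_lat, k, n_proto_dims, n_classes = 16, 64, 8, 4, 3
x = rng.normal(size=(10, d_in))
W_enc = rng.normal(scale=0.1, size=(d_in, d_lat))
b_enc = np.zeros(d_lat)
W_dec = rng.normal(scale=0.1, size=(d_lat, d_in))
b_dec = np.zeros(d_in)
prototypes = rng.normal(size=(n_classes, n_proto_dims))   # fixed, not learned
labels = rng.integers(0, n_classes, size=10)

h, recon, proto = sae_losses(x, W_enc, b_enc, W_dec, b_dec, k,
                             prototypes, labels, n_proto_dims)
total = recon + 0.1 * proto    # the 0.1 weight is an arbitrary illustration
```

In a real training loop the same two terms would be computed in an automatic differentiation framework and minimised jointly. The relative weight between them is a hyperparameter to be tuned.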

During the internship, the tasks to be performed will be as follows:

  1. Set up a baseline synthesis system. We will use KNN-TTS trained on a French corpus (SIWIS [6] or Blizzard [7]), and we will objectively evaluate the generated signals for several speakers using TTS4ALL [8].
  2. Train a sparse autoencoder with a prototype loss on the linguistic and speaker representations. The prototypes will initially be based on F0 and energy. The SAE will be evaluated on its ability to accurately disentangle prosody.
  3. Integrate the SAE into the synthesis system and evaluate the degradation of the quality of the output audio signals.
  4. Also evaluate the possibility of intervention, i.e., manual modification of F0 or energy and its impact on the generated signal.
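The prosodic descriptors mentioned in the tasks above can be illustrated with a small sketch. It assumes a simple autocorrelation-based F0 estimate and an RMS energy measure on a single frame, and it models a manual intervention as a shift of the F0 contour by a number of semitones. A dedicated pitch tracker would normally be used in practice. The function names and the synthetic 200 Hz test tone are hypothetical.

```python
import numpy as np

def frame_f0(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]   # lags 0 .. n-1
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), n - 1)
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return sr / lag

def frame_rms(frame):
    """Root-mean-square energy of one frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def shift_f0(f0_contour, semitones):
    """Shift a F0 contour by a number of semitones; 0 marks unvoiced frames."""
    f0_contour = np.asarray(f0_contour, dtype=float)
    return np.where(f0_contour > 0, f0_contour * 2.0 ** (semitones / 12.0), 0.0)

# Synthetic 50 ms frame: a 200 Hz sine with amplitude 0.5, sampled at 16 kHz.
sr = 16000
t = np.arange(800) / sr
frame = 0.5 * np.sin(2 * np.pi * 200.0 * t)

f0 = frame_f0(frame, sr)
rms = frame_rms(frame)

# Intervention example: raise a short contour by one octave (12 semitones).
contour = np.array([0.0, 200.0, 210.0, 0.0])
raised = shift_f0(contour, 12)
```

Frame-level values of this kind could serve both to define the F0 and energy classes behind the prototypes and to quantify how a manual change propagates to the generated signal.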

 

Laboratory and supervisory team

The internship will be hosted at LIUM (Laboratoire d’Informatique de l’Université du Mans), where the intern will have full access to the laboratory’s computational infrastructure.

Candidate profile

Master’s degree in Computer Science. The candidate must demonstrate a keen interest in natural language processing.

References

[1]. K. El Hajal, A. Kulkarni, E. Hermann, and M. Magimai Doss, “kNN retrieval for simple and effective zero-shot multi-speaker text-to-speech,” in Proc. NAACL, 2025, pp. 778–786.
[2]. Larrouy-Maestri, P., Poeppel, D., & Pell, M. D. (2024). The Sound of Emotional Prosody: Nearly 3 Decades of Research and Future Directions. Perspectives on Psychological Science, 20(4), 623-638. https://doi.org/10.1177/17456916231217722.
[3]. Almudévar, A., Mariotte, T., Ortega, A., Tahon, M., Vicente, L., Miguel, A., Lleida, E. (2024) Predefined Prototypes for Intra-Class Separation and Disentanglement. Proc. Interspeech 2024, 3809-3813, doi: 10.21437/Interspeech.2024-825
[4]. Félix Saget, Nicolas Dugué, Marie Tahon, Anthony Larcher. Functionally-grounded evaluation of dimensional interpretability in sparse speaker representations. 2025. ⟨hal-05302071⟩
[5]. Mariotte, T., Lebourdais, M., Almudévar, A., Tahon, M., Ortega, A., & Dugué, N. (2025). Sparse Autoencoders Make Audio Foundation Models more Explainable. Proceedings of ICASSP, 2026. https://arxiv.org/abs/2509.24793
[6]. Jean-Philippe Goldman, Pierre-Edouard Honnet, Rob Clark, Philip N. Garner, Maria Ivanova, Alexandros Lazaridis, Hui Liang, Tiago Macedo, Beat Pfister, Manuel Sam Ribeiro, Eric Wehrli, and Junichi Yamagishi. The SIWIS database: a multilingual speech database with acted emphasis. In Proceedings of Interspeech, pages 1532–1535, San Francisco, CA, USA, September 2016.
[7]. Perrotin, O., Stephenson, B., Gerber, S., Bailly, G. (2023) The Blizzard Challenge 2023. Proc. 18th Blizzard Challenge Workshop, 1-27, doi: 10.21437/Blizzard.2023-1
[8]. https://git-lium.univ-lemans.fr/jsalt2025/wp1/tts4all_eval