{"id":26758,"date":"2026-02-11T17:36:00","date_gmt":"2026-02-11T16:36:00","guid":{"rendered":"https:\/\/lium.univ-lemans.fr\/?p=26758"},"modified":"2026-02-11T17:41:59","modified_gmt":"2026-02-11T16:41:59","slug":"offre-stage-m1-representation-neuronales-demelees-par-la-prosodie-et-application-a-la-synthese-de-parole","status":"publish","type":"post","link":"https:\/\/lium.univ-lemans.fr\/en\/offre-stage-m1-representation-neuronales-demelees-par-la-prosodie-et-application-a-la-synthese-de-parole\/","title":{"rendered":"M1 Internship Offer: Neural representations disentangled by prosody and application to speech synthesis"},"content":{"rendered":"<div class=\"panel-grid\" id=\"pg-26758-0\" ><div class=\"panel-grid-core\"><div class=\"panel-grid-cell\" id=\"pgc-26758-0-0\" ><div class=\"panel-widget-style\" ><h2 style=\"color: #e5442d;\">Neural representations disentangled by prosody and application to speech synthesis<\/h2>\n<p><strong>Level<\/strong>: Master 1<br \/>\n<strong>Supervisors: <\/strong> Marie Tahon &#038; Th\u00e9o Mariotte (LIUM)<br \/>\n<strong>Host Laboratory:<\/strong> <a href=\"http:\/\/lium.univ-lemans.fr\/en\/\">Laboratoire d\u2019Informatique de l\u2019Universit\u00e9 du Mans<\/a> (LIUM)<br \/>\n<strong>Location:<\/strong> Le Mans<br \/>\n<strong>Beginning of internship:<\/strong> From April 2026<br \/>\n<strong>Contact: <\/strong>Marie Tahon or Th\u00e9o Mariotte (firstname.name@univ-lemans.fr)<br \/>\n<strong>Application: <\/strong> Send a CV and a cover letter relevant to the proposed subject to Marie Tahon and Th\u00e9o Mariotte <strong>before February 28, 2026<\/strong><\/p>\n<h3><strong>Context and objectives<\/strong><\/h3>\n<p>Text-to-speech synthesis involves converting a sequence of characters provided by the user into an intelligible audio signal corresponding to a voice. Most current synthesis systems are based on neural networks, such as KNN-TTS [1].
These models are capable of encoding linguistic and prosodic information as well as the speaker&#8217;s timbre.<\/p>\n<p>The KNN-TTS approach has the advantage of decoupling linguistic characteristics (generated from text) on the one hand and the speaker&#8217;s timbre (generated from an audio sample corresponding to a voice) on the other, using a control parameter \ud835\udf06. Prosody is defined as a set of acoustic properties of vocal expressions [2]. The acoustic properties generally used are intonation (or fundamental frequency curve \u2013 F0), sound intensity, and rhythm. They are found both in characteristics related to the speaker (speech rate, voice pitch, etc.) and in those related to the text itself (pauses, syntax tracking, emphasis, etc.).<\/p>\n<p>The objective of the internship is therefore to learn linguistic and speaker representations that are disentangled with respect to prosody. More specifically, the intern will train autoencoders that enable certain parts of the latent space to be disentangled using the predefined prototype method [3]. Recent work has shown that sparsity allows for better disentanglement of features, so the autoencoder will be a simple SAE (Sparse Autoencoder), trained with a top-k loss [4, 5] and a prototype loss [3].<\/p>\n<p>During the internship, the tasks to be performed are as follows:<\/p>\n<ol>\n<li>    Set up a baseline synthesis system. We will use KNN-TTS trained on a French corpus (SIWIS [6] or Blizzard [7]), and we will objectively evaluate the generated signals using TTS4ALL [8] for several speakers.<\/li>\n<li>    Train a sparse autoencoder, including a prototype loss, on the linguistic and speaker outputs. These prototypes will initially be based on F0 and energy.
This SAE will be evaluated on its ability to accurately disentangle prosody.<\/li>\n<li>    Integrate the SAE into the synthesis system and evaluate any degradation in the quality of the output audio signals.<\/li>\n<li>    Also evaluate the possibility of intervention, i.e., manual modification of F0 or energy and its impact on the generated signal.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<h3><strong>Laboratory and supervisory team<\/strong><\/h3>\n<p>The internship will be hosted at LIUM (Laboratoire d\u2019Informatique de l\u2019Universit\u00e9 du Mans), where the intern will have full access to the laboratory\u2019s computational infrastructure.<\/p>\n<h3><strong>Candidate profile<\/strong><\/h3>\n<p>Master&#8217;s degree in Computer Science. The candidate must demonstrate a keen interest in natural language processing.<\/p>\n<h3><strong>References<\/strong><\/h3>\n<p>[1]. K. E. Hajal, A. Kulkarni, E. Hermann, and M. Magimai Doss, \u201ckNN retrieval for simple and effective zero-shot multi-speaker text-to-speech,\u201d in Proc. NAACL, 2025, pp. 778\u2013786.<br \/>\n[2]. Larrouy-Maestri, P., Poeppel, D., &#038; Pell, M. D. (2024). The Sound of Emotional Prosody: Nearly 3 Decades of Research and Future Directions. Perspectives on Psychological Science, 20(4), 623-638. https:\/\/doi.org\/10.1177\/17456916231217722.<br \/>\n[3]. Almud\u00e9var, A., Mariotte, T., Ortega, A., Tahon, M., Vicente, L., Miguel, A., Lleida, E. (2024) Predefined Prototypes for Intra-Class Separation and Disentanglement. Proc. Interspeech 2024, 3809-3813, doi: 10.21437\/Interspeech.2024-825<br \/>\n[4]. F\u00e9lix Saget, Nicolas Dugu\u00e9, Marie Tahon, Anthony Larcher. Functionally-grounded evaluation of dimensional interpretability in sparse speaker representations. 2025. \u27e8hal-05302071\u27e9<br \/>\n[5]. Mariotte, T., Lebourdais, M., Almud\u00e9var, A., Tahon, M., Ortega, A., &#038; Dugu\u00e9, N. (2025). Sparse Autoencoders Make Audio Foundation Models more Explainable.
Proceedings of ICASSP, 2026. https:\/\/arxiv.org\/abs\/2509.24793<br \/>\n[6]. Jean-Philippe Goldman, Pierre-Edouard Honnet, Rob Clark, Philip N. Garner, Maria Ivanova, Alexandros Lazaridis, Hui Liang, Tiago Macedo, Beat Pfister, Manuel Sam Ribeiro, Eric Wehrli, and Junichi Yamagishi. The SIWIS database: a multilingual speech database with acted emphasis. In Proceedings of Interspeech, pages 1532\u20131535, San Francisco, CA, USA, September 2016.<br \/>\n[7]. Perrotin, O., Stephenson, B., Gerber, S., Bailly, G. (2023) The Blizzard Challenge 2023. Proc. 18th Blizzard Challenge Workshop, 1-27, doi: 10.21437\/Blizzard.2023-1<br \/>\n[8]. https:\/\/git-lium.univ-lemans.fr\/jsalt2025\/wp1\/tts4all_eval<\/p><\/div><\/div><\/div><\/div><div class=\"panel-grid\" id=\"pg-26758-1\" ><div class=\"panel-grid-core\"><div class=\"panel-grid-cell\" id=\"pgc-26758-1-0\" >&nbsp;<\/div><div class=\"panel-grid-cell\" id=\"pgc-26758-1-1\" ><div class=\"panel-widget-style\" ><p><img src=\"https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2022\/11\/Logo-LIUM_Couleurs_WEB.png\" alt=\"\" \/ ><\/p><\/div><\/div><div class=\"panel-grid-cell\" id=\"pgc-26758-1-2\" >&nbsp;<\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Neural representations disentangled by prosody and application to speech synthesis Level: Master 1 Supervisors: Marie Tahon &#038; Th\u00e9o Mariotte (LIUM) Host Laboratory: Laboratoire d\u2019Informatique de l\u2019Universite\u0301 du Mans (LIUM) Location: Le Mans Beginning of internship: From April 2026 Contact: Marie Tahon or Th\u00e9o Mariotte (firstname.name@univ-lemans.fr) Application: Send a CV, a covering letter relevant to the [&hellip;]<\/p>\n<p class=\"more-link style2\"><a href=\"https:\/\/lium.univ-lemans.fr\/en\/offre-stage-m1-representation-neuronales-demelees-par-la-prosodie-et-application-a-la-synthese-de-parole\/\"  class=\"themebutton\"><span class=\"more-text\">READ MORE<\/span><span class=\"more-icon\"><i class=\"fa fa-angle-right 
fa-lg\"><\/i><\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":17310,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[78,83],"tags":[49],"acf":[],"_links":{"self":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26758"}],"collection":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/comments?post=26758"}],"version-history":[{"count":6,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26758\/revisions"}],"predecessor-version":[{"id":26765,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26758\/revisions\/26765"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media\/17310"}],"wp:attachment":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media?parent=26758"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/categories?post=26758"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/tags?post=26758"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}