{"id":24300,"date":"2019-12-13T16:29:23","date_gmt":"2019-12-13T15:29:23","guid":{"rendered":"https:\/\/lium.univ-lemans.fr\/?p=24300"},"modified":"2020-04-09T13:55:17","modified_gmt":"2020-04-09T11:55:17","slug":"apprentissage-actif-interpretation-et-controle-pour-la-synthese-neuronale-de-parole-expressive","status":"publish","type":"post","link":"https:\/\/lium.univ-lemans.fr\/en\/apprentissage-actif-interpretation-et-controle-pour-la-synthese-neuronale-de-parole-expressive\/","title":{"rendered":"Active learning, interpretation and control for neural synthesis of expressive speech"},"content":{"rendered":"<div class=\"panel-grid\" id=\"pg-24300-0\" ><div class=\"panel-grid-core\"><div class=\"panel-grid-cell\" id=\"pgc-24300-0-0\" ><div class=\"panel-widget-style\" ><h2 style=\"color: #e5442d;\">Active learning, interpretation and control for neural synthesis of expressive speech<\/h2>\n<p>&nbsp;<br \/>\n<strong>Supervisors:<\/strong> Sylvain Meignier and Anthony Larcher<br \/>\n<strong>Co-supervisor(s):<\/strong> Marie Tahon<br \/>\n<strong>Email:<\/strong> firstname.lastname@univ-lemans.fr<br \/>\n<strong>Application deadline:<\/strong> 22 May 2020<br \/>\n&nbsp;<br \/>\n<strong>Context:<\/strong><\/p>\n<p align=\"justify\">The thesis will take place at the Laboratoire d&#8217;Informatique de l&#8217;Universit\u00e9 du Mans (LIUM) in the LST (Language and Speech Technology) team. The candidate should be motivated to work on written and spoken language. He or she must have machine learning skills and show an interest in speech synthesis.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<br \/>\n<strong>Description<\/strong><\/p>\n<p align=\"justify\">Text-to-speech synthesis (TTS) is a key challenge for the future, both for better understanding the mechanisms of speech and language production and for improving consumer tools that rely on automatic speech processing. 
Most current parametric TTS approaches, whether based on hidden Markov models (HMM) or on the neural paradigm (NN), produce synthetic signals adapted to a given style or speaker [1,2]. Currently, when expressivity is exploited at all (which is rarely the case), it is generally obtained implicitly from statistics computed on the training data. Explicitly defining prosody remains a major challenge, but recent work has shown that a latent representation of prosody can be learned with neural networks [3]. Making such latent representations explicit for expressive synthesis will make it possible to introduce user control.<\/p>\n<p align=\"justify\">In conventional approaches, models no longer evolve once they are delivered to the user. By learning models incrementally, that is, as new data arrives, active learning can either improve model performance by enlarging the training corpus or adapt the models to a particular domain. Such systems have been tried in speech recognition [4] and emotion detection [5], but to date no work has been done for speech synthesis. 
The work carried out during this thesis will make it possible to include user control over the synthesis outputs, in the form of corrections to automatic outputs or the addition of new data.<\/p>\n<p>&nbsp;<br \/>\n<strong>Objectives<\/strong><\/p>\n<p align=\"justify\">The main objective of the thesis is to propose, develop and validate methods that allow the user to interact with a neural model during training. First, the candidate will study the visualization and interpretation of the latent representations learned by a state-of-the-art neural model (Tacotron + WaveNet) in terms of prosody, speaker, expressivity and pronunciation. He or she will define user control elements that take the form of annotations and are then integrated into the training corpus using techniques such as acoustic parameter adaptation [6], embeddings [7], attention mechanisms [8], or the learning of intermediate models [9]. In parallel, the candidate will propose neural architectures compatible with active learning (model reinforcement or domain adaptation) and determine the most relevant active learning strategies. Finally, an important part of the work will consist of evaluating the synthesis produced in an audiobook context [10,11].<\/p>\n<p>&nbsp;<br \/>\n<strong>References<\/strong><\/p>\n<p align=\"justify\">[1] Z. Wu, P. Swietojanski, C. Veaux, S. Renals, S. King (2015). A study of speaker adaptation for DNN-based speech synthesis. In proc. Interspeech, pp. 879\u2013883.<br \/>\n[2] W. Ping, K. Peng, A. Gibiansky, S. O. Arik et al. (2018). 
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning, arXiv:1710.07654.<br \/>\n[3] RJ Skerry-Ryan et al. (2018). Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.<br \/>\n[4] Syed, A. R., Rosenberg, A., Kislal, E. (2016). Supervised and unsupervised active learning for automatic speech recognition of low-resource languages. In proc. ICASSP. Shanghai, China.<br \/>\n[5] Zhang, Z., Deng, J., Marchi, E., Schuller, B. (2013). Active Learning by Label Uncertainty for Acoustic Emotion Recognition. In proc. Interspeech.<br \/>\n[6] Kanagawa, H., Nose, T., Kobayashi, T. (2013). Speaker-independent style conversion for HMM-based expressive speech synthesis. In proc. ICASSP. Vancouver, Canada, pp. 7864\u20137868.<br \/>\n[7] Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., Miller, J. (2018). Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. In: International Conference on Learning Representations (ICLR).<br \/>\n[8] Wan, M., Degottex, G., Gales, M. J. (2017). Integrated speaker-adaptive speech synthesis. In: ASRU.<br \/>\n[9] Tahon, M., Lecorv\u00e9, G., Lolive, D. (2018). Can we Generate Emotional Pronunciations for Expressive Speech Synthesis? IEEE Transactions on Affective Computing.<br \/>\n[10] A. Sini, D. Lolive, G. Vidal, M. Tahon and E. Delais-Roussarie (2018). SynPaFlex-Corpus: An Expressive French Audiobooks Corpus Dedicated to Expressive Speech Synthesis. Proc. of LREC.<br \/>\n[11] S. King, J. Crumlish, A. Martin and L. Wihlborg (2018). The Blizzard Challenge 2018, in Proc. 
Blizzard Workshop, Hyderabad, India.\n<\/p><\/div><\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Active learning, interpretation and control for neural synthesis of expressive speech &nbsp; Supervisor: Sylvain Meignier and Anthony Larcher Co-supervisor(s): Marie Tahon Mails : prenom.nom@univ-lemans.fr Application deadline : 22 May 2020 &nbsp; Context : The thesis will take place at the Laboratoire d&#8217;Informatique de l&#8217;Universit\u00e9 du Mans (LIUM) in the LST (Language and Speech Technology) team. [&hellip;]<\/p>\n<p class=\"more-link style2\"><a href=\"https:\/\/lium.univ-lemans.fr\/en\/apprentissage-actif-interpretation-et-controle-pour-la-synthese-neuronale-de-parole-expressive\/\"  class=\"themebutton\"><span class=\"more-text\">READ MORE<\/span><span class=\"more-icon\"><i class=\"fa fa-angle-right fa-lg\"><\/i><\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":13249,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[78,87],"tags":[49],"acf":[],"_links":{"self":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/24300"}],"collection":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/comments?post=24300"}],"version-history":[{"count":0,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/24300\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media\/13249"}],"wp:attachment":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media?parent=24300"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/
wp\/v2\/categories?post=24300"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/tags?post=24300"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}