Active learning, interpretation and control for neural synthesis of expressive speech
PhD Student: Thibault Gaudier
Advisor(s): Anthony Larcher
Co-advisor(s): Marie Tahon, Yannick Estève (LIA)
Funding: Allocation de recherche du Ministère de la Recherche
The main objective of the project is to propose, develop and validate methods that make it possible to:
- generate expressive speech from a user-given instruction using either text-to-speech systems or voice conversion;
- interact with the system during learning and inference to correct the system’s audio outputs.
First, we will study the visualization and interpretation of latent representations learned by a state-of-the-art neural model (Tacotron + WaveNet) in terms of prosody, speaker, expressiveness and pronunciation. It will be necessary to define user control elements, such as annotations, that can be integrated into the training corpus using techniques such as acoustic parameter adaptation, embeddings, attention mechanisms, or intermediate model learning.
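As a minimal sketch of what such visualization might look like, the snippet below projects utterance-level latent vectors to 2D with PCA for visual inspection. The embeddings here are synthetic placeholders; in practice they would be, for example, reference-encoder outputs of a Tacotron-style model, and dimensionality reduction could equally use t-SNE or UMAP.

```python
import numpy as np

# Hypothetical latent vectors: 100 utterance-level embeddings of dimension 64
# (synthetic stand-ins for representations extracted from the synthesis model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))

# Center the data and take the top-2 principal components via SVD,
# giving one 2D point per utterance for plotting and interpretation.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projection = centered @ vt[:2].T

print(projection.shape)  # (100, 2)
```

Coloring such a projection by speaker or expressiveness label is a common way to check whether those factors are disentangled in the latent space.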
In parallel, neural architectures compatible with active learning (model reinforcement or domain adaptation) will be proposed, and the most relevant active learning strategies will be determined. Finally, an important part of the work will consist in evaluating the synthesized speech, in the context of audiobooks and journalistic content.
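One candidate active learning strategy is uncertainty sampling, sketched below under the assumption that the model exposes a per-utterance confidence score (the scores here are synthetic; the selection rule is the point).

```python
import numpy as np

# Hypothetical confidence scores over an unlabeled pool of 200 utterances,
# e.g. the model's posterior probability for its predicted prosodic class.
rng = np.random.default_rng(1)
confidences = rng.uniform(0.5, 1.0, size=200)

# Annotation budget: how many utterances the human annotator will correct.
budget = 10

# Uncertainty sampling: query the annotator on the utterances the model
# is least confident about, i.e. the `budget` lowest-confidence items.
query_indices = np.argsort(confidences)[:budget]

print(len(query_indices))  # 10
```

The queried utterances would then be corrected by the user and fed back into training, closing the interaction loop described above.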