Active learning, interpretation and control for neural synthesis of expressive speech

Starting: 01/10/2021
PhD Student: Thibault Gaudier
Advisor(s): Anthony Larcher
Co-advisor(s): Marie Tahon, Yannick Estève (LIA)
Funding: Doctoral research grant (allocation de recherche) from the Ministère de la Recherche

The main objective of the project is to propose, develop and validate methods that make it possible to:

  1. generate expressive speech from a user-given instruction using either text-to-speech systems or voice conversion;
  2. interact with the system during training and inference to correct its audio outputs.

First, we will study the visualization and interpretation of latent representations learned by a state-of-the-art neural model (Tacotron + WaveNet) in terms of prosody, speaker, expressiveness and pronunciation. It will then be necessary to define user control elements, such as annotations, that can be integrated into the training corpus using techniques such as acoustic parameter adaptation, embeddings, attention mechanisms, or the training of intermediate models.
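As an illustration of the kind of analysis involved, the sketch below projects utterance-level style embeddings (e.g., outputs of a Tacotron-style reference encoder) to two dimensions with t-SNE and colours them by speaker. The embedding matrix and speaker labels here are random placeholders, not outputs of the project's actual models.

```python
# Minimal sketch: 2-D visualization of utterance-level style embeddings.
# `embeddings` and `speakers` are hypothetical placeholders standing in for
# vectors extracted from a trained Tacotron-style reference encoder.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 256))   # hypothetical 256-d style vectors
speakers = rng.integers(0, 4, size=200)    # hypothetical speaker labels

# Non-linear projection to 2-D for visual inspection of the latent space.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for spk in np.unique(speakers):
    mask = speakers == spk
    plt.scatter(points[mask, 0], points[mask, 1], s=10, label=f"speaker {spk}")
plt.legend()
plt.title("t-SNE projection of style embeddings")
plt.show()
```

With real embeddings, clusters in such a plot can be inspected against prosodic or expressive annotations to assess what the latent space actually encodes.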

In parallel, neural architectures compatible with active learning (model reinforcement or domain adaptation) will be proposed, and the most relevant active learning strategies will be determined (a generic example is sketched below). Finally, an important part of the work will be devoted to evaluating the synthesized speech, in the context of audiobooks or journalistic content.
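For illustration only, the following sketch shows a generic pool-based active learning round using entropy-based uncertainty sampling; the scoring function, budget and pool are placeholders and do not reflect the strategies the project will ultimately retain.

```python
# Minimal sketch of one pool-based active learning round: the examples the
# model is least certain about are selected for human annotation.
# All quantities here are hypothetical placeholders.
import numpy as np

def uncertainty(probs: np.ndarray) -> np.ndarray:
    """Entropy of per-example predictive distributions (higher = less certain)."""
    p = np.clip(probs, 1e-8, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_annotation(pool_probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` pool items the model is least sure about."""
    return np.argsort(-uncertainty(pool_probs))[:budget]

# Example: random predictive distributions over 5 classes for 1000 pool items.
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(5), size=1000)
to_annotate = select_for_annotation(pool_probs, budget=20)
print(to_annotate)
```

In the project, the selected items would be corrected or annotated by the user and fed back into training, closing the human-in-the-loop cycle described above.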