HDR Defence, Marie Tahon
Date: 23 January 2023
Time: 14:00
Location: IC2, auditorium
Title: Expressive speech processing: back to interpretable systems?
Jury members:
- Corinne FREDOUILLE, Full Professor – Université d’Avignon, Reviewer
- Damien LOLIVE, Full Professor – ENSSAT/Université de Rennes 1, Reviewer
- Emmanuel VINCENT, Research Director – INRIA, LORIA, Reviewer
- Yannick ESTÈVE, Full Professor – Université d’Avignon
- Anthony LARCHER, Full Professor – Université du Mans
- Sylvain MEIGNIER, Full Professor – Université du Mans
Summary of the work:
Speech is a fundamental means of communication that takes place within an interaction between a speaker and their listeners. In addition to semantic content, the speech signal carries personal characteristics of the speaker such as age, gender or emotional state. The study of expressive speech is a multidisciplinary field of research, ranging from the acoustic production of speech to the cognitive mechanisms the speaker uses during the interaction to express their thoughts.
Since the beginning of my research in 2009, I have been trying to clarify what is meant by expressive speech by going back and forth between statistical or neural machine learning methods, often regarded as black boxes that perform well but are hard to interpret, and the analysis of the expressive phenomenon using acoustic and linguistic elements. My goal is to study how machine learning systems can provide knowledge about the acoustic, cognitive and interaction mechanisms that give rise to expressive speech. This work combines machine learning methods with a fine-grained analysis of expressivity in order to determine the links between data, expert features and the latent representations learned by the models.
My research at LIMSI, IRISA and LIUM covers the analysis of expressive speech at several levels: audio signal segmentation (speech, silence, overlapping speech, speaker, etc.), high-level characterization (interruptions, hesitations, emotion, etc.), and expressive speech signal generation. Studying both facets (analysis and synthesis) makes it possible to define the expressive phenomenon precisely through acoustic, prosodic, phonetic and linguistic characteristics, and also to validate these characteristics through signal synthesis and perceptual evaluation. This dual point of view is, in my opinion, very important for understanding the oral behavior of human beings in all its diversity and complexity.
Keywords:
Machine learning, Deep learning, Audio signal processing, Audio descriptors, Interpretability