Extraction of end-to-end semantic information from audio signal

Starting: 01/10/2020
PhD Student: Martin Lebourdais
Advisor(s): Sylvain Meignier
Co-advisor(s): Antoine Laurent, Marie Tahon
Funding: ANR GEM

The GEM project aims to describe the differences in representation and treatment between women and men in the media, based on the automatic analysis of large volumes of French-language data held in the INA and Deezer collections: TV, radio, newspapers and music. The project's ambition is to conduct the largest study to date on the place of women and men in the media, based on the analysis of several million documents sampled over a period of more than 80 years.

This massive quantitative approach aims to create new knowledge in the social sciences and humanities, to assess how differences in the representation of women and men have evolved over time and across types of material, and to bring objective evidence to part of the public debate on gender equality in the media. This automatic description of the representation of men and women addresses societal as well as industrial issues: estimating the impact of actions aimed at a fairer representation of the sexes in broadcast programs, exploring and enhancing vast digital collections, improving the performance of automatic systems, and studying borderline cases. Extracting indicators of gender differences in treatment requires overcoming technological and methodological barriers, which will contribute to advances in the state of the art in information and communication science and technology (ICST) and in the social sciences and humanities (SHS).


The work carried out in this thesis focuses on the extraction of semantic information from the audio signal (thematic segmentation, interaction graphs, speaker roles, …). First, the PhD student will develop a speaker-based segmentation tool able to automatically identify regions of overlapping speech from the audio signal. Second, using annotated INA data (incivilities, COVID-19), speech interruptions will be characterized automatically and at large scale, in collaboration with SHS researchers. This characterization will draw jointly on acoustic, linguistic and possibly paralinguistic representations.


1) A. Caubrière, N. Tomashenko, A. Laurent, E. Morin, N. Camelin and Y. Estève, “Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability,” Interspeech, 2019.
2) A. Caubrière, Y. Estève, N. Camelin, E. Simonnet, A. Laurent and E. Morin, “End-to-End Named Entity and Semantic Concept Extraction from Speech,” 2018 IEEE Spoken Language Technology Workshop (SLT), 2018.
3) A. Laurent, N. Camelin and C. Raymond, “Boosting bonsai trees for efficient features combination: application to speaker role identification,” Interspeech, 2014.
4) D. Doukhan, “À la radio et à la télé, les femmes parlent deux fois moins que les hommes,” La revue des médias, 2019.
5) D. Doukhan, J. Carrive, F. Vallet, A. Larcher and S. Meignier, “An Open-Source Speaker Gender Detection Framework for Monitoring Gender Equality,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 5214-5218.
6) L. Bullock, H. Bredin and L. P. Garcia-Perera, “Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection,” ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7114-7118.
7) H. Bredin et al., “Pyannote.Audio: Neural Building Blocks for Speaker Diarization,” ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7124-7128.