PhD defence, Théo Mariotte

Date: 11/01/2024
Time: 14:00
Location: Le Mans Université, IC2 building, Auditorium

Title: Automatic Speech Processing in Meetings using Microphone Arrays

Jury members:

  • Jan “Honza” Černocký, Professor, Brno University of Technology, Czechia – Reviewer
  • Emmanuel Vincent, Research Director, Inria Nancy – Grand Est, France – Reviewer
  • Julie Mauclair, Assistant Professor, IRIT, Toulouse, France – Examiner
  • Gaël Richard, Professor, Télécom Paris, France – Examiner
  • Jean-Hugh Thomas, Professor, Le Mans Université, LAUM – Director of thesis
  • Anthony Larcher, Professor, Le Mans Université, LIUM – Supervisor
  • Silvio Montrésor, Assistant Professor, Le Mans Université, LAUM – Supervisor



This thesis work focuses on automatic speech processing, and more specifically on speaker diarization. This task requires the signal to be segmented to identify events such as voice activity, overlapped speech, or speaker changes. This work tackles the scenario where the signal is recorded by a device located in the center of a group of speakers, as in meetings. These conditions lead to a degradation in signal quality due to the distance between the speakers (distant speech).

To mitigate this degradation, one approach is to record the signal using a microphone array. The resulting multichannel signal provides information on the spatial distribution of the acoustic field. Two lines of research are explored for speech segmentation using microphone arrays.

The first introduces a method combining acoustic features with spatial features. We propose a new set of features based on the circular harmonics expansion. This approach improves segmentation performance under distant speech conditions while reducing the number of model parameters and improving robustness to changes in array geometry.
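As a rough illustration of the idea (not the exact feature set used in the thesis), the circular harmonic coefficients of a signal captured by a uniform circular array can be obtained by projecting each STFT frame onto complex exponentials of the microphone angles. All names and shapes below are hypothetical:

```python
import numpy as np

def circular_harmonics(stft_frame, mic_angles, max_order):
    """Project one multichannel STFT frame onto circular harmonics.

    stft_frame: complex array (n_mics, n_freq), one STFT slice per microphone
    mic_angles: array (n_mics,), angular microphone positions in radians
    max_order:  highest harmonic order N to keep
    Returns a complex array (2*N + 1, n_freq) of harmonic coefficients.
    """
    orders = np.arange(-max_order, max_order + 1)
    # Analysis basis: e^{-i n phi_m}, averaged over the microphones
    basis = np.exp(-1j * orders[:, None] * mic_angles[None, :]) / len(mic_angles)
    return basis @ stft_frame
```

Because the decomposition depends only on the microphone angles, not on a fixed channel ordering, features derived from these coefficients are a natural candidate for robustness to array changes.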

The second proposes several approaches that combine channels using self-attention. Different models, inspired by an existing architecture, are developed. Combining channels also improves segmentation under distant speech conditions. Two of these approaches make feature extraction more interpretable. The proposed distant speech segmentation systems also improve speaker diarization.
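A minimal sketch of self-attention across channels, assuming one feature vector per microphone and randomly initialized projections (the actual architectures in the thesis are not reproduced here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_channels(features, wq, wk, wv):
    """Combine per-channel features with self-attention across channels.

    features:   (n_channels, d), one feature vector per microphone channel
    wq, wk, wv: (d, d) learned projections (placeholders here)
    Returns a single (d,) vector: attention-weighted channel mixture.
    """
    q, k, v = features @ wq, features @ wk, features @ wv
    scores = q @ k.T / np.sqrt(features.shape[1])  # (n_ch, n_ch) similarities
    attn = softmax(scores, axis=-1)                # channel-to-channel weights
    mixed = attn @ v                               # (n_ch, d) recombined channels
    return mixed.mean(axis=0)                      # pool to one output stream
```

The attention matrix makes the combination inspectable: each row shows how much every input channel contributes to a given channel's output, which is one way such models gain interpretability.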

Channel combination shows poor robustness to changes in array geometry at inference time. To address this, a training procedure is proposed that improves robustness under array mismatch.

Finally, we identified a gap in the public datasets available for distant multichannel automatic speech processing. An acquisition protocol is introduced to build a new dataset, integrating speaker position annotation in addition to speaker diarization.

Thus, this work aims to improve the quality of multichannel distant speech segmentation. The proposed methods exploit the spatial information provided by microphone arrays while improving robustness under array mismatch.



Keywords: distant speech, multichannel audio, automatic speech segmentation, speaker diarization, deep learning