Seminar by Martin Lebourdais and Théo Mariotte, PhD students at LIUM
Joint work on a data loading process and its application to VAD, speaker turn detection and overlap detection
Diarization is the task of finding “Who spoke when?” in an audio stream. It relies on two subtasks: segmentation and clustering. Segmentation most often includes voice activity detection (VAD), overlapped speech detection (OSD) and speaker change detection (SCD). All three tasks take a sequence (the audio signal) as input and output a sequence. Classifying the frames of the output sequence segments the audio signal, i.e. it locates the boundaries between the parts of interest in the speech signal.
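As a minimal sketch of how frame-level classification yields a segmentation, the function below merges runs of positive frames into (start, end) segments. The function name, the binary label convention (1 = class of interest, e.g. speech) and the `frame_shift` parameter are assumptions made for illustration, not part of the presented system.

```python
from typing import List, Tuple


def frames_to_segments(labels: List[int],
                       frame_shift: float = 0.01) -> List[Tuple[float, float]]:
    """Merge consecutive frames labelled 1 into (start, end) segments in seconds.

    `labels` is a hypothetical per-frame decision sequence (1 = positive class).
    `frame_shift` is the assumed duration of one frame, in seconds.
    """
    segments = []
    start = None  # index of the first frame of the segment currently open
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new segment begins on this frame
        elif label != 1 and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None  # the segment ended on the previous frame
    if start is not None:
        # The sequence ended while a segment was still open.
        segments.append((start * frame_shift, len(labels) * frame_shift))
    return segments


if __name__ == "__main__":
    print(frames_to_segments([0, 1, 1, 0, 0, 1], frame_shift=0.5))
```

The same routine applies to VAD or OSD outputs once they are thresholded to binary decisions; for SCD, the positive frames mark change points rather than regions.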
One difficulty when training such a system comes from the imbalance of the classes to be detected, especially in the case of overlap and speaker change detection.
This work introduces a data loading process (DataLoader) to format and distribute the speech segments used to train these three tasks. In addition, to overcome the imbalance of the data, we propose a segment selection process (DataSampler) that precisely sets the proportion of examples of each class (speech, overlap, speaker change, non-speech) in each training mini-batch. The data loading process also enables the use of multi-microphone recordings, which are investigated in the context of speech segmentation. Experimental evaluation is being carried out on the multi-microphone AMI speech corpus and the DiHard diarization challenge corpus.
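The idea of fixing class proportions per mini-batch can be sketched as follows. This is not the DataSampler presented in the seminar: the function name, its arguments and the choice to sample with replacement (so that rare classes can be oversampled) are assumptions made for illustration, and the proportions are assumed to sum to 1.

```python
import random
from typing import Dict, Iterator, List


def balanced_batches(indices_by_class: Dict[str, List[int]],
                     proportions: Dict[str, float],
                     batch_size: int,
                     num_batches: int,
                     seed: int = 0) -> Iterator[List[int]]:
    """Yield mini-batches of segment indices with fixed class proportions.

    `indices_by_class` maps a class name to the indices of the segments that
    contain it; `proportions` maps each class to its target share of a batch.
    """
    rng = random.Random(seed)
    # Turn proportions into integer counts per batch, rounding down first.
    raw = {c: p * batch_size for c, p in proportions.items()}
    counts = {c: int(v) for c, v in raw.items()}
    # Hand the remaining slots to the classes with the largest remainders.
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    for _ in range(num_batches):
        batch = []
        for c, n in counts.items():
            # Sampling with replacement lets rare classes fill their quota.
            batch.extend(rng.choices(indices_by_class[c], k=n))
        rng.shuffle(batch)  # avoid a fixed class order inside the batch
        yield batch


if __name__ == "__main__":
    pools = {
        "speech": list(range(0, 100)),
        "overlap": [100, 101, 102],
        "speaker_change": [200, 201],
        "non_speech": list(range(300, 350)),
    }
    shares = {name: 0.25 for name in pools}
    for batch in balanced_batches(pools, shares, batch_size=8, num_batches=2):
        print(batch)
```

In a PyTorch setting, a generator of this kind could plausibly be wrapped in a batch sampler and passed to the data loader, but that wiring is left out here to keep the sketch dependency-free.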