Seminar by Manon Pinel, PhD student, Allo-media/LIUM
Speaker: Manon Pinel
Using self-supervised pre-trained acoustic and linguistic representations for Speech Emotion Recognition
In Speech Emotion Recognition (SER), the central concept is obviously emotion. Emotions can be defined through multiple theories, as they are perceived differently from one person to another. This perception is influenced by a person's state of mind, social environment, culture, etc. By definition, emotion is highly subjective, which explains why the task is challenging. Therefore, to build relevant datasets, we need multiple annotators to label the same data, which makes the creation of large emotional speech datasets very costly. Moreover, the multiple ways of describing emotion (discrete, with various labels, or continuous, with various axes) and the lack of a standardized annotation scheme make it very hard to combine multiple datasets for a SER task.
To build a relevant SER module under these constraints, we studied which speech representation best helps the module capture the aim of the task. To do so, we broke the problem down into two aspects extracted from speech: acoustic and linguistic. Both aspects are relevant and give different information about the emotional state of a speaker.
For these two aspects, we studied the best input features, aiming to preserve the emotional information while helping the module understand what we want to retrieve. In particular, we questioned the use of pre-training for feature extraction, as it is an increasingly studied approach to obtain better continuous representations of audio and text content.
We found that wav2vec and camemBERT, used as self-supervised models to represent our data, give the best results for continuous SER on AlloSat, our large French emotional database annotated along the satisfaction dimension.
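To make the two-branch idea concrete, here is a minimal sketch of combining an acoustic and a linguistic representation of one speech segment into a single vector for continuous prediction. The embeddings are random placeholders standing in for real wav2vec (frame-level) and camemBERT (token-level) outputs, and the mean-pooling, concatenation, and linear head are illustrative assumptions, not the speaker's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings for one speech segment (random, for illustration):
# in practice these would come from pre-trained wav2vec and camemBERT models.
acoustic_frames = rng.normal(size=(120, 768))    # T frames x 768 dims
linguistic_tokens = rng.normal(size=(25, 768))   # N tokens x 768 dims

def pool(features: np.ndarray) -> np.ndarray:
    """Mean-pool a variable-length sequence into a fixed-size vector."""
    return features.mean(axis=0)

def fuse(acoustic: np.ndarray, linguistic: np.ndarray) -> np.ndarray:
    """Late fusion: concatenate the pooled acoustic and linguistic vectors."""
    return np.concatenate([pool(acoustic), pool(linguistic)])

segment_repr = fuse(acoustic_frames, linguistic_tokens)

# A hypothetical linear regression head mapping the fused representation
# to a continuous score (e.g. a satisfaction value, as in AlloSat).
w = rng.normal(size=segment_repr.shape[0]) / segment_repr.shape[0]
satisfaction = float(segment_repr @ w)

print(segment_repr.shape)  # (1536,)
```

The fused vector keeps both information sources available to the regressor; other fusion strategies (early fusion, attention over modalities) are equally possible and not implied by the abstract.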