Speech emotion classification using a regression model on a dimensional representation

Supervisor(s): Meysam SHAMSI
Host laboratory: LIUM
Place: Le Mans
Contact: Meysam.Shamsi(at)univ-lemans.fr
Application: Send CV + cover letter to Meysam Shamsi before November 18, 2022.


Problem: There are two main approaches to modeling the emotion recognition problem. Emotions can be categorized with discrete labels [1] such as happiness, sadness or anger, which makes recognition a classification task. Alternatively, an attribute-based approach uses a continuous space [2,3] defined by dimensions such as arousal (calm versus active), valence (negative versus positive) and dominance (weak versus strong) to identify emotional states, which makes recognition a regression task.
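As a minimal illustration of the two representations (all annotation values below are hypothetical, chosen only for this example), the same utterance can be described either by a categorical label or by a point in the arousal/valence/dominance space:

```python
# Two ways of annotating the emotional state of one utterance.
# All values are hypothetical, for illustration only.

# Categorical representation: one label from a closed set.
categorical = "anger"

# Dimensional (attribute-based) representation: continuous attributes,
# here on an assumed 1-5 scale (arousal: calm -> active,
# valence: negative -> positive, dominance: weak -> strong).
dimensional = {"arousal": 4.2, "valence": 1.8, "dominance": 4.0}

print(categorical, dimensional)
```

The classification approach predicts the first kind of target; the regression approach predicts the second.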

While the classification approach is more interpretable for humans, it is limited by the lexical dictionary. Furthermore, the intra-class distances, and consequently the confusion between classes, are not equal. On the other hand, using continuous values allows a more accurate expression/perception of the emotional state. The disagreement that arises when assigning an emotional label to specific values in the attribute-based space is one of the main challenges in using these two types of information for the expression/perception of emotions [4].


Objectives: The objective of this project is to investigate the performance of models that use continuous target variables to predict multi-class emotional states from speech signals, and to study the relation between these representations.


Approach: Deep neural network models are a pertinent solution for mapping speech signals to emotional states [5].

The performance of three algorithms will be investigated:
(1) Classical classification model: the output of the model is a label from a set of emotions.
(2) Classification via regression: by changing the output layer of the neural network, it can be implemented as a regression model that predicts emotional attributes. In addition to enabling soft labeling [6,7], the output can be converted to a class label according to the likelihood of the emotional attributes under each emotional category.
(3) Simultaneous classification and regression model: another approach is multitask learning using both types of information simultaneously, such as [8], which reports improved classification performance on facial expression data.
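A simple variant of the second approach can be sketched as follows: the regressor's continuous output is mapped to the closest emotion category. The sketch below uses Euclidean distance to class centroids rather than a full likelihood model, and the centroid coordinates are hypothetical; in practice they would be estimated from a jointly annotated corpus such as IEMOCAP or MSP-Podcast.

```python
import math

# Hypothetical class centroids in (arousal, valence, dominance) space,
# on an assumed 1-5 scale; real values would be estimated from data.
CENTROIDS = {
    "anger":     (4.2, 1.5, 4.0),
    "happiness": (3.8, 4.5, 3.5),
    "sadness":   (1.8, 1.8, 1.5),
    "neutral":   (2.5, 3.0, 2.5),
}

def regression_to_label(avd):
    """Map a regressor's continuous (arousal, valence, dominance)
    prediction to the nearest emotion category."""
    return min(CENTROIDS, key=lambda c: math.dist(avd, CENTROIDS[c]))

# A predicted attribute vector close to the "sadness" region:
print(regression_to_label((2.0, 1.9, 1.6)))  # -> sadness
```

Replacing the distance rule with class-conditional likelihoods (as suggested in the text) would only change the `min` criterion, not the overall pipeline.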

Given the classification outputs of these three approaches, the importance and impact of the different representations will be investigated using a common performance metric such as accuracy.
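Since all three approaches ultimately produce a class label, they can be scored with the same metric. A minimal sketch of such a comparison (the gold labels and system predictions below are hypothetical):

```python
def accuracy(gold, pred):
    """Fraction of utterances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hypothetical gold labels and per-system predictions, for illustration only.
gold = ["anger", "sadness", "neutral", "happiness"]
systems = {
    "classification":                ["anger", "sadness", "neutral", "neutral"],
    "classification-via-regression": ["anger", "sadness", "anger", "happiness"],
    "simultaneous (multitask)":      ["anger", "sadness", "neutral", "happiness"],
}

for name, pred in systems.items():
    print(f"{name}: {accuracy(gold, pred):.2f}")
```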


In order to study the impact of emotion representation on the recognition of emotional states in speech, a dataset containing both continuous and categorical annotations is required. The IEMOCAP [9] and MSP-Podcast [10] corpora, which are annotated with both categorical emotions and attribute-based emotions (valence, activation, dominance), provide this opportunity. As a starting point, simplifying the multidimensional attributes to a single dimension, the AlloSat [11] corpus, annotated along a satisfaction/frustration axis, can be employed.


Expected results: From a machine learning point of view, the project studies the performance of classification-via-regression models for subjective variables that are ordinal. From an affective computing point of view, this model also opens the way to investigating the mapping from categorical labels into the continuous space. It will shed new light on the importance of dimensional features in emotion recognition.



Applicant profile: Candidates motivated by artificial intelligence, enrolled in a Master's degree in Computer Science or a related field.



  1. Ekman, P. (1999). Basic Emotions, pages 301–320. Wiley, New York.
  2. Russell, J. (1997). Reading emotions from and into faces: Resurrecting a dimensional-contextual perspective, pages 295–360. Cambridge University Press, U.K.
  3. Bradley, M. M. and Lang, P. J. (1994). Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59.
  4. Sethu, V., Provost, E. M., Epps, J., Busso, C., Cummins, N., and Narayanan, S. (2019). The ambiguous world of emotion representation. arXiv preprint arXiv:1909.00360.
  5. Xu, M., Zhang, F., Cui, X., and Zhang, W. (2021). Speech emotion recognition with multiscale area attention and data augmentation. In ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6319–6323. IEEE.
  6. Tarantino, L., Garner, P. N., and Lazaridis, A. (2019). Self-attention for speech emotion recognition. In Interspeech, pages 2578–2582.
  7. Lotfian, R. and Busso, C. (2018). Predicting categorical emotions by jointly learning primary and secondary emotions through multitask learning. In Interspeech 2018.
  8. Handrich, S., et al. (2021). Simultaneous prediction of valence/arousal and emotion categories and its application in an HRC scenario. Journal of Ambient Intelligence and Humanized Computing, 12(1):57–73.
  9. Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., and Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359.
  10. Lotfian, R. and Busso, C. (2019). Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4):471–483.
  11. Macary, M., Tahon, M., Estève, Y., and Rousseau, A. (2020). AlloSat: A new call center French corpus for satisfaction and frustration analysis. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1590–1597.

Other technical support:
SER dataset : https://superkogito.github.io/SER-datasets/