Interpretability of pre-trained models for automatic speech processing

Supervisors: Nicolas Dugué, Anthony Larcher (direction) and Marie Tahon (co-direction)
Host team: LIUM – LST
Location: Le Mans
Contact: Nicolas.Dugue(at), Marie.Tahon(at)


Context: With the emergence of neural networks, work in machine learning has shifted away from traditional methods, commonly known as feature engineering [1], which required considerable expertise in the data. Beyond their performance, neural networks make it possible to formulate the problem to be solved mathematically and to apply generic optimisation algorithms to obtain a solution. The solution thus emerges naturally from the data and the criterion to be optimised: human expertise, previously devoted to the data, is now essentially focused on formulating the problem. Neural approaches can process different types of data with generic processing chains, the first building block of which is often the learning of vector representations of the data.

Objectives: In this context, the thesis belongs to a field of research, interpretability, that seeks to reconnect with artificial intelligence as it was practised in the feature-engineering era. The aim of this field is to understand and explain neural models and their performance by relating the results to the data and to attributes that humans can interpret. In particular, the thesis work concerns the learning of vector representations, at the root of all processing chains. In the context of representation learning for textual data, research has revealed some promising avenues: learning in larger spaces [2,3], enforcing sparsity [4], and analysing internal representations [5,6]. Text is easily interpreted by humans and is naturally discrete. The very nature of the audio signal makes interpretation more difficult.

This is because it is a long continuous signal in which different types of information are superimposed at very different timescales: low-level acoustic descriptors (frame), pronunciation (phone), linguistics (word) and expressivity (sentence). Very recent work has led to signal visualisations that can be used to interpret certain aspects of audio signals, most of which rely on local models such as SHAP [7,8]. These visualisations are informative, but they require a strong understanding of the tool and are difficult to exploit at scale. In this thesis, we want to explore the vector representations used for speech processing (WavLM, x-vector, etc.) so as to interpret them with known expert descriptors, without having to re-train resource-intensive models such as WavLM. Speech synthesis based on these representations will make it possible to assess the extent to which the interpretations obtained automatically are actually interpretable by humans.
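As a purely hypothetical illustration of this direction, the sketch below probes synthetic stand-ins for frame-level embeddings with a linear model, to test how well an expert descriptor (here, a simulated F0-like value) can be read out of the representation. The names, dimensions and data are all assumptions for illustration, not part of the thesis project or of WavLM itself.

```python
import numpy as np

# Hypothetical setup: 'embeddings' stands in for frame-level vectors from a
# pretrained speech model (e.g. WavLM); 'f0' stands in for an expert
# descriptor such as fundamental frequency. All data here is synthetic.
rng = np.random.default_rng(0)
n_frames, dim = 500, 64
embeddings = rng.normal(size=(n_frames, dim))

# Simulate a descriptor that is (noisily) linearly encoded in the embeddings.
true_w = rng.normal(size=dim)
f0 = embeddings @ true_w + 0.1 * rng.normal(size=n_frames)

# Linear probe: least-squares regression from embeddings to the descriptor.
w, *_ = np.linalg.lstsq(embeddings, f0, rcond=None)
pred = embeddings @ w

# Coefficient of determination: how much of the descriptor the probe recovers.
r2 = 1 - np.sum((f0 - pred) ** 2) / np.sum((f0 - f0.mean()) ** 2)
print(f"probe R^2 = {r2:.3f}")
```

A high R² on a real descriptor would suggest that the attribute is linearly encoded in the embedding space, without any re-training of the underlying model.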

Building on our experience in the textual setting, we wish to initiate work on audio with this PhD thesis topic, in order to advance the construction of interpretable systems that exploit the audio signal.

The subject would be structured along several axes:

  1. explore the embedding spaces extracted over different time windows (from frame to sentence) in order to uncover interpretable latent dimensions (prosodic, phonetic, speaker and linguistic attributes), taking inspiration from [5], thus reconnecting the abstract spaces to humanly comprehensible features of the speech signal, or even build an approach (for example based on [9]) that yields a bijective mapping between the vector representation and the expert descriptors; uncovering interpretable latent dimensions will involve creating natural or synthetic speech datasets adapted to the required attributes;

  2. assess the robustness of these learning approaches, and in particular whether they rely consistently on the descriptors identified in the first axis. To do this, quantitative and perceptual evaluation methodologies will need to be put in place;

  3. design new learning models for frame embeddings that are interpretable by construction, enforcing a correspondence with the attributes identified in the first axis;

  4. apply this work to data security, anonymisation and/or bias removal: interpretable representations are more easily manipulated to hide undesirable elements in the vector representation based on expert descriptors.
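To illustrate the kind of manipulation the fourth axis alludes to, here is a minimal sketch, on synthetic data, of removing the linear direction associated with a binary attribute from a set of embeddings, in the spirit of the debiasing approach of [5]. All names, dimensions and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 200, 32

# Synthetic embeddings where a binary attribute (e.g. a speaker trait to be
# hidden) shifts the mean of each vector.
labels = rng.integers(0, 2, size=n)
attr_shift = rng.normal(size=dim)
emb = rng.normal(size=(n, dim)) + np.outer(labels, attr_shift)

# Estimate the attribute direction as the difference of class means.
direction = emb[labels == 1].mean(axis=0) - emb[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Project each embedding onto the orthogonal complement of that direction,
# removing the linearly encoded attribute information.
emb_anon = emb - np.outer(emb @ direction, direction)

# After projection, no embedding has any component left along the direction.
residual = np.abs(emb_anon @ direction).max()
print(f"max residual component: {residual:.2e}")
```

In an interpretable representation, such directions could be tied to named expert descriptors, which is what would make this kind of targeted anonymisation or bias removal easier to control.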

Required profile:

The candidate must be motivated to work on written and spoken language and show an interest in speech synthesis. He or she should hold a Master’s degree in Computer Science; experience in machine learning would be appreciated.


Send CV and cover letter to Nicolas Dugué and Marie Tahon before October 15, 6 pm (Nicolas.Dugue(at), Marie.Tahon(at)).


  • [1] Cardon, Dominique, et al. “Neurons spike back.” Réseaux 211.5 (2018): 173-220.
  • [2] Subramanian, Anant, et al. “SPINE: Sparse Interpretable Neural Embeddings.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.
  • [3] Prouteau, Thibault, et al. “SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!” International Symposium on Intelligent Data Analysis. 2021.
  • [4] Murphy, Brian, Partha Talukdar, and Tom Mitchell. “Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding.” Proceedings of COLING 2012.
  • [5] Bolukbasi, Tolga, et al. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” Advances in Neural Information Processing Systems 29 (2016).
  • [6] Clark, Kevin, et al. “What Does BERT Look At? An Analysis of BERT’s Attention.” arXiv preprint arXiv:1906.04341 (2019).
  • [7] Ge, W., J. Patino, M. Todisco and N. Evans. “Explaining Deep Learning Models for Spoofing and Deepfake Detection with Shapley Additive Explanations.” ICASSP 2022, pp. 6387-6391 (2022).
  • [8] Sivasankaran, S., E. Vincent and D. Fohr. “Explaining Deep Learning Models for Speech Enhancement.” Proc. Interspeech, pp. 696-700 (2021).
  • [9] Noé, Paul-Gauthier, et al. “A Bridge Between Features and Evidence for Binary Attribute-Driven Perfect Privacy.” ICASSP 2022, May 2022, Singapore.