Interpretable transformers

Supervisor: Nicolas Dugué, Associate Professor (Maître de conférences)
Host team: LIUM – LST
Location: Le Mans
Contact: Nicolas.Dugue(at)


Context: The ANR DIGING project.
Recent approaches to learning lexical embeddings have focused on performance, often to the detriment of interpretability and algorithmic complexity. However, being able to interpret and interact with the results produced by these systems is a prerequisite for their adoption by users. This is particularly the case when such technologies are used in sensitive sectors such as the legal and medical fields. It is also the case in applications linked to digital humanities, where the representations produced must be understandable by end users. With DIGING, we have proposed a new high-performance and computationally efficient approach for constructing interpretable lexical embeddings [PDCM22] based on the theory of complex networks [PCD+21]: SINr, for Sparse Interpretable Node Representations. This approach makes it possible to learn extremely sparse embeddings [MTM12, SPJ+18] that maintain good performance down to just 10 activations per vector [GPD23].
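
To give an idea of what "10 activations per vector" means in practice, here is a minimal sketch. It assumes that sparsifying a vector amounts to keeping its k largest values and zeroing the rest; the exact procedure used in [GPD23] may differ. The function name `keep_top_k` and the random data are purely illustrative and are not part of the SINr library.

```python
import numpy as np

def keep_top_k(vector, k):
    """Zero out all but the k largest entries of a 1-D vector (illustrative helper)."""
    if k <= 0:
        return np.zeros_like(vector)
    if k >= vector.size:
        return vector.copy()
    sparse = np.zeros_like(vector)
    top = np.argpartition(vector, -k)[-k:]  # indices of the k largest values
    sparse[top] = vector[top]
    return sparse

# Made-up non-negative vector standing in for one word embedding.
rng = np.random.default_rng(0)
dense = rng.random(1000)

sparse = keep_top_k(dense, 10)
active_dims = np.flatnonzero(sparse)  # the only dimensions left to inspect
```

With so few active dimensions, a reader only has to look at the handful of indices in `active_dims` to see what a word's vector is made of.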


Objectives:

Building on the representations learned with SINr, with the interpretable embedding approach as the first building block, the recruited candidate will be in charge of designing end-to-end interpretable neural architectures for classification. The aim is to remain in an interpretable space throughout the classification. To this end, deep mechanisms can be implemented based on the hierarchical structure of the embeddings produced by SINr, inspired for example by the work of Victoria Bourgeais [BZBHH21]. Attention mechanisms such as those introduced by Bahdanau et al. [BCB14] can also be considered, here in a dot-product form using an attention vector dedicated to the task: if this vector lies in the same space as the input, it will also be interpretable.
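
The idea of a task-dedicated attention vector can be sketched as follows. This is only an illustration under simple assumptions, not the architecture to be developed: tokens are represented by sparse non-negative vectors, a single task vector lives in the same space, and a softmax over dot products gives the attention weights. The name `task_attention` and all values are made up.

```python
import numpy as np

def task_attention(token_vectors, task_vector):
    """Pool token vectors using dot-product attention against one task vector."""
    scores = token_vectors @ task_vector   # one score per token
    scores = scores - scores.max()         # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ token_vectors       # weighted sum, still in the input space
    return weights, pooled

# Toy input: 3 tokens in a 4-dimensional interpretable space (made-up values).
tokens = np.array([
    [0.9, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
])
# Hypothetical task vector; in a real model it would be learned.
task = np.array([2.0, 0.0, 0.0, 0.5])

weights, pooled = task_attention(tokens, task)

# Because the task vector shares the input space, its largest entries
# point to the interpretable dimensions the task attends to.
top_dims = np.argsort(-task)[:2]
```

Here both the pooled representation and the task vector can be read dimension by dimension, which is what keeps the attention step interpretable.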

Other approaches are also possible for exploiting interpretability within more complex models such as transformers. Clark et al. [CKLM19] highlighted the roles played by attention heads, and in particular their specialisation. Geva et al. [GSBL20] studied the feed-forward modules of the transformer to determine their role. Finally, Mickus et al. [MPC22] dissected the transformer to measure the contribution of each of its modules (attention, bias, feed-forward, initial embedding) to the output representations and to the prediction of the masked word. The state of the art has thus made progress on the explainability of transformers and their mechanisms, allowing us to envisage reduced and interpretable architectures inspired by them.

To evaluate these architectures, we will consider classification tasks such as named entity recognition, sentiment polarity analysis or hateful content detection. The work will also involve developing an end-to-end interpretability evaluation framework.

Required profile:

  • PhD in computer science or computational linguistics;
  • Interest in interpretability and in understanding systems;
  • Python programming;
  • GitHub and CI/CD;
  • Experience in training neural networks.

Organisation of research work:

Work will be done at LIUM. You will collaborate with two PhD students and one student on a rotation programme. The salary will be approximately €2,200 net per month, for 12 months.


Send CV and cover letter to Nicolas Dugué (Nicolas.Dugue(at)


  • [BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [BDGP23] Anna Béranger, Nicolas Dugué, Simon Guillot, and Thibault Prouteau. Filtering communities in word co-occurrence networks to foster the emergence of meaning. In Complex Networks and Their Applications, pages 377–388, 2023.
  • [BZBHH21] Victoria Bourgeais, Farida Zehraoui, Mohamed Ben Hamdoune, and Blaise Hanczar. Deep GONet: self-explainable deep neural network based on gene ontology for phenotype prediction from gene expression data. BMC Bioinformatics, 22(10):1–25, 2021.
  • [CKLM19] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341, 2019.
  • [GPD23] Simon Guillot, Thibault Prouteau, and Nicolas Dugué. Sparser is better: one step closer to word embedding interpretability. In IWCS, 2023.
  • [GSBL20] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.
  • [MPC22] Timothee Mickus, Denis Paperno, and Mathieu Constant. How to dissect a muppet: The structure of transformer embedding spaces. Transactions of the Association for Computational Linguistics, 10:981–996, 2022.
  • [MTM12] Brian Murphy, Partha Talukdar, and Tom Mitchell. Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. pages 1933–1950, 2012.
  • [PCD+21] Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, Nathalie Camelin, and Sylvain Meignier. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin! In IDA, 2021.
  • [PDCM22] Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, and Sylvain Meignier. Are embedding spaces interpretable? Results of an intrusion detection evaluation on a large French corpus. In LREC, 2022.
  • [SPJ+18] Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. SPINE: Sparse interpretable neural embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.