PhD defence, Thibault Prouteau

Date : 03/07/2024
Time : 14h00
Location : Le Mans Université; IC2 building Auditorium
 

Title: Graphs, Words, and Communities: Converging Paths to Interpretability with a Frugal Embedding Framework
 

Jury members :

  • Vincent LABATUT, Assistant Professor, Université d’Avignon, Reviewer
  • Christine LARGERON, Professor, Université Jean Monnet, Saint-Étienne, Reviewer
  • Cécile BOTHOREL, Assistant Professor, IMT Atlantique, Brest, Examiner
  • Jean-Loup GUILLAUME, Professor, Université de La Rochelle, Examiner
  • Anaïs LEFEUVRE-HALFTERMEYER, Assistant Professor, Université d’Orléans, Examiner
  • Marie TAHON, Professor, Le Mans Université LIUM, Examiner
  • Sylvain MEIGNIER, Professor, Le Mans Université LIUM, Director of thesis
  • Nicolas DUGUÉ, Assistant Professor, Le Mans Université LIUM, Supervisor
  • Nathalie CAMELIN, Assistant Professor, Le Mans Université LIUM, Invited jury member

 

Abstract:

Representation learning with word and graph embedding models produces distributed representations of information that can in turn be used as input to machine learning algorithms.

Over the last two decades, the tasks of embedding graph nodes and words have shifted from matrix factorization approaches that could be trained in a matter of minutes to large models requiring ever larger quantities of training data and sometimes weeks of training on large hardware architectures. However, in a context of global warming where sustainability is a critical concern, we ought to look back to previous approaches and consider their performance with regard to resource consumption. Furthermore, with the growing involvement of embeddings in sensitive machine learning applications (judiciary system, health), the need for more interpretable and explainable representations has emerged. To foster efficient representation learning and interpretability, this thesis introduces the Lower Dimension Bipartite Graph Framework (LDBGF), a node embedding framework able to embed, with the same pipeline, graph data and text from large corpora represented as co-occurrence networks.

Within this framework, we introduce two implementations (SINr-NR, SINr-MF) that leverage community detection in networks to uncover a latent embedding space where items (nodes/words) are represented according to their links to communities.
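The community-based embedding idea can be illustrated with a small sketch: detect communities in a graph, then represent each node by the distribution of its edges across those communities. This is only a minimal illustration of the general principle using standard networkx calls, not the authors' SINr implementation; the graph and the modularity-based community detector are assumptions for the example.

```python
# Sketch of community-based node embedding: each node becomes a vector whose
# dimension c holds the fraction of the node's edges pointing into community c.
# NOT the authors' SINr code -- a minimal illustration of the idea.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # example graph (assumption for the sketch)
communities = list(greedy_modularity_communities(G))  # partition of the nodes

def embed(node):
    """Represent `node` by its normalized link counts to each community."""
    deg = G.degree(node)
    return [
        sum(1 for nb in G.neighbors(node) if nb in comm) / deg
        for comm in communities
    ]

vectors = {n: embed(n) for n in G.nodes}
```

Because the communities partition the nodes, each vector's entries sum to one, and the number of dimensions equals the (typically small) number of communities, which is what makes each dimension individually inspectable.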

We show that SINr-NR and SINr-MF can compete with similar embedding approaches on tasks such as predicting missing links in networks (link prediction) or node features (degree centrality, PageRank score). Regarding word embeddings, we show that SINr-NR is a good contender for representing words via word co-occurrence networks. Finally, we demonstrate the interpretability of SINr-NR in several respects: first, with a human evaluation showing that SINr-NR's dimensions are interpretable to some extent; second, by investigating the sparsity of vectors, and how having fewer dimensions may allow interpreting how the dimensions combine and allow sense to emerge.

 

Keywords:

representation learning, graph embedding, word embedding, community detection, interpretability, frugality