PhD defence, Thibault Prouteau

Date : 03/07/2024
Time : 14h00
Location : Le Mans Université; IC2 building Auditorium
 

Title: Graphs, Words, and Communities: Converging Paths to Interpretability with a Frugal Embedding Framework
 

Jury members :

  • Vincent LABATUT, Assistant Professor, Université d’Avignon, Reviewer
  • Christine LARGERON, Professor, Université Jean Monnet, Saint-Étienne, Reviewer
  • Cécile BOTHOREL, Assistant Professor, IMT Atlantique, Brest, Examiner
  • Jean-Loup GUILLAUME, Professor, Université de La Rochelle, Examiner
  • Anaïs LEFEUVRE-HALFTERMEYER, Assistant Professor, Université d’Orléans, Examiner
  • Marie TAHON, Professor, Le Mans Université LIUM, Examiner
  • Sylvain MEIGNIER, Professor, Le Mans Université LIUM, Director of thesis
  • Nicolas DUGUÉ, Assistant Professor, Le Mans Université LIUM, Supervisor
  • Nathalie CAMELIN, Assistant Professor, Le Mans Université LIUM, Invited jury member

 

Abstract:

Representation learning with word and graph embedding models produces distributed representations of information that can in turn be used as input to machine learning algorithms.

Over the last two decades, the tasks of embedding graph nodes and words have shifted from matrix factorization approaches that could be trained in a matter of minutes to large models requiring ever larger quantities of training data and sometimes weeks of training on large hardware architectures. However, in a context of global warming where sustainability is a critical concern, we ought to look back to previous approaches and consider their performance with regard to resource consumption. Furthermore, with the growing involvement of embeddings in sensitive machine learning applications (judiciary system, health), the need for more interpretable and explainable representations has emerged. To foster efficient representation learning and interpretability, this thesis introduces the Lower Dimension Bipartite Graph Framework (LDBGF), a node embedding framework able to embed, with the same pipeline, graph data and text from large corpora represented as co-occurrence networks.

Within this framework, we introduce two implementations (SINr-NR, SINr-MF) that leverage community detection in networks to uncover a latent embedding space where items (nodes/words) are represented according to their links to communities.
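The community-based embedding idea can be illustrated with a small sketch: detect communities in a graph, then represent each node by the distribution of its edges across those communities. This is only a minimal illustration of the general principle using standard networkx calls, not the authors' SINr implementation; the graph and the modularity-based community detector are assumptions for the example.

```python
# Sketch of community-based node embedding: each node becomes a vector whose
# dimension c holds the fraction of the node's edges pointing into community c.
# NOT the authors' SINr code -- a minimal illustration of the idea.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # example graph (assumption for the sketch)
communities = list(greedy_modularity_communities(G))  # partition of the nodes

def embed(node):
    """Represent `node` by its normalized link counts to each community."""
    deg = G.degree(node)
    return [
        sum(1 for nb in G.neighbors(node) if nb in comm) / deg
        for comm in communities
    ]

vectors = {n: embed(n) for n in G.nodes}
```

Because the communities partition the nodes, each vector's entries sum to one, and the number of dimensions equals the (typically small) number of communities, which is what makes each dimension individually inspectable.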

We show that SINr-NR and SINr-MF can compete with similar embedding approaches on tasks such as predicting missing links in networks (link prediction) or node features (degree centrality, PageRank score). Regarding word embeddings, we show that SINr-NR is a good contender for representing words via word co-occurrence networks. Finally, we demonstrate the interpretability of SINr-NR in several respects: first, with a human evaluation showing that SINr-NR's dimensions are interpretable to some extent; second, by investigating the sparsity of vectors, and how having fewer dimensions may allow interpreting how the dimensions combine and allow sense to emerge.

 

Keywords:

representation learning, graph embedding, word embedding, community detection, interpretability, frugality