BASIN-NMF – Laboratoire d'Informatique de l'Université du Mans

Summary

The project aims to improve the interpretability of neural models used for audio segmentation (e.g. speaker diarization). These models are usually treated as black boxes that produce a segmentation directly from the audio signal. In some contexts, however, the user needs to be told how the system reaches its decision. Applications of interpretability methods in the audio field remain limited to date.

Non-negative matrix factorisation (NMF) has proven useful for explaining audio models. In our previous work, we showed that it can be used to reconstruct an interpretation in the form of an audible signal and to extract characteristics representative of each class. However, there is still a trade-off between the quality of the explanations obtained and the performance of the model.
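
As a rough illustration of the factorisation itself, the sketch below applies generic NMF with the standard multiplicative updates for the Frobenius norm to a toy non-negative matrix. It is not the project's method: the function name `nmf`, the rank, the iteration count and the random toy data are all assumptions made for the example. In an audio setting, V would typically be a magnitude spectrogram, the columns of W spectral templates, and the rows of H their activations over time.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x frames) as W @ H.

    Generic multiplicative-update NMF (Frobenius norm); a hypothetical
    helper for illustration, not the project's implementation.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    # Strictly positive random initialisation keeps the updates well defined.
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for a magnitude spectrogram: 64 "frequency bins" x 100 "frames",
# built from 3 non-negative components (made-up data).
rng = np.random.default_rng(1)
V = rng.random((64, 3)) @ rng.random((3, 100))

W, H = nmf(V, rank=3)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Because every entry of W and H is non-negative, each component only adds energy to the reconstruction, which is what makes the parts individually inspectable and, for spectrograms, potentially audible once resynthesised.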

My preliminary work shows that NMF is a promising approach for explaining the decisions of an audio segmentation system. However, several obstacles remain: (1) reconstructing the audible explanation is difficult, particularly when self-supervised models are used to represent the signal; (2) it is difficult to associate the extracted explanations with high-level explanatory factors (e.g., what is the impact of pitch on the decision?); (3) temporal dependencies are not taken into account when extracting explanations.

The proposed project aims to address these three issues by (1) modifying the system's optimisation scheme to promote reconstruction during learning, (2) constraining part of the representation space to encode explicit information about the signal (e.g., F0, intonation) so that decisions can be interpreted in terms of these factors, and (3) modifying the system architecture to integrate attention mechanisms that take temporal dependencies into account in the segmentation explanation.

Boosting Audio Segmentation Interpretability with Non-negative Matrix Factorization (BASIN-NMF)