SAVID – Speaker and Audiovisual Interpretable Deepfake Detection
Supervisors: Marie Tahon (director) and Aghilas Sini (co-supervisor) at LIUM; Arnaud Delhay (co-director) and Damien Lolive (co-supervisor) at IRISA
Hosting teams: LST (LIUM) and EXPRESSION (IRISA)
Location: Le Mans
Start date: October 2025
Contact: aghilas.sini(at)univ-lemans.fr, arnaud.delhay(at)irisa.fr
Description:
The proliferation of text-to-speech and face synthesis models has led to a significant increase in spoofing attacks, in which audio-visual identities are falsified. This raises major issues of security and trust in digital communications. Current detection techniques, based on neural models that exploit spectral or visual representations, offer limited interpretability, which hampers understanding of the factors that cause detection and localisation systems to fail.
Objectives:
This thesis aims to build an efficient audiovisual identity verification system that detects deepfakes and locates corrupted segments, regardless of the language spoken. To achieve this objective, the work is organised in three main phases:
Phase 1: Speaker verification and multimodal deepfake detection
The aim of this phase is to establish a baseline by testing various audio/video combinations (authentic/falsified) in order to evaluate the initial performance of the existing system presented in [CGA+24]. It also includes data qualification: analysing system errors, detecting poorly annotated data, and analysing error factors associated with speakers and attack types.
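To make the baseline protocol concrete, the sketch below shows one possible evaluation loop over the four audio/video authenticity conditions. It is only an illustration under stated assumptions: `load_pairs` and `detector.score` are hypothetical placeholders, not an API from [CGA+24] or any existing toolkit.

    # Hypothetical sketch of the Phase 1 baseline evaluation: score every
    # audio/video authenticity combination and report a per-condition average.
    from itertools import product
    from statistics import mean

    CONDITIONS = list(product(["real", "fake"], repeat=2))  # (audio, video) states

    def evaluate(detector, load_pairs):
        """Score each authenticity condition with a given multimodal detector."""
        results = {}
        for audio_state, video_state in CONDITIONS:
            pairs = load_pairs(audio=audio_state, video=video_state)
            # detector.score() is assumed to return P(sample is fake) in [0, 1]
            results[(audio_state, video_state)] = mean(
                detector.score(audio, video) for audio, video in pairs
            )
        return results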
Phase 2: Fine-grained analysis of partially corrupted samples
The aim is to segment audiovisual recordings in order to locate falsified segments. Two types of segmentation will be investigated: (a) global segmentation based on utterance-level speaker embeddings (x-vectors [SGRS+18]) and (b) finer, frame-level segmentation based on self-supervised (SSL) representations (WavLM [CWC+22]).
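As a minimal illustration of the frame-level option, the sketch below extracts WavLM representations with the HuggingFace `transformers` library and attaches a toy per-frame classifier; the classifier head and the choice of checkpoint are assumptions for illustration, not a method prescribed by the project.

    # Sketch of Phase 2 frame-level feature extraction with WavLM.
    # `waveform_16khz` is assumed to be a 1-D numpy array sampled at 16 kHz.
    import torch
    import torch.nn as nn
    from transformers import AutoFeatureExtractor, WavLMModel

    extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

    def frame_scores(waveform_16khz, head: nn.Module):
        """Return one real/fake score per WavLM frame (roughly every 20 ms)."""
        inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            frames = wavlm(**inputs).last_hidden_state  # shape (1, T, 768)
        return torch.sigmoid(head(frames)).squeeze(-1)   # shape (1, T)

    head = nn.Linear(768, 1)  # toy per-frame binary classifier (placeholder)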
Phase 3: Building an interpretable and explainable latent space
To structure the latent space of the system, we plan to use prototype-based methods [AMO+24] to encode determining factors (voice quality, lip movements) and to improve the explainability of the system by identifying the elements that contribute to its decisions.
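The sketch below gives a generic example of prototype-based structuring, where embeddings are pulled toward fixed, predefined class prototypes. It is a simplified stand-in for the general idea behind [AMO+24], not a faithful re-implementation of that paper; the temperature value and prototype construction are assumptions.

    # Generic illustration of prototype-based latent-space structuring.
    import torch
    import torch.nn.functional as F

    def prototype_loss(embeddings, labels, prototypes):
        """embeddings: (B, D), labels: (B,), prototypes: (C, D) fixed vectors."""
        # Cosine similarity between each embedding and every class prototype
        sims = F.normalize(embeddings, dim=-1) @ F.normalize(prototypes, dim=-1).T
        # Treat similarities as logits: each sample should be closest to the
        # prototype of its own class (e.g. bona fide vs. attack type).
        return F.cross_entropy(sims / 0.1, labels)  # 0.1 = temperature (assumed)

    # Example: 2 classes (bona fide / fake) in a 16-dimensional latent space,
    # with predefined (non-learned) orthogonal prototypes.
    protos = F.one_hot(torch.arange(2), num_classes=16).float()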
The advances expected from this thesis include:
- A robust and interpretable audiovisual identity verification system,
- A better understanding of the factors leading to the failure of deepfake detection systems,
- Precise segmentation and localisation techniques for fake segments,
- A structured latent space that can be explained, enabling the elements contributing to the decision to be identified.
Desired Profile:
Master's degree or engineering school diploma in Computer Science, with a focus on AI, natural language processing or cybersecurity.
Application:
To apply for this thesis, please send your application (CV and cover letter) to Aghilas SINI (aghilas.sini(at)univ-lemans.fr) and Arnaud DELHAY (arnaud.delhay(at)irisa.fr) before 4 April 2025. Applications will be reviewed on a rolling basis. This call for thesis funding is open exclusively to students who are nationals of the European Union, the United Kingdom or Switzerland.
References
- [AMO+24] Antonio Almudévar, Théo Mariotte, Alfonso Ortega, Marie Tahon, Luis Vicente, Antonio Miguel, and Eduardo Lleida. Predefined prototypes for intra-class separation and disentanglement. arXiv preprint arXiv:2406.16145, 2024.
- [CGA+24] Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, and Kalin Stefanov. AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7414–7423, 2024.
- [CWC+22] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
- [SGRS+18] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333, 2018.