Enriched end-to-end approach with variable complexity for speech-to-text

Starting: 01/10/2023
PhD Student: Youness Dkhissi
Advisor(s): Anthony Larcher
Co-advisor(s): Stéphane Pateux, Valentin Vielzeuf, Elys Allesiardo (Orange)
Funding: CIFRE

Global context and challenges of the subject
Speech transcription is an essential tool in many services: it allows a user's request, formulated in natural language, to be transcribed for a dialogue system, and it can also be used to transcribe exchanges during meetings (note-taking, summary generation, etc.).
The first approaches to speech transcription were so-called hybrid systems, which first extracted the sound units that make up speech (e.g. phonemes) and then, using a language model, reconstructed the words that were spoken. With the rise of Deep Learning, and of Transformers in particular, we are now seeing the emergence of End-to-End approaches that carry out this transcription with a single neural model.
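To make the two-stage hybrid pipeline concrete, here is a minimal toy sketch in Python: an acoustic stage is assumed to have produced a phoneme sequence, and a pronunciation lexicon combined with a simple unigram language model then selects the most probable word. The lexicon entries, word list, and probabilities are hypothetical illustrations, not real model outputs.

```python
# Toy sketch of the two-stage "hybrid" pipeline described above.
# Stage 1 (assumed done): an acoustic model proposes a phoneme sequence.
# Stage 2 (shown here): a pronunciation lexicon plus a unigram language
# model reconstructs the most probable word. All values are made up.

# Pronunciation lexicon: phoneme sequence -> candidate words
LEXICON = {
    ("hh", "eh", "l", "ow"): ["hello"],
    ("w", "er", "l", "d"): ["world", "whirled"],
}

# Unigram language model: prior probability of each word
LM = {"hello": 0.6, "world": 0.3, "whirled": 0.001}

def decode_hybrid(phonemes):
    """Map a phoneme sequence to the most probable word under the LM."""
    candidates = LEXICON.get(tuple(phonemes), [])
    if not candidates:
        return None
    return max(candidates, key=lambda w: LM.get(w, 0.0))

# The language model resolves the homophone ambiguity ("world" vs
# "whirled") in favour of the more probable word.
print(decode_hybrid(["w", "er", "l", "d"]))
```

An End-to-End model, by contrast, would replace both stages with a single neural network mapping audio directly to text, which is precisely what removes the need for a hand-built lexicon.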
The performance of these transcription tools continues to improve, but this often comes with increased complexity and latency due to the use of Transformer-based architectures. A major challenge is therefore to find a solution that offers the best performance/complexity/latency trade-off, particularly for an operator such as Orange.
In addition, speech transcription raises related problems such as voice activity detection, the separation of speaker turns (diarisation), the extraction of voice attributes (accent/language, emotion, etc.), and the analysis of the voice itself.
Scientific objective – results and barriers to be overcome

The added value of this thesis lies in working on speech transcription tools, and in particular on End-to-End approaches [1], which currently offer the best performance in the academic literature.

The PhD student will develop new neural architectures implementing a multi-output approach, offering a range of performance/complexity/latency trade-offs so as to best meet application constraints.
The student will have the opportunity to work in a team at the cutting edge of Speech-to-Text solutions, with the possibility of evaluating the contribution of these solutions in a concrete application context.
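One common way to obtain such a range of trade-offs from a single model is a multi-exit architecture, where intermediate layers each produce a transcription hypothesis and the caller stops at the deepest exit that fits its latency budget. The sketch below illustrates this idea with hypothetical layer costs and hypotheses; it is an illustrative toy, not a description of the architecture the thesis will actually develop.

```python
# Hypothetical sketch of a multi-output (multi-exit) model: each
# intermediate exit yields a transcription hypothesis, so the caller
# can trade accuracy for latency by stopping early. Layer costs and
# hypotheses are made-up illustrative values, not a real ASR model.

LAYERS = [
    # (cost in ms, hypothesis produced at this exit)
    (10, "the cat sad"),       # early exit: fast but rough
    (10, "the cat sat"),       # mid exit: better
    (20, "the cat sat down"),  # final exit: best, but slowest
]

def transcribe(budget_ms):
    """Return (hypothesis, spent_ms) using the deepest exit whose
    cumulative cost fits within budget_ms."""
    spent, hyp = 0, None
    for cost, hypothesis in LAYERS:
        if spent + cost > budget_ms:
            break
        spent += cost
        hyp = hypothesis
    return hyp, spent

print(transcribe(15))  # only the first exit fits the budget
print(transcribe(45))  # full-depth decoding
```

A single trained model can thus serve several application constraints at once, rather than training one model per operating point.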


[1] J. Li, "Recent Advances in End-to-End Automatic Speech Recognition," arXiv:2111.01690, 2021.
[2] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, Attend and Spell," arXiv:1508.01211, 2015.
[3] A. Gulati et al., "Conformer: Convolution-augmented Transformer for Speech Recognition," Interspeech 2020.
[4] J. Yu et al., "Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling," ICLR 2021.
[5] OpenAI, "Whisper: Robust Speech Recognition via Large-Scale Weak Supervision," https://openai.com/research/whisper