Multilingual Multimodal Voice Translation - Expressive (TV2M-E)

Date: 06/2024 - 06/2026
Funding: Région Pays de la Loire
Call: PULSAR
URL: https://lium.univ-lemans.fr/en/tv2m-e/


LIUM Participant(s):
Aghilas Sini

Summary
A bilingual or polyglot speaker has the ability to communicate coherently in several languages, adapting to different contexts. Transferring this skill to machines could contribute to the preservation of cultural heritage by maintaining less privileged languages, facilitating interaction between people of different cultures and languages, and reinforcing safety measures.

Multimodal and multilingual expressive speech translation is an active field of research that spans several areas of natural language and speech processing. Traditionally, machine translation, speech recognition and speech synthesis have been addressed separately, but neural approaches merge these processes, which can reduce the errors that accumulate from one stage to the next. However, training these architectures requires huge amounts of data and dedicated computing infrastructure, such as GPUs or TPUs.

The emergence of models such as BERT and GPT-3 has considerably improved systems for automatic language generation, recognition and understanding. Generative Transformer language models open up new perspectives, marking a significant evolution in the field of natural language processing.

Similar open-source projects, such as BLOOM and Megatron, are under development. A new generation of multimodal and multilingual neural models, such as Data2Vec, SpeechT5, mSLAM, UNIMO and VATLM, aims to create a unified learning framework for text and speech. Multimodal pre-training opens up promising research opportunities, particularly in speech translation and cross-modal translation, as well as in the recognition and generation of multimodal data.

As part of my research, I want to explore these paradigms to improve expressive speech translation algorithms. The aim is to develop translation abilities that take into account not only the linguistic content, but also the expressiveness of the source utterances, thus offering a more faithful and complete translation into the target language.