TTS for low-resource languages, dialects and accents (13/12/2024)

Currently, many different neural architectures are available for building a Text-to-Speech (TTS) system off the shelf. However, it is not always easy to choose the best network for a given application. In particular, the limits and drawbacks of pre-trained models are not well defined. This can be crucial for specific applications (health, human-robot interaction, etc.), but also when the model needs to be adapted to low-resource languages, dialects or accents. Indeed, a TTS system usually involves a text processing module (e.g. phonetization), an encoder which predicts a time-frequency representation, and a vocoder which generates the speech signal itself. To build these different modules, one needs to collect audio data and obtain its linguistic or phonetic transcription; to do so, NLP tools (such as the phonetizer) must be adapted to the target language. The evaluation of synthetic speech is the last bottleneck: it is not always easy to find native speakers who can accurately evaluate a synthetic speech signal in their own language, as familiarity with synthetic speech is not uniform across languages.
The aim of this day is to give an overview of 1) the difficulties of collecting, processing and managing low-resource speech data, 2) how robust existing architectures are to low-resource languages, and 3) evaluation protocols when native speakers are rare.
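
As an illustration of the pipeline described above, here is a minimal sketch in Python. Only the phonemize call is a real API (from the phonemizer package, which requires an espeak backend); encode, acoustic_model and vocoder are hypothetical placeholders for trained components such as a FastSpeech2-style encoder and a HiFi-GAN vocoder.

    import torch
    from phonemizer import phonemize  # real library; requires an espeak backend


    def synthesize(text: str, lang: str, encode, acoustic_model, vocoder) -> torch.Tensor:
        """Hypothetical three-stage TTS pipeline; all models are placeholders."""
        # 1) Text processing: grapheme-to-phoneme conversion. For a low-resource
        #    language, this module often has to be adapted or built from scratch.
        phonemes = phonemize(text, language=lang, backend="espeak")
        tokens = encode(phonemes)  # hypothetical phoneme-to-id lookup table

        # 2) Encoder / acoustic model: predicts a time-frequency representation,
        #    typically a mel spectrogram of shape (frames, n_mels).
        mel = acoustic_model(tokens)

        # 3) Vocoder: generates the speech waveform from the predicted spectrogram.
        return vocoder(mel)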

Organisation of the day
10h00 introduction
10h15 Emmett Strickland (MoDyCo) Experimental and corpus-based phonetics in Nigerian Pidgin: Challenges and perspectives
11h15 Kévin Vythelingum (Voxygen) Speech synthesis with a foreign accent from a low-resource speaker using self-supervised model representations
12h15 lunch
13h30 Marc Evrard and Philippe Boula de Mareuil (LISN) Speech synthesis for the Belgian Walloon accent
14h30 Imen Laouirine, Fethi Bougares (Elyadata) Transfer Learning-based Tunisian Arabic Text-to-Speech System
15h30 Ana Montalvo (CENATAV) Investigations for TTS in the Cuban Spanish accent
16h30 round table and discussions
17h00 end of day
 
 
Kévin Vythelingum (Voxygen) Speech synthesis with a foreign accent from a low-resource speaker using self-supervised model representations

Self-supervised pre-trained models, like wav2vec 2.0 [1], HuBERT [2] or WavLM [3], exhibit excellent performance on many speech tasks such as speech enhancement, automatic speech recognition or speaker diarization. This shows that the representations of these models carry both linguistic and speaker information. In particular, the authors of kNN-VC [4] demonstrate the voice conversion capabilities of WavLM features. Regarding text-to-speech, it is often difficult to model speakers with underrepresented characteristics, like a specific accent. In order to address this problem, we investigate the use of WavLM features to transfer the accent of speakers to a generic text-to-speech model in a low-resource scenario; a minimal sketch of the nearest-neighbour matching idea behind kNN-VC follows the references below.

  • [1] Baevski, Alexei, et al. “wav2vec 2.0: A framework for self-supervised learning of speech representations.” Advances in neural information processing systems 33 (2020): 12449-12460.
  • [2] Hsu, Wei-Ning, et al. “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.
  • [3] Chen, Sanyuan, et al. “WavLM: Large-scale self-supervised pre-training for full stack speech processing.” IEEE Journal of Selected Topics in Signal Processing 16.6 (2022): 1505-1518.
  • [4] Baas, Matthew, Benjamin van Niekerk, and Herman Kamper. “Voice conversion with just nearest neighbors.” arXiv preprint arXiv:2305.18975 (2023).
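
To illustrate the nearest-neighbour idea behind kNN-VC [4], the sketch below extracts frame-level WavLM features with the Hugging Face transformers library and replaces each source frame by the mean of its k nearest frames from a reference speaker. The layer index and k follow the kNN-VC paper's setup but should be treated as assumptions here; the vocoder that inverts the converted features back to audio is not shown.

    import torch
    import torch.nn.functional as F
    from transformers import WavLMModel

    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()


    @torch.no_grad()
    def wavlm_features(wav_16khz: torch.Tensor, layer: int = 6) -> torch.Tensor:
        """Frame-level features from one WavLM layer, shape (frames, dim).
        Layer 6 is the choice reported for kNN-VC; treat it as an assumption."""
        out = wavlm(wav_16khz.unsqueeze(0), output_hidden_states=True)
        return out.hidden_states[layer].squeeze(0)


    @torch.no_grad()
    def knn_convert(src: torch.Tensor, ref: torch.Tensor, k: int = 4) -> torch.Tensor:
        """Replace each source frame by the mean of its k nearest reference
        frames (cosine similarity): the source keeps its content while the
        reference speaker contributes voice/accent characteristics."""
        sims = F.normalize(src, dim=-1) @ F.normalize(ref, dim=-1).T  # (src, ref)
        idx = sims.topk(k, dim=-1).indices
        return ref[idx].mean(dim=1)

    # The converted feature sequence is then inverted to audio with a vocoder
    # trained on WavLM features (e.g. a HiFi-GAN), which is not shown here.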

 
Emmett Strickland (MoDyCo) Experimental and corpus-based phonetics in Nigerian Pidgin: Challenges and perspectives

This talk will present ongoing research aimed at studying the role of pitch and duration in Nigerian Pidgin, a low-resource language of West Africa. The presentation will describe a novel syntactic treebank which combines traditional morphosyntactic annotations with a wide range of phonetic features describing the segmental and suprasegmental properties of each syllable. This treebank will then be used to shed light on the prosody of certain syntactic constructions, with a focus on preverbal markers of tense, aspect, and mood (TAM).
Finally, the presentation will describe efforts to implement perceptual experiments to validate the findings from the exploration of the corpus. These are carried out using a pitch-controllable text-to-speech system trained on pre-existing field recordings. This portion of the presentation will notably highlight the difficulties of building a task-specific TTS system from a noisy corpus of spontaneous speech that was not recorded with speech synthesis in mind.

 
Imen Laouirine, Fethi Bougares (Elyadata) Transfer Learning-based Tunisian Arabic Text-to-Speech System

Being labeled a low-resource language, the Tunisian dialect has no prior TTS research. At Elyadata, we collected a mono-speaker speech corpus of over 3 hours from a male speaker, sampled at 44.1 kHz, called TunArTTS. This corpus was processed, manually diacritized and used to initiate the development of end-to-end TTS systems for the Tunisian dialect. Various TTS systems, trained both from scratch and with transfer learning, were experimented with and compared. The TunArTTS corpus is publicly available for research purposes, along with a demo of the baseline TTS system.
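
As an illustration only (the abstract does not specify the toolkit or training recipe), a warm-start transfer-learning loop in PyTorch could look like the sketch below; the model, its text_encoder submodule, the checkpoint layout and the data loader are hypothetical placeholders, not the authors' actual code.

    import torch


    def finetune(model: torch.nn.Module, ckpt_path: str, loader, lr: float = 1e-5):
        """Hypothetical warm-start recipe: load high-resource weights, freeze the
        text encoder, and fine-tune the rest on the low-resource corpus."""
        ckpt = torch.load(ckpt_path, map_location="cpu")
        # strict=False keeps only the matching weights; the text-embedding layer
        # typically has to be resized when the phoneme inventory of the new
        # dialect differs from the source language.
        model.load_state_dict(ckpt["model"], strict=False)

        # Optionally freeze the text encoder and adapt mainly the decoder side.
        for p in model.text_encoder.parameters():  # assumes a `text_encoder` submodule
            p.requires_grad = False

        opt = torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=lr
        )
        for batch in loader:         # e.g. TunArTTS mini-batches
            loss = model(**batch)    # assumes the model returns its training loss
            opt.zero_grad()
            loss.backward()
            opt.step()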

 

Ana Montalvo (CENATAV) Speech synthesis for Cuban Spanish accent

This talk explores accent-based Text-to-Speech synthesis with a specific focus on the diversity of Spanish accents. We’ll start by discussing the key differences between foreign and regional accents, examining how these distinctions could impact TTS design and implementation.
We’ll then explore how Spanish regional accents shape vowel sounds, consonant pronunciations, and intonation patterns, and discuss how these elements could be incorporated into TTS systems.

 

Marc Evrard and Philippe Boula de Mareuil (LISN) Speech synthesis for the Belgian Walloon accent

We present a text-to-speech system for Walloon, a minority language spoken in Belgium and France. For this project, we used an audio corpus derived from a translation of Le Petit Prince (The Little Prince). A native speaker was recorded to create the corpus, which was segmented into sentences and phonetized by an automatic (rule-based) system developed in-house specifically for Walloon. The synthesis system is based on the VITS architecture (Variational Inference with adversarial learning for end-to-end Text-to-Speech). Several models were trained under different conditions: single speaker, phonetic vs. graphemic transcription, with and without fine-tuning from a model pre-trained on a French corpus. An objective evaluation has been carried out, and a perceptual evaluation campaign with native speakers is currently underway. As things stand, the objective evaluation does not reveal a clear trend between the different models. Perceptually, however, the fine-tuned models seem to be preferred only in the condition with the reduced corpus.
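
The abstract does not name the objective measures used; one common choice for comparing TTS models is the mel-cepstral distortion (MCD), sketched below with librosa and DTW alignment. It is included here purely as an assumed, illustrative metric.

    import numpy as np
    import librosa


    def mcd(ref_wav: np.ndarray, syn_wav: np.ndarray, sr: int = 22050, n_mfcc: int = 13) -> float:
        """Mel-cepstral distortion (dB) between reference and synthesized audio,
        with DTW alignment to handle differing durations; c0 (energy) is dropped."""
        ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc)[1:]
        syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc)[1:]
        # Align the two cepstral sequences with dynamic time warping.
        _, path = librosa.sequence.dtw(X=ref, Y=syn, metric="euclidean")
        diff = ref[:, path[:, 0]] - syn[:, path[:, 1]]
        # Standard MCD constant: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
        return float((10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * (diff ** 2).sum(axis=0))))

Lower values indicate synthesized speech closer to the reference; when models differ by only fractions of a dB, as the "no clear trend" remark above suggests, perceptual evaluation remains the deciding factor.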