KUTED – Laboratoire d'Informatique de l'Université du Mans

Oct 7, 2024 | Emmanuelle Billard | Software/Corpus, Production | LST

Description

Kurdish TED (KUTED) is the first Speech-to-Text-Translation (S2TT) dataset for the Central Kurdish language, derived from TED Talks and TEDx talks. The corpus consists of 91,000 pairs, encompassing 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. The dataset is evaluated on end-to-end (E2E) S2TT, cascaded S2TT, and text-to-text translation (T2TT) tasks.
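To give a feel for the corpus scale, the totals quoted above can be turned into rough per-pair averages. The following is a minimal sketch that only uses the rounded figures from the description; the function and variable names are illustrative and not part of any KUTED tooling, and the results are approximations rather than official statistics.

```python
# Corpus-level totals as quoted in the description (rounded by the source).
PAIRS = 91_000
AUDIO_HOURS = 170
EN_TOKENS = 1_650_000
CKB_TOKENS = 1_400_000


def corpus_averages(pairs, audio_hours, en_tokens, ckb_tokens):
    """Derive approximate per-pair averages from corpus-level totals."""
    return {
        "seconds_per_pair": audio_hours * 3600 / pairs,
        "en_tokens_per_pair": en_tokens / pairs,
        "ckb_tokens_per_pair": ckb_tokens / pairs,
        "ckb_to_en_token_ratio": ckb_tokens / en_tokens,
    }


averages = corpus_averages(PAIRS, AUDIO_HOURS, EN_TOKENS, CKB_TOKENS)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Under these rounded totals, an average pair corresponds to roughly 6.7 seconds of English audio, about 18 English tokens, and about 15 Central Kurdish tokens.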

KUTED can be used for the following tasks:

Speech-to-Text-Translation (EN->CKB)
Speech-to-Speech-Translation (EN->CKB)
Text-to-Text-Translation (EN->CKB and CKB->EN)
Automatic Speech Recognition (EN)

How to Get Started with KUTED

Participants:

Aran Emini (LIUM, Le Mans University)
Antoine Laurent (LIUM, Le Mans University)
Josep Crego (Systran, Paris, France)
Daban Jaaf (Erfurt University, Germany)

Corpus: Kurdish TED (KUTED)

Description