Corpus: Kurdish TED (KUTED)

Licence: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
URL: https://huggingface.co/datasets/aranemini/kurdishted


Description

Kurdish TED (KUTED) is the first Speech-to-Text-Translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus consists of 91,000 pairs, encompassing 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. It has been evaluated on end-to-end (E2E) S2TT, cascaded S2TT, and text-to-text translation (T2TT) tasks.
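
As a rough sanity check on the figures above, the per-pair averages can be worked out directly. The sketch below uses only the rounded totals quoted in the description, so the results are approximate. The function and variable names are illustrative and not part of the dataset.

```python
# Rounded corpus totals as quoted in the description (approximate values).
PAIRS = 91_000
AUDIO_HOURS = 170
EN_TOKENS = 1_650_000
CKB_TOKENS = 1_400_000


def per_pair_averages(pairs, audio_hours, en_tokens, ckb_tokens):
    """Return average audio duration (seconds) and token counts per pair."""
    return {
        "audio_seconds": audio_hours * 3600 / pairs,
        "en_tokens": en_tokens / pairs,
        "ckb_tokens": ckb_tokens / pairs,
    }


averages = per_pair_averages(PAIRS, AUDIO_HOURS, EN_TOKENS, CKB_TOKENS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

With these totals, an average pair is about 6.7 seconds of English audio, about 18 English tokens, and about 15 Central Kurdish tokens.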

 

KUTED can be used for the following tasks:

  • Speech-to-Text-Translation (EN->CKB)
  • Speech-to-Speech-Translation (EN->CKB)
  • Text-to-Text-Translation (EN->CKB and CKB->EN)
  • Automatic Speech Recognition (EN)

How to Get Started with KUTED
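
A minimal sketch of one way to load the corpus, assuming the Hugging Face `datasets` library is installed and that the repository at the URL above can be read without extra configuration. The helper names `repo_id_from_url` and `load_kuted` are hypothetical. Split names and column names are not stated here, so the example prints whatever the Hub reports instead of assuming them.

```python
from urllib.parse import urlparse

# Dataset page as listed in this entry.
KUTED_URL = "https://huggingface.co/datasets/aranemini/kurdishted"


def repo_id_from_url(url):
    """Turn a Hugging Face dataset page URL into a '<user>/<name>' repo id."""
    parts = urlparse(url).path.strip("/").split("/")
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    return "/".join(parts)


def load_kuted(split=None, streaming=True):
    """Load KUTED from the Hub (hypothetical helper, not an official API).

    Streaming is on by default (an assumption made here) so that the audio
    is not downloaded in full before the data can be inspected.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset(repo_id_from_url(KUTED_URL), split=split, streaming=streaming)


if __name__ == "__main__":
    # Requires network access; shows the available splits as reported by the Hub.
    print(load_kuted())
```

Passing `streaming=False` would download the full corpus to the local cache, which may be preferable for repeated training runs.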

 

Participants:

  • Aran Emini (LIUM, Le Mans University)
  • Antoine Laurent (LIUM, Le Mans University)
  • Josep Crego (Systran, Paris, France)
  • Daban Jaaf (Erfurt University, Germany)