Corpus: Central Kurdish to English Pseudo-Labeled Data for Speech Translation
This repository contains a large-scale pseudo-labeled dataset of Central Kurdish audio paired with English translations. It comprises 1.7 million samples, equivalent to 3,000 hours of Kurdish audio extracted from audiobooks. The English translations were produced by a pipeline that combines a speech recognition system with a machine translation system. All samples have passed several filters, as described in the related paper.
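The labeling pipeline described above (speech recognition, then machine translation, then filtering) can be pictured with the minimal sketch below. It is illustrative only: the actual models and filters are those described in the related paper and are not reproduced here. The names `PseudoLabeledSample`, `cascade_label`, `fake_transcribe`, `fake_translate`, and `non_empty` are hypothetical, and the toy functions merely stand in for real ASR and MT systems.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PseudoLabeledSample:
    """One audio clip with its automatically generated labels (hypothetical schema)."""
    audio_path: str
    transcript: str   # Central Kurdish text produced by the ASR step
    translation: str  # English text produced by the MT step


def cascade_label(
    audio_paths: Iterable[str],
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
    keep: Callable[[PseudoLabeledSample], bool],
) -> List[PseudoLabeledSample]:
    """Run ASR, then MT, on each clip and keep only samples that pass the filter."""
    samples = []
    for path in audio_paths:
        transcript = transcribe(path)          # speech -> Kurdish text
        translation = translate(transcript)    # Kurdish text -> English text
        sample = PseudoLabeledSample(path, transcript, translation)
        if keep(sample):                       # placeholder for the paper's filters
            samples.append(sample)
    return samples


# Toy stand-ins so the sketch runs without any model (assumptions, not real systems).
def fake_transcribe(path: str) -> str:
    return f"<ckb transcript of {path}>"


def fake_translate(text: str) -> str:
    return f"<en translation of {text}>" if text else ""


def non_empty(sample: PseudoLabeledSample) -> bool:
    # A deliberately simple filter: drop samples with an empty transcript or translation.
    return bool(sample.transcript.strip()) and bool(sample.translation.strip())


labeled = cascade_label(
    ["clip_0001.wav", "clip_0002.wav"], fake_transcribe, fake_translate, non_empty
)
print(len(labeled))
```

Passing the ASR system, MT system, and filter in as callables keeps the sketch independent of any particular model, so each stage can be swapped without touching the loop.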