{"id":26743,"date":"2026-01-26T16:24:32","date_gmt":"2026-01-26T15:24:32","guid":{"rendered":"https:\/\/lium.univ-lemans.fr\/?p=26743"},"modified":"2026-01-27T12:03:24","modified_gmt":"2026-01-27T11:03:24","slug":"corpus-commute-kurdish","status":"publish","type":"post","link":"https:\/\/lium.univ-lemans.fr\/en\/corpus-commute-kurdish\/","title":{"rendered":"corpus COMMUTE-Kurdish"},"content":{"rendered":"<div class=\"panel-grid\" id=\"pg-26743-0\" ><div class=\"panel-grid-core\"><div class=\"panel-grid-cell\" id=\"pgc-26743-0-0\" ><div class=\"panel-widget-style\" ><h2 style=\"color: #e5442d;\"><b>Software: <\/b>corpus COMMUTE-Kurdish (corpus COMMUTE-Kurdish)<br\/><\/h2><p><div class=\"table-responsive\" style=\"margin-left:-20px;\" ><table class=\"table\" style=\"border: 0 solid #ffffff;\"><tr class=\"col-sm-12\" ><td style=\"border: 0;\"><br\/><b>Author(s): <\/b><\/td><td style=\"border: 0\"><center><a href=\"https:\/\/lium.univ-lemans.fr\/en\/team\/mohammad-mohammadamini\/\" target=\"_blank\" ><img alt=\"User Pic\" src=https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2024\/04\/Aran-Mohammadamini.jpeg class=\"img-circle img-responsive\" height=\"60\" width=\"60\"><\/a><a href=\"https:\/\/lium.univ-lemans.fr\/en\/team\/mohammad-mohammadamini\/\" target=\"_blank\" ><b style=\"color:#e5442d;\"><span style=\"font-size: 8pt;\">Mohammad  Mohammadamini<\/span><\/b><\/a><\/center><\/td><td style=\"border: 0\"><center><a href=\"http:\/\/perso.univ-lemans.fr\/~mtahon\/\" target=\"_blank\" ><img alt=\"User Pic\" src=https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2019\/01\/JFB_5351_DxO-e1548684812296.jpg class=\"img-circle img-responsive\" height=\"60\" width=\"60\"><\/a><a href=\"http:\/\/perso.univ-lemans.fr\/~mtahon\/\" target=\"_blank\" ><b style=\"color:#e5442d;\"><span style=\"font-size: 8pt;\">Marie Tahon<\/span><\/b><\/a><\/center><\/td><td style=\"border: 0\"><center><a href=\"http:\/\/www.antoine-laurent.fr\/\" target=\"_blank\" ><img alt=\"User Pic\" 
src=https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2020\/10\/Antoine-Laurent.jpg class=\"img-circle img-responsive\" height=\"60\" width=\"60\"><\/a><a href=\"http:\/\/www.antoine-laurent.fr\/\" target=\"_blank\" ><b style=\"color:#e5442d;\"><span style=\"font-size: 8pt;\">Antoine Laurent<\/span><\/b><\/a><\/center><\/td><\/tr><\/table><\/div><\/div><br\/><\/p><\/div><\/div><\/div><\/div><div class=\"panel-grid\" id=\"pg-26743-1\" ><div class=\"panel-grid-core\"><div class=\"panel-grid-cell\" id=\"pgc-26743-1-0\" ><div class=\"panel-widget-style\" ><h3 style=\"color: #e5442d;\">Description <\/h3>\n<p align=\"justify\">Within the framework of the French COMMUTE project (<a href=\"https:\/\/lium.univ-lemans.fr\/en\/projet-commute\/\">https:\/\/lium.univ-lemans.fr\/en\/projet-commute\/<\/a>), 30 hours of audio data in the Central Kurdish language were collected, segmented, transcribed and translated into English. The main objective is to provide the scientific community with a carefully annotated, versatile database of spontaneous speech for the development of speech technology dedicated to spoken and written Kurdish. The database includes the following annotations:<\/p>\n<ul>\n<li> <strong>Manual segmentation<\/strong>: This data can be used to train automatic speech segmentation models,<\/li>\n<li> <strong>Speaker identity<\/strong>: This annotation is suited to speaker processing tasks such as speaker diarization or verification,<\/li>\n<li> <strong>Kurdish transcription<\/strong>: Segmented audio files have been automatically transcribed with a Kurdish speech recognition system developed at LIUM (Le Mans University). All transcriptions have been manually reviewed to correct transcription errors. 
This data is suitable for training and evaluating automatic speech recognition models, especially in the context of spontaneous speech.<\/li>\n<li> <strong>English translation<\/strong>: Kurdish transcriptions have been translated into English by native professional translators. The data is suitable for translation tasks such as speech-to-text, speech-to-speech, and text-to-text translation.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p align=\"justify\">The dataset comes from three Kurdish media outlets and was collected with the kind agreement of the authors.<\/p>\n<ul>\n<li> <strong>train<\/strong> : 9h10min from Voice of America (podcasts). This subcorpus mainly covers political and cultural topics. A large amount of this data comes from difficult channels, such as telephone. This subset comprises 19 podcasts and 4 951 segments,<\/li>\n<li> <strong>dev <\/strong> : 9h16min from Kurdistan24 (TV channel). This subset comprises 8 podcasts and 5 676 segments,<\/li>\n<li> <strong>test<\/strong> : 11h9min from the media network Rudaw. This subset comprises 23 podcasts and 7 248 segments from various domains (economy, sport, art, science, etc.).<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #e5442d;\">Data Format<\/h3>\n<p align=\"justify\">For each long audio file, the following fields are provided in an accompanying TSV file with the same name as its WAV speech file:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-data.png\" alt=\"\" width=\"650\" height=\"371\" class=\"aligncenter size-full wp-image-26744\" srcset=\"https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-data.png 650w, https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-data-300x171.png 300w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #e5442d;\">Download Data<\/h3>\n<p align=\"justify\">The train, dev, and test parts of the Kurdish-Commute dataset can be downloaded from the following 
link.<br \/>\n<a href=\"https:\/\/lium.univ-lemans.fr\/data-ext\/iwslt2026\/commute-kurdish-iwslt2026.zip\">https:\/\/lium.univ-lemans.fr\/data-ext\/iwslt2026\/commute-kurdish-iwslt2026.zip<\/a><\/p>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #e5442d;\">IWSLT 2026 challenge rules<\/h3>\n<ol>\n<li>Complementary data<\/li>\n<ul>\n<li><strong>Common Voice<\/strong>: All parts of Common Voice may be used.<br \/>\nDataset: <a href=\"https:\/\/datacollective.mozillafoundation.org\/datasets\/cmj8u3oxx004lnxxbfr04zvrt\">https:\/\/datacollective.mozillafoundation.org\/datasets\/cmj8u3oxx004lnxxbfr04zvrt<\/a><\/li>\n<li><strong>Giganet TTS<\/strong> : 10 hours of TTS data from one male speaker.<br \/>\nDataset: <a href=\"https:\/\/huggingface.co\/datasets\/TTS4ALL\/Kurdish_TTS\">https:\/\/huggingface.co\/datasets\/TTS4ALL\/Kurdish_TTS<\/a><br \/>\nDocumentation: <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2352340924007194\">https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2352340924007194<\/a><\/li>\n<li><strong>AsoSoft Text Corpus<\/strong>:<br \/>\nDataset: <a href=\"https:\/\/github.com\/AsoSoft\/AsoSoft-Text-Corpus\">https:\/\/github.com\/AsoSoft\/AsoSoft-Text-Corpus<\/a><br \/>\nDocumentation: <a href=\"https:\/\/doi.org\/10.1093\/llc\/fqy074\">https:\/\/doi.org\/10.1093\/llc\/fqy074<\/a><\/li>\n<\/ul>\n<p align=\"justify\">Participants may use the provided resources to train any model in their proposed pipelines and solutions, including ASR, TTS, S2TT, MT, and LLM models. 
<\/p>\n<p>&nbsp;<\/p>\n<li>Libraries<\/li>\n<p align=\"justify\">The AsoSoft library, which includes normalization, g2p, number conversion, etc. for Central Kurdish, may be used.<\/p>\n<p>AsoSoft library: <a href=\"https:\/\/pypi.org\/project\/asosoft\/\">https:\/\/pypi.org\/project\/asosoft\/<\/a><\/p>\n<p>&nbsp;<\/p>\n<li>Evaluation protocol<\/li>\n<ul>\n<li>BLEU and chrF++ will be the main evaluation metrics.<\/li>\n<li>A baseline Whisper model was trained, giving the following results on the dev and test parts:<br \/>\n<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-baseline-1024x173.png\" alt=\"\" width=\"1024\" height=\"173\" class=\"aligncenter size-large wp-image-26745\" srcset=\"https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-baseline-1024x173.png 1024w, https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-baseline-300x51.png 300w, https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-baseline-768x130.png 768w, https:\/\/lium.univ-lemans.fr\/wp-content\/uploads\/2026\/01\/Commute-baseline.png 1254w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<li>Baseline model<\/li>\n<p align=\"justify\">The baseline Whisper v3 model can be downloaded from the following link. The baseline model is fine-tuned on the train part of the Kurdish-Commute dataset. <\/p>\n<p>&nbsp;<\/p>\n<li>Evaluation<\/li>\n<p align=\"justify\">The final evaluation will be based on the BLEU and chrF++ scores on the test partition. The Kurdish transcriptions and English translations of the dev partition are already shared with participants so that they can evaluate the performance of their systems. The transcriptions of the test partition will be shared later. The evaluation link for the test set will be shared in this section later. <\/p>\n<\/ol>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #e5442d;\">Important dates<\/h3>\n<p>The deadlines will be the same as those of IWSLT. 
<\/p>\n<p>&nbsp;<br \/>\n<strong>Organizers:<\/strong><\/p>\n<ul>\n<li>Mohammad Mohammadamini, LIUM, Le Mans University, France <\/li>\n<li>Marie Tahon,  LIUM, Le Mans University, France <\/li>\n<li>Antoine Laurent,  PyannoteAI &#038; LIUM, Le Mans University, France<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3 style=\"color: #e5442d;\">Contact person:<\/h3>\n<p><strong>Mohammad Mohammadamini<\/strong>: mohammad.mohammadamini(@)univ-lemans.fr<\/p><\/div><\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Software: corpus COMMUTE-Kurdish (corpus COMMUTE-Kurdish)Author(s): Mohammad MohammadaminiMarie TahonAntoine LaurentDescription Within the framework of the French COMMUTE project (https:\/\/lium.univ-lemans.fr\/en\/projet-commute\/), 30 hours of audio data in Central Kurdish language were collected, segmented, transcribed and translated into English. The main objective is to provide the scientific community with a spontaneous speech database, carefully annotated, polyvalent, for the development [&hellip;]<\/p>\n<p class=\"more-link style2\"><a href=\"https:\/\/lium.univ-lemans.fr\/en\/corpus-commute-kurdish\/\"  class=\"themebutton\"><span class=\"more-text\">READ MORE<\/span><span class=\"more-icon\"><i class=\"fa fa-angle-right 
fa-lg\"><\/i><\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":17309,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[48,47],"tags":[49],"acf":[],"_links":{"self":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26743"}],"collection":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/comments?post=26743"}],"version-history":[{"count":6,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26743\/revisions"}],"predecessor-version":[{"id":26751,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/posts\/26743\/revisions\/26751"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media\/17309"}],"wp:attachment":[{"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/media?parent=26743"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/categories?post=26743"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lium.univ-lemans.fr\/en\/wp-json\/wp\/v2\/tags?post=26743"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}