Corpus: ALLIES (Corpus ALLIES)

The ALLIES Corpus was produced within the European CHIST-ERA project ALLIES. The project made it possible to run an evaluation campaign for across-time diarization systems on French broadcast news data. It extends the earlier ESTER, REPERE and ETAPE evaluation campaigns that were carried out for the French language in this field.

This corpus is based on the material used for the ESTER 1&2 (including 128 files from EPAC), REPERE and ETAPE evaluation packages, together with new data collected since 2014 (see the ELRA Catalogue for the respective packages). The ALLIES corpus was built as an extension of these previously produced corpora. It contains corrected annotations from the earlier evaluation materials as well as new audio data with corresponding transcriptions. Corrections include corrected speaker names and re-segmentation.

The segmentation tasks are sound event segmentation, speaker tracking and speaker segmentation, detailed as follows:

  • Sound event segmentation consists of tracking the parts that contain music (with or without speech) and the parts that contain speech (with or without music).
  • Speaker tracking consists of detecting the parts of the document that correspond to a given speaker.
  • Speaker segmentation consists of segmenting the document by speaker and grouping the parts spoken by the same speaker.
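To make the difference between speaker tracking and speaker segmentation concrete, the following minimal sketch shows the kind of output each task is expected to produce. It works on a hypothetical toy list of annotated turns rather than on audio, so it only illustrates the output shapes and is not a detection system. The function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy annotation: (start_seconds, end_seconds, speaker_label)
turns = [
    (0.0, 4.0, "A"),
    (4.0, 9.0, "B"),
    (9.0, 12.0, "A"),
    (12.0, 15.0, "A"),
]


def track_speaker(turns, target):
    """Speaker tracking: merged time intervals in which `target` speaks."""
    merged = []
    for start, end, speaker in sorted(turns):
        if speaker != target:
            continue
        if merged and start <= merged[-1][1]:
            # Contiguous or overlapping with the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_by_speaker(turns):
    """Speaker segmentation: all segments of the document, grouped by speaker."""
    groups = defaultdict(list)
    for start, end, speaker in sorted(turns):
        groups[speaker].append((start, end))
    return dict(groups)


tracked = track_speaker(turns, "A")
grouped = group_by_speaker(turns)
```

Tracking answers a question about one known speaker, whereas segmentation covers every speaker in the document without knowing them in advance.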



The corpus contains:

  • 1176 WAV files (around 500 hours of speech)
  • 1176 TRS files (speaker turns and orthographic transcriptions)
  • A train/test partition
    • Train: 545 + 128 files
    • DiarTest-SeenShows: 181 files from shows already present in the train split
    • DiarTest-UnseenShows: 286 files from shows that are not in the train split
    • FullTest-CleanAnnot: 35 manually checked files, with music and noise annotations
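As an illustration of how the speaker turns in a TRS file might be read, here is a minimal sketch using Python's standard library. It assumes the files follow the usual Transcriber XML layout (a `Speakers` list plus `Turn` elements carrying `speaker`, `startTime` and `endTime` attributes). That layout, the sample content and the function name `speaker_turns` are assumptions and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the usual Transcriber layout (not taken from the corpus).
SAMPLE_TRS = """<Trans>
  <Speakers>
    <Speaker id="spk1" name="Speaker A"/>
    <Speaker id="spk2" name="Speaker B"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="12.5">
      <Turn speaker="spk1" startTime="0" endTime="4.0"><Sync time="0"/>bonjour</Turn>
      <Turn startTime="4.0" endTime="5.5"><Sync time="4.0"/></Turn>
      <Turn speaker="spk2" startTime="5.5" endTime="12.5"><Sync time="5.5"/>merci</Turn>
    </Section>
  </Episode>
</Trans>"""


def speaker_turns(root):
    """Return (start, end, speaker_name) tuples for every turn that has a speaker."""
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    result = []
    for turn in root.iter("Turn"):
        # The speaker attribute may be absent (no speaker) or may list
        # several space-separated ids (overlapping speech).
        speaker_ids = (turn.get("speaker") or "").split()
        start = float(turn.get("startTime"))
        end = float(turn.get("endTime"))
        for speaker_id in speaker_ids:
            result.append((start, end, names.get(speaker_id, speaker_id)))
    return result


parsed_turns = speaker_turns(ET.fromstring(SAMPLE_TRS))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE_TRS)`, where `path` points to one of the TRS files.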


Overall, the ALLIES Corpus contains about 900 hours of broadcast news, including orthographic transcriptions, speaker annotations and segmentation.