Corpus: TED-LIUM Release 3

Licence: Creative Commons BY-NC-ND 3.0 (attribution/non-commercial/no-derivatives)


This is the TED-LIUM corpus release 3,
licensed under Creative Commons BY-NC-ND 3.0 (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en).

 

All talks and text are property of TED Conferences LLC.

 

This new TED-LIUM release was made through a collaboration between the Ubiqus company and LIUM (University of Le Mans, France).

 

Contents:

– 2351 audio talks in NIST SPHERE format (SPH), including the talks from TED-LIUM 2. Be careful: these are the same talks but not the same audio files, and only the audio files from this release must be used with the TED-LIUM 3 STM files.
–> 452 hours of audio
– 2351 aligned automatic transcripts in STM format

– TED-LIUM 2 dev and test data: 19 TED talks in SPH format with corresponding manual transcriptions (see the ‘legacy’ distribution below).

– Dictionary with pronunciations (159848 entries), same file as the one included in TED-LIUM 2
– Selected monolingual data for language modeling from the publicly available WMT12 corpora: these files come from the TED-LIUM 2 release, but have been modified so that the tokenization is better suited to English
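
As an illustration of how the STM transcripts listed above might be read, here is a minimal Python sketch. It assumes the usual NIST STM line layout (file id, channel, speaker id, start time, end time, an optional <...> label, then the transcript text) and that comment lines start with ";;". These assumptions, the function names and the sample line are illustrative only and should be checked against the actual STM files in the release.

```python
from typing import Iterable, NamedTuple, Optional


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    start: float          # segment start time, in seconds
    end: float            # segment end time, in seconds
    label: Optional[str]  # optional "<...>" label field, if present
    text: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank or ';;' comment lines."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    # The first five fields are whitespace-separated; the rest is free text.
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"not enough fields in STM line: {line!r}")
    filename, channel, speaker, start, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    label = None
    if rest.startswith("<"):
        # Assumed layout: the label is a single token enclosed in <...>.
        head, _, rest = rest.partition(">")
        label = head + ">"
        rest = rest.strip()
    return StmSegment(filename, channel, speaker,
                      float(start), float(end), label, rest)


def total_duration(lines: Iterable[str]) -> float:
    """Sum the segment durations (in seconds) over an iterable of STM lines."""
    segments = (parse_stm_line(line) for line in lines)
    return sum(seg.end - seg.start for seg in segments if seg is not None)


# Hypothetical sample line, not taken from the corpus.
sample = "talk_id 1 speaker_id 12.50 17.25 <o,f0,male> hello world"
segment = parse_stm_line(sample)
```

Applied to every STM file of a distribution, total_duration gives the amount of transcribed speech, which can differ from the total audio duration because untranscribed regions are not counted.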

 

Two corpus distributions:

  • the legacy one, in which the dev and test datasets are the same as in TED-LIUM 2 (and TED-LIUM 1).
  • the ‘speaker adaptation’ one, specifically designed for experiments on speaker adaptation.

 

More details are given in this paper:

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève, “TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation”, submitted to the 20th International Conference on Speech and Computer (SPECOM 2018), September 2018, Leipzig, Germany

A preprint version is available on arXiv (and in the doc/ directory): https://arxiv.org/abs/1805.04699

 

 

You can download the corpus (~51GB) here: https://projets-lium.univ-lemans.fr/ted-lium/release3/

 

SPH format info:

Channels: 1
Sample Rate: 16000 Hz
Precision: 16-bit
Bit Rate: 256 kbit/s
Sample Encoding: 16-bit Signed Integer PCM
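
The figures above can be cross-checked against a file's own header. The following Python sketch assumes the standard NIST SPHERE layout: an ASCII header that starts with "NIST_1A", followed by the header size in bytes, then "name -type value" lines terminated by "end_head". The field names used here (channel_count, sample_rate, sample_n_bytes, sample_coding) follow the usual SPHERE convention and have not been verified against the files in this release; the header below is synthetic.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line")
    # The second 8-byte line holds the total header size, e.g. "   1024\n".
    header_size = int(data[8:16].decode("ascii").strip())
    header_text = data[16:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in header_text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # string field, e.g. "-s3"
            fields[name] = value
    return fields


def bit_rate(sample_rate: int, bytes_per_sample: int, channels: int) -> int:
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate * bytes_per_sample * 8 * channels


# Synthetic header built for this example, padded to the declared size.
example_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

info = parse_sphere_header(example_header)
rate = bit_rate(info["sample_rate"], info["sample_n_bytes"],
                info["channel_count"])
print(rate)  # 256000 bits per second, i.e. the 256k listed above
```

To inspect a real file, the first 1024 bytes could be read with open(path, "rb").read(1024) and passed to parse_sphere_header, assuming the header has the common 1024-byte size. At 256 kbit/s (32,000 bytes per second), 452 hours of audio amounts to about 52 GB of raw PCM, which is a rough sanity check only, since the download also contains other files and may be packaged differently.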