This is the TED-LIUM corpus release 1,
licensed under Creative Commons BY-NC-ND 3.0 (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en).
The TED-LIUM corpus consists of English-language TED talks with transcriptions, with audio sampled at 16 kHz. It contains about 118 hours of speech.
More details are given in this paper:
A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus”, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), May 2012.
Please cite this reference if you use these data in your research work.
You can download it (~20GB) here: https://projets-lium.univ-lemans.fr/ted-lium/release1/
SPH format info:
Sample Rate: 16000
Bit Rate: 256k
Sample Encoding: 16-bit Signed Integer PCM
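
The three values above are consistent with each other if the audio is single-channel. The following is a minimal sketch of that arithmetic, not part of the corpus tooling. The channel count is an assumption (the listing does not state it), and the function name pcm_bit_rate is purely illustrative.

```python
# Values taken from the SPH format listing above.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16

# Assumption (not stated in the listing): the audio is single-channel.
NUM_CHANNELS = 1


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, num_channels: int) -> int:
    """Return the bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * num_channels


bit_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, NUM_CHANNELS)

# Raw sample data per hour of audio, in bytes (file headers not included).
bytes_per_hour = bit_rate // 8 * 3600

print(f"bit rate: {bit_rate} bit/s")
print(f"audio data per hour: {bytes_per_hour / 1e6:.1f} MB")
```

Under the mono assumption, 16000 samples per second at 16 bits each gives 256000 bit/s, which matches the "256k" bit rate listed.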