Classic Corpora in LDC’s Catalog: TIMIT | Linguistic Data Consortium

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems, it contains recordings of 630 American English speakers each reading ten phonetically rich sentences, for a total of 6300 utterances comprising 2342 distinct sentences. Data collection and annotation were a joint effort by Texas Instruments, the Massachusetts Institute of Technology and SRI International, and the data release was prepared by NIST (National Institute of Standards and Technology).

TIMIT was among the first publications to appear with the launch of LDC’s Catalog in 1993. It remains one of the Consortium’s ten most distributed corpora and may be the single most widely used speech database. Despite its age and small size relative to modern data sets, TIMIT’s wide range of phonetically representative inputs, its time-aligned lexical and phonemic transcripts, and its easy availability through the LDC Catalog have contributed to its widespread use and continued popularity. Thousands of researchers remember its famous first sentence: “she had your dark suit in greasy wash water all year”.

LDC continues the TIMIT series with its Global TIMIT project, which aims to create a series of corpora in a variety of languages with TIMIT-like features (Chanchaochai et al., 2018). Data sets published from that project include: Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages, and more data is added each year. All TIMIT corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.