Classic Corpora in LDC’s Catalog: TIMIT

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems, it contains recordings of 630 American English speakers each reading ten phonetically rich sentences, for a total of 6300 utterances comprising 2342 distinct sentences. Data collection and annotation were a joint effort by Texas Instruments, the Massachusetts Institute of Technology and SRI International, and the data release was prepared by NIST (National Institute of Standards and Technology).  

TIMIT was among the first publications to appear with the launch of LDC’s Catalog in 1993. It remains one of the Consortium’s ten most distributed corpora and may be the single most widely used speech database. Despite its age and small size relative to modern data sets, TIMIT’s wide range of phonetically representative inputs, its time-aligned lexical and phonemic transcripts, and its easy availability through the LDC Catalog have contributed to its widespread use and continued popularity. Thousands of researchers remember its famous first sentence: “She had your dark suit in greasy wash water all year.”

LDC continues the TIMIT series with its Global TIMIT project, which aims to create a series of corpora in a variety of languages with TIMIT-like features (Chanchaochai et al., 2018). Data sets published from that project include Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages, and more data is added each year. All TIMIT corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

LDC Data Life Cycle

All data sets in LDC’s public catalog possess the attributes required by best practices for digital repositories. Once a resource is ingested, LDC provides descriptive metadata, robust licensing, and secure archival storage along with the technology and infrastructure for data delivery and access.

LDC Submissions: share your data today

LDC Submissions is a platform that provides infrastructure and resources for sharing data through the Catalog. After registering for a user account, corpus submitters can create a submission, upload files, and communicate with LDC’s publications team during the review process. After all reviews are complete, the final, release-ready version of your data set is uploaded to the platform and enters the publications queue. 

Sharing your corpus through LDC ensures access to the global research community and the permanent preservation of your data according to best practices for archiving digital language resources. Get started and register for an LDC Submissions user account today.