New Corpora

American and South Asian English Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- English Group: 18 hours of telephone speech in American and South Asian English, labeled for gender, dialect type and noise, collected by LDC to support automatic language identification

Mining Entity Information from Chinese Text: TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014: developed by LDC for the TAC KBP Chinese Regular Slot Filling evaluation track in 2014, with queries, manual runs, final rounds of assessment results, and Chinese source documents

Mexican Spanish Speech for Acoustic Modeling: CIEMPIESS Experimentation: developed at the National Autonomous University of Mexico, 22 hours of Mexican Spanish broadcast and read speech with associated transcripts and tools for creating pronouncing dictionaries

Guarani Speech: IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c: developed by Appen, 198 hours of Guarani conversational and scripted telephone speech with transcripts, collected in 2014 and 2015 from speakers aged 16 to 67 using different telephones in a variety of environments, including the street, a home or office, a public place, and inside a vehicle

Egyptian Arabic/English Word Alignment: BOLT Egyptian-English Word Alignment – Discussion Forum Training: developed by LDC, 400,448 words of Egyptian Arabic discussion forum data and English parallel text enhanced with linguistic tags to indicate word relations

Chinese AMR: Chinese Abstract Meaning Representation 1.0: developed by Brandeis University and Nanjing Normal University, semantic representations of 10,149 sentences from the weblog and discussion forum portions of Chinese Treebank 8.0 (LDC2013T21)

Amateur Videos for Event Detection Tasks: HAVIC MED Progress Test – Videos, Metadata and Annotation: developed by LDC for the HAVIC (Heterogeneous Audio Visual Internet Collection) project, 3,650 hours of web amateur videos with annotation and metadata to support event detection research

Enhanced CALLFRIEND Egyptian Arabic: CALLFRIEND Egyptian Arabic Second Edition: developed by LDC, 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic, with audio files in .wav format, a simplified directory structure and additional documentation and metadata

WSJ Discourse Relations: Penn Discourse Treebank Version 3.0: the third release in the Penn Discourse Treebank project, over 50,000 tokens of annotated discourse relations from the WSJ section of Treebank-2 (LDC95T7), with standardized pairwise annotations, new senses, and tools for annotation, adjudication, and file conversion

Chinese Amateur Web Videos and Transcripts: VAST Chinese Speech and Transcripts: developed by LDC for the VAST (Video Annotation for Speech Technologies) project, 29 hours of Mandarin Chinese audio from amateur web video content with corresponding time-aligned transcripts

Committed Belief Annotation: DEFT Chinese Committed Belief Annotation: developed by LDC, 83,000 tokens of Chinese discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text