New Corpora

Enhanced CALLFRIEND Egyptian Arabic: CALLFRIEND Egyptian Arabic Second Edition: developed by LDC, 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic, with audio files in .wav format, a simplified directory structure and additional documentation and metadata

WSJ Discourse Relations: Penn Discourse Treebank Version 3.0: the third release in the Penn Discourse Treebank project, over 50,000 tokens of annotated discourse relations from the WSJ section of Treebank-2 (LDC95T7), standardized pairwise annotations, new senses, and tools for annotation, adjudication, and file conversion

Chinese Amateur Web Videos and Transcripts: VAST Chinese Speech and Transcripts: developed by LDC for the VAST (Video Annotation for Speech Technologies) project, 29 hours of Mandarin Chinese audio from amateur web video content with corresponding time-aligned transcripts

Committed Belief Annotation: DEFT Chinese Committed Belief Annotation: developed by LDC, 83,000 tokens of Chinese discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text

Lithuanian Speech: IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b: developed by Appen, 210 hours of Lithuanian conversational and scripted telephone speech with transcripts collected in 2013 and 2014 from speakers aged 16 to 67 years old using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Dialectal Arabic Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Arabic Group: 117 hours of telephone speech in distinct dialects of colloquial Arabic (Iraqi, Levantine and Maghrebi) collected by LDC for research and technology evaluation in automatic language identification

Translated ATIS Utterances: Multilingual ATIS: developed by Google Inc., 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish, originally collected in English in the early 1990s to support the research and development of speech understanding systems

Egyptian Arabic/English Parallel Web Text: BOLT Arabic Discussion Forum Parallel Training Data: developed by LDC, 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations

Task-oriented Speech from Middle School Students: SRI Speech-Based Collaborative Learning Corpus: developed by SRI International, 120 hours of English speech from 134 US middle school students working collaboratively on mathematics problems with orthographic transcriptions, manual annotation, log files, and supporting documentation; created to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality 

Resources for TAC KBP Entity Discovery and Linking Task: TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015: queries, knowledge base (KB) links, equivalence class clusters for NIL entities, entity type information for queries, source documents (Chinese, English and Spanish newswire and web text) and BaseKB, the reference KB adopted in 2015