New Corpora

Switchboard Sentiment Analysis: Speech Sentiment Annotations: developed by Google Inc., sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of telephone speech from Switchboard-1 Release 2 (LDC97S62)

Historical English Treebanks: Penn Parsed Corpora of Historical English: developed at the University of Pennsylvania, three corpora containing part-of-speech and syntactic annotation of English texts from Middle English documents (1100 CE) to the First World War period (1914 CE); includes annotation guidelines, philological information and the CorpusSearch 2 program

Javanese Speech: IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b: developed by Appen, 204 hours of Javanese conversational and scripted telephone speech with transcripts collected in 2014-2015 from speakers aged 16 to 65 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle

Aligned and Tagged Chinese/English Parallel Telephone Speech: BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training: developed by LDC, 158,651 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations; source data consists of translated transcripts from LDC’s CALLHOME and CALLFRIEND Mandarin Chinese collections (LDC96S34, LDC96T16, LDC96S55)

Enhanced CALLFRIEND Mandarin Chinese-Taiwan Dialect: CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition: developed by LDC, 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese, updating LDC96S56 with audio files in .wav format, a simplified directory structure and additional documentation and metadata

Chinese Semantic Transparency Dataset: SemTransCNC: developed by The Hong Kong Polytechnic University, contains overall semantic transparency and constituent semantic transparency data for 1,176 dimorphemic Chinese nominal compounds

Event Nugget Detection and Coreference: TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015: English newswire and discussion forum text source documents, gold standard event nugget annotations, coreference information for the nuggets, and tokenized source documents developed by LDC for the Event Nugget Detection and Coreference tasks, which evaluated systems on the detection and coreference of sets of attributes referencing events in unstructured text

Oromo Text, Annotations and Tools for Rapid Event Response: LORELEI Oromo Incident Language Pack: developed by LDC, contains all text data, annotations, supplemental resources and software tools for the Oromo language used in the DARPA LORELEI / LoReHLT 2017 Evaluation for building human language technology for low-resource languages in the context of emergent situations like natural disasters or disease outbreaks

Knowledge Base for LORELEI Entity Annotation: LORELEI Entity Detection and Linking Knowledge Base: the Knowledge Base (KB) developed by LDC for all LORELEI language pack entity linking annotation; drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook, and supplemented with manually created KB entries

Translated Discussion Forum Treebank: BOLT English Translation Treebank - Chinese Discussion Forum: developed by LDC, 147,432 tokens of web discussion forum data translated from Chinese to English with part-of-speech and syntactic structure annotations following Penn Treebank II style

Chinese Telephone Collection: Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese: 25 hours of Mandarin Chinese telephone speech, manually labeled for gender, dialect type and noise, collected by LDC to support automatic language identification