New Corpora

Egyptian Arabic/English Annotated Telephone Speech: BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training: developed by LDC, 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations; source data consists of translated transcripts from LDC's CALLHOME and CALLFRIEND Egyptian Arabic collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49)

New Mixer Release: Mixer 4 and 5 Speech: developed by LDC, 14,185 hours of cross-channel audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings from 616 American English speakers, collected in 2007 and used in the 2008 NIST Speaker Recognition Evaluation

Training and Evaluation Data for Distributional Semantic Models: EVALution: developed by The Hong Kong Polytechnic University, English and Mandarin Chinese data sets -- EVALution 1.0 and EVALution-Man -- containing semantic relations and metadata for training and evaluating distributional semantic models

Event Argument Extraction and Linking: TAC KBP English Event Argument - Training and Evaluation Data 2014-2015: English newswire and discussion forum source documents, manual runs, assessments, and event hoppers developed by LDC for the Event Argument Extraction and Linking tasks which required systems to extract and link event arguments from unstructured text

Cognitive Properties of Chinese Words: Chinese CogBank: developed for use in metaphor understanding and generation, 232,497 "word-property" pairs composed of 83,104 words and 100,195 properties, with associated frequencies that measure a property’s importance

English Data and Annotations for Reading System Development: Machine Reading Phase 1 IC Training Data: developed by LDC, 248 English newswire source documents and 116 standoff annotation files annotated with instances of explicit and non-explicit relations and their arguments, constituting the training data for the IC (Core Domain) task in the DARPA Machine Reading program

Dholuo Speech: IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b: developed by Appen, 204 hours of Dholuo conversational and scripted telephone speech with transcripts, collected in 2014-2015 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle

New AMR Release: Abstract Meaning Representation (AMR) Annotation Release 3.0: developed by LDC, SDL/Language Weaver, Inc., the University of Colorado, and the Information Sciences Institute, a semantic treebank of over 59k English natural language sentences; updates the second version (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations

Lexical Database of Chinese Words and Nonwords: Database of Word Level Statistics – Mandarin: developed by The Hong Kong Polytechnic University, descriptive and statistical lexical characteristics for words and nonwords of Mandarin Chinese

Spanish Read Speech and Transcripts: LibriVox Spanish: 73 hours of Spanish read speech from 154 native and non-native speakers (77 men and 77 women) and transcripts developed by native Spanish speakers; audio data from Spanish audiobooks developed by LibriVox