New Corpora

Event Argument Extraction and Linking: TAC KBP English Event Argument - Training and Evaluation Data 2014-2015: English newswire and discussion forum source documents, manual runs, assessments, and event hoppers developed by LDC for the Event Argument Extraction and Linking tasks, which required systems to extract and link event arguments from unstructured text

Cognitive Properties of Chinese Words: Chinese CogBank: developed for use in metaphor understanding and generation, 232,497 "word-property" pairs comprising 83,104 words and 100,195 properties, with associated frequencies to measure a property’s importance

English Data and Annotations for Reading System Development: Machine Reading Phase 1 IC Training Data: developed by LDC, 248 English newswire source documents and 116 standoff annotation files annotated with instances of explicit and non-explicit relations and their arguments, constituting the training data for the IC (Core Domain) task in the DARPA Machine Reading program

Dholuo Speech: IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b: developed by Appen, 204 hours of Dholuo conversational and scripted telephone speech with transcripts collected in 2014-2015 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle

New AMR Release: Abstract Meaning Representation (AMR) Annotation Release 3.0: developed by LDC, SDL/Language Weaver, Inc., the University of Colorado, and the Information Sciences Institute, a semantic treebank of over 59k English natural language sentences; updates the second version (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations

Lexical Database of Chinese Words and Nonwords: Database of Word Level Statistics – Mandarin: lexical characteristics of a descriptive and statistical nature for words and nonwords of Mandarin Chinese developed by The Hong Kong Polytechnic University

Spanish Read Speech and Transcripts: LibriVox Spanish: 73 hours of Spanish read speech from 154 native and non-native speakers (77 men and 77 women) and transcripts developed by native Spanish speakers; audio data from Spanish audiobooks developed by LibriVox

Chinese Conversations, Transcripts, and Metadata: Magic Data Chinese Mandarin Conversational Speech: developed by Beijing Magic Data Technology Co., Ltd., 10 hours of Mandarin conversational speech from 60 native speakers recorded on multiple devices and presented in multiple forms, totaling 60 hours with corresponding transcripts and metadata (topic, collection date, mobile device, speaker demographic information)

Egyptian Arabic/English Annotated SMS/Chat: BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training: developed by LDC, 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations

Resources for TAC KBP EDL Track: TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017: queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information developed by LDC for end-to-end entity extraction, linking, and clustering