New Corpora

New AMR Release: Abstract Meaning Representation (AMR) Annotation Release 3.0: developed by LDC, SDL/Language Weaver, Inc., the University of Colorado, and the Information Sciences Institute, a semantic treebank of over 59k English natural language sentences; updates the second version (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations

Lexical Database of Chinese Words and Nonwords: Database of Word Level Statistics – Mandarin: developed by The Hong Kong Polytechnic University, descriptive and statistical lexical characteristics for words and nonwords of Mandarin Chinese

Spanish Read Speech and Transcripts: LibriVox Spanish: 73 hours of Spanish read speech from 154 native and non-native speakers (77 men and 77 women) and transcripts developed by native Spanish speakers; audio data from Spanish audiobooks developed by LibriVox

Chinese Conversations and Transcripts, Metadata: Magic Data Chinese Mandarin Conversational Speech: developed by Beijing Magic Data Technology Co., Ltd., 10 hours of Mandarin conversational speech from 60 native speakers, recorded on multiple devices and presented in multiple forms for a total of 60 hours, with corresponding transcripts and metadata (topic, collection date, mobile device, speaker demographic information)

Egyptian Arabic/English Annotated SMS/Chat: BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training: developed by LDC, 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations

Resources for TAC KBP EDL Track: TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017: queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information developed by LDC for end-to-end entity extraction, linking and clustering

English Discussion Forums Annotated for Committed Belief: DEFT English Committed Belief Annotation: developed by LDC, 950,000 words of English discussion forum text annotated to mark the level of commitment displayed by the author to the truth of the propositions expressed in the text

Enhanced CALLFRIEND American English-Non-Southern Dialect: CALLFRIEND American English-Non-Southern Dialect Second Edition: developed by LDC, 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English, updated with audio files in .wav format, a simplified directory structure and additional documentation and metadata

TAC KBP Cold Start Evaluation Documents: TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017: Chinese, English and Spanish newswire and web text source documents, queries, assessments, manual runs and final assessments developed by LDC for the Cold Start task, which required systems to identify all entities and to build a knowledge base of the relations between them

Amharic Speech: IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b: developed by Appen, 204 hours of Amharic conversational and scripted telephone speech with transcripts, collected in 2014 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle