New Corpora

Nahuatl Field Collections and Related Resources: Ethnobotanical Research and Language Documentation of Nahuatl: 190 hours of field recordings conducted in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico; audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata

New Chinese AMR Release: Chinese Abstract Meaning Representation 2.0: developed by Brandeis University and Nanjing Normal University, a cumulative release of semantic representations for 20k Chinese sentences from Chinese Treebank 8.0 (LDC2013T21); includes the original Chinese AMR data, a converted English AMR format and a Chinese syntactic dependency tree format, each divided into training, development and test sets

Co-reference Annotation on Egyptian-Arabic Informal Text: BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech: developed by Raytheon BBN Technologies, co-reference annotation performed on BOLT treebank annotation covering noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs

Virtual Science Tutor Interactions: MyST Children’s Conversational Speech: developed by Boulder Learning Inc., 470 hours of English speech from 1371 students (grades 3-5) answering open-ended questions; includes transcripts, a pronunciation dictionary, and dev, test and train partitions for ASR

Egyptian Arabic CTS Treebank: BOLT Egyptian Arabic Treebank – Conversational Telephone Speech: 150,000+ tokens with part-of-speech annotation, morphology, gloss and syntactic tree annotation from CALLHOME transcripts, developed by LDC for the DARPA BOLT program

Tamil Dysarthric Speech: The SSNCE Database of Tamil Dysarthric Speech: developed by the Speech Lab, SSN College of Engineering, India, and the Indian National Institute of Empowerment of Persons with Multiple Disabilities, 8 hours of Tamil speech data (words & sentences), time-aligned transcripts and metadata from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers)

Annotated English Newswire for Phrasal Paraphrase Detection: ESPADA (Extended Syntactic Phrase Alignment DAtaset): annotated parse trees and alignment on English sentential paraphrases from NIST’s OpenMT evaluation corpora; extends SPADE (LDC2018T09) by adding new annotated data to SPADE's development and test sets for training and testing phrasal paraphrase detection and phrase representation models

Chinese-English Parallel SMS/Chat: BOLT Chinese SMS/Chat Parallel Training Data: 1.8 million tokens of Chinese SMS/Chat data with English translations, developed by LDC for DARPA’s BOLT program