New Corpora

Egyptian Arabic/English Parallel Web Text: BOLT Arabic Discussion Forum Parallel Training Data: developed by LDC, 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations

Task-oriented Speech from Middle School Students: SRI Speech-Based Collaborative Learning Corpus: developed by SRI International, 120 hours of English speech from 134 US middle school students working collaboratively on mathematics problems with orthographic transcriptions, manual annotation, log files, and supporting documentation; created to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality 

Resources for TAC KBP Entity Discovery and Linking Task: TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015: queries, knowledge base (KB) links, equivalence class clusters for NIL entities, entity type information for queries, source documents (Chinese, English and Spanish newswire and web text) and BaseKB, the reference KB adopted in 2015

Enhanced Mandarin Telephone Speech and Transcripts: HUB5 Mandarin Telephone Speech and Transcripts Second Edition: developed by LDC, 19 hours of Mandarin speech from 42 unscripted telephone conversations with transcripts of contiguous 5-30 minute segments per call; .wav audio format, Pinyin transcripts, forced alignment, updated documentation and metadata

German Speech for Social Characteristic Detection: Nautilus Speaker Characterization: developed by the Technical University of Berlin, 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, annotated for speaker social characteristics, such as personality, charisma and voice attractiveness

English Relation Extraction: TAC Relation Extraction Dataset: developed by The Stanford NLP Group, a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the 2009-2014 NIST TAC KBP English slot filling evaluations

Chinese Mandarin Read Speech: AISHELL-1: developed by Beijing Shell Shell Technology Co. Ltd., 520 hours of Chinese Mandarin speech with transcripts from 400 speakers across different accent areas in China, recorded in a quiet indoor environment on three different devices to assist speech recognition development in 11 domains including smart homes, autonomous driving, entertainment, finance, and science and technology

Brazilian Portuguese Speech for Education: Avatar Education Portuguese: developed by the University of Pernambuco, 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions, 1,400 utterances by 1,400 speakers transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels) for use in educational contexts, such as online learning 

Dialectal Arabic Treebank of Informal Web Data: BOLT Egyptian Arabic Treebank - Discussion Forum: developed by LDC, Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation created for the DARPA Broad Operational Language Translation (BOLT) Program; the annotations follow Penn Arabic Treebank (PATB) guidelines and contain 440,448 tokens before clitics were split and 508,548 tree tokens after clitics were split

Telugu Speech: IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a: developed by Appen, 201 hours of Telugu conversational and scripted telephone speech with transcripts, collected in 2013 and 2014 from speakers aged 16 to 65 years using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle