New Corpora

Chinese Mandarin Read Speech: AISHELL-1: developed by Beijing Shell Shell Technology Co. Ltd., 520 hours of Chinese Mandarin speech from 400 speakers across different accent areas in China, with transcripts, recorded in a quiet indoor environment on three different devices to assist speech recognition development in 11 domains including smart homes, autonomous driving, entertainment, finance, and science and technology

Brazilian Portuguese Speech for Education: Avatar Education Portuguese: developed by the University of Pernambuco, 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions, 1,400 utterances by two speakers transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels) for use in educational contexts, such as online learning 

Dialectal Arabic Treebank of Informal Web Data: BOLT Egyptian Arabic Treebank - Discussion Forum: developed by LDC, Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation created for the DARPA Broad Operational Language Translation (BOLT) Program; the annotations follow Penn Arabic Treebank (PATB) guidelines and contain 440,448 tokens before clitics were split and 508,548 tree tokens after clitics were split

Telugu Speech: IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a: developed by Appen, 201 hours of Telugu conversational and scripted telephone speech with transcripts, collected in 2013 and 2014 from speakers aged 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Concrete Annotation Schema: Concretely Annotated English Gigaword: developed by Johns Hopkins University's Human Language Technology Center of Excellence, adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07)

TAC KBP Evaluation and Training Data: TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014: developed by LDC for the Slot Filling evaluation track, which focused on mining information about entities from text; includes queries, manual runs and assessment results

Arabic-French Parallel Text: TRAD Arabic-French Parallel Text -- Newswire: developed by ELDA for the PEA-TRAD project, French translations of 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21)

Arabic, Chinese & English Web Text for Information Retrieval: BOLT Information Retrieval Comprehensive Training and Evaluation: all data produced by LDC in support of the DARPA BOLT IR task including annotations, source documents and scoring software

Amateur Web Video: HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation: 53 hours of user-generated videos with annotation and metadata developed for the HAVIC project and the NIST-sponsored Multimedia Event Detection task

Spanish Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Spanish: 23 hours of telephone speech in Spanish collected by LDC to support research and technology evaluation in automatic language identification

Kazakh Speech: IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a: developed by Appen, 203 hours of Kazakh conversational and scripted telephone speech with transcripts, collected in 2013 and 2014 from speakers aged 16 to 64 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle