New Corpora

Variability within and across English speakers: UCLA Variability Speaker Database: developed by the UCLA Speech Processing and Auditory Perception Laboratory, 34 hours of speech and transcripts from 202 English speakers making vowel sounds, reading sentences, giving instructions, engaging in neutral, happy and annoyed conversations, having a telephone conversation and responding to a video

Egyptian Arabic SMS/Chat Treebank: BOLT Egyptian Arabic Treebank – SMS/Chat: 435,000+ tokens with part-of-speech, morphology and syntactic tree annotation from collected/donated Egyptian Arabic SMS/chat texts, developed by LDC for the DARPA BOLT program

Multi-channel, Multi-language CTS for Speaker ID: RATS Speaker Identification: 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech retransmitted over eight channels (about 17,000 hours in total) with speech segment annotations, developed by LDC for DARPA’s RATS program

100 Million Word Arabic Dictionary: Classical Arabic Dictionary: from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata

Alignments for Discourse Relations: DiscAlign for Penn and RST Discourse Treebanks: developed by Saarland University, 6,700 alignments for discourse annotations in Penn Discourse Treebank Version 2.0 (LDC2008T05) and RST Discourse Treebank (LDC2002T07), available to all PDTB 2.0 and RST-DT licensees

Spanish Read Speech, Transcripts: Wikipedia Spanish Speech and Transcripts: 25 hours of speech from 193 speakers reading Wikipedia articles, with transcripts and speaker metadata

Egyptian Arabic-English Parallel SMS/Chat: BOLT Egyptian Arabic SMS/Chat Parallel Training Data: 723,000 tokens of Egyptian Arabic SMS/chat data with English translations, developed by LDC for DARPA’s BOLT program