New Corpora

Email Communications: Avocado Research Email Collection: 279 email accounts from a defunct technology company, including emails, attachments and processed personal folders, with metadata describing folder structure, email characteristics, calendar entries and contacts, for use in social network analysis, e-discovery and related fields

Multichannel Noisy Telephone Speech: RATS Speech Activity Detection: developed by LDC, 3,000 hours of multilingual conversational telephone speech passed through eight distinct transceiver configurations, yielding severely degraded audio signals typical of various radio communication channels

Spanish and Catalan Text: SenSem Databank: developed by GRIAL, syntactic and semantic annotation for over 35,000 sentences, approximately one million words of Spanish and 700,000 words of Catalan translated from the Spanish

Open Relation Extraction: Benchmarks for Open Relation Extraction: developed by the University of Alberta, 14,000 sentences from the New York Times Annotated Corpus (LDC2008T19) and Treebank-3 (LDC99T42) providing benchmarks for the task of open relation extraction (ORE), along with sample extractions from ORE methods and evaluation scripts

New Additions to the Fisher and CALLHOME Series: Fisher and CALLHOME Spanish-English Speech Translation: developed by Johns Hopkins University, 38 hours of speech with defined training, development and held-out test sets gathered from audio and transcript releases from Fisher Spanish (LDC2010T04) and CALLHOME Spanish (LDC96T17)

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 2 Arabic Broadcast News Speech Part 2: 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet and MTC featuring news programs from eleven broadcasters

GALE Phase 2 Arabic Broadcast News Transcripts Part 2: the complete set of corresponding transcripts including 920,730 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Phase 3 Chinese Broadcast Conversation Speech Part 1: 126 hours of Mandarin Chinese broadcast conversation speech collected in 2007 by LDC and Hong Kong University of Science and Technology featuring interviews, call-in programs and roundtable discussions from five broadcasters

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1: the complete set of corresponding transcripts including 1,556,904 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources that LDC developed for the DARPA GALE program

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3: 242,020 tokens of word-aligned Chinese-English parallel text enriched with linguistic tags