New Corpora

Large Vocabulary Continuous Speech Recognition with the Wall Street Journal (WSJ): multichannel WSJ recordings building on WSJCAM0 Cambridge Read News and CSR-1 (WSJ0) Complete

Multi-Channel WSJ Audio: 100 hours of speech from 45 British English speakers reading WSJ texts recorded in three meeting-type scenarios with a variety of microphones

English Hyponyms: Domain-Specific Hyponym Relations: 5,000 English hyponym relations taken from Wikipedia articles in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program

GALE Arabic-English Parallel Aligned Treebank -- Web Training: 69,766 tokens of word aligned Arabic and English parallel text with treebank annotations from Arabic source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web: 344,680 tokens of word aligned Arabic and English parallel text collected from Arabic source newswire and web data by LDC in 2006-2008

GALE Phase 2 Chinese Broadcast News Parallel Text Part 1: 30 source-translation document pairs comprising 198,350 characters of translated material collected by LDC from Chinese broadcast news programming between 2005 and 2007

MALACH (Multilingual Access to Large Spoken Archives) Project: methods for improved access to large multinational spoken archives, with a focus on advancing the state of the art in automatic speech recognition and information retrieval

USC-SFI MALACH Interviews and Transcripts Czech: 229 hours of interviews, with 143 hours transcribed, from 420 individual Holocaust survivors and witnesses recorded in quiet and noisy environments