New Corpora

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web: 344,680 tokens of word aligned Arabic and English parallel text collected from Arabic source newswire and web data by LDC in 2006-2008

GALE Phase 2 Chinese Broadcast News Parallel Text Part 1: 30 source-translation document pairs comprising 198,350 characters of translated material collected by LDC from Chinese broadcast news programming between 2005 and 2007

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2: over 141,000 tokens of word aligned Arabic and English parallel text with treebank annotations from Arabic source broadcast programming collected by LDC in 2007 and 2008

MALACH (Multilingual Access to Large Spoken Archives) Project: methods for improved access to large multinational spoken archives, focused on advancing the state of the art in automatic speech recognition and information retrieval

USC-SFI MALACH Interviews and Transcripts Czech: 229 hours of interviews, with 143 hours transcribed, from 420 individual Holocaust survivors and witnesses recorded in quiet and noisy environments

Arabic Speech Database: King Saud University Arabic Speech Database: 590 hours of recorded Arabic speech from 269 male and female speakers in quiet and noisy environments with speaker metadata

Open Machine Translation (OpenMT) Evaluation Series: LDC’s ongoing effort to support research in machine translation (MT) technologies that translate all forms of text between human languages

NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source: 20 files that include the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set

CALLFRIEND Farsi Second Edition: the release of the entire collection of telephone conversations among native Farsi speakers -- originally recorded in 1995 and 1996 -- and their transcripts

CALLFRIEND Farsi Second Edition Speech: over 42 hours of telephone conversations (100 recordings) in which native Farsi speakers in the continental United States each made a single call to a family member or friend living in the United States

CALLFRIEND Farsi Second Edition Transcripts: the complete set of transcripts in three formats (Romanized, Arabic-script and XML)