New Corpora

Arabic Handwriting: KHATT: Handwritten Arabic Text: developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology, unrestricted Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness and education levels, designed to promote research in areas such as text recognition and writer identification

English Syllable Pronunciation: Articulation Index LSCP: developed by Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure, revisions and enhancements to a subset of Articulation Index (LDC2005S22) – a corpus of 20 American English speakers pronouncing over 25,000 syllables – that include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions

German Children's Text: Karlsruhe Children's Text: developed by the Cooperative State University Baden-Württemberg, University of Education and Karlsruhe Institute of Technology, 14,000 freely written German sentences from more than 1,700 children in grades one through eight, digitized and corrected for orthography, with spelling errors annotated at the grapheme and syllable levels and for morphology and syntax

Spanish Newswire Text: ACE 2007 Spanish DevTest - Pilot Evaluation: developed by LDC, the complete set of Spanish development and test data from the 2007 ACE technology evaluation consisting of newswire data annotated for entities and temporal expressions

Spanish, Catalan and Portuguese Opinion Text: NewSoMe Corpus of Opinion in News Reports: compiled at Barcelona Media, 200 news report documents in each of Spanish, Catalan and Portuguese, manually annotated for opinion, including topic, segment, cue, subjectivity, polarity and intensity

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources developed by LDC for the DARPA GALE program

GALE Phase 4 Chinese Newswire Parallel Sentences: 627 source-translation document pairs comprising 90,434 tokens of Chinese source text and its English translation

GALE Phase 4 Chinese Broadcast News Parallel Sentences: 40 source-translation document pairs comprising 156,249 tokens of Chinese source text and its English translation

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4: 243,038 tokens of word aligned Chinese and English parallel text enriched with linguistic tags

GALE Phase 3 and 4 Arabic Newswire Parallel Text: 156,775 tokens of Arabic source text and its English translation