New Corpora

Spanish, Catalan and Portuguese Opinion Text: NewSoMe Corpus of Opinion in News Report: compiled at Barcelona Media, 200 news report documents in each of Spanish, Catalan and Portuguese, manually annotated for opinion, including topic, segment, cue, subjectivity, polarity and intensity

Arabic Written Essays and Spoken Recordings: Arabic Learner Corpus: developed by the University of Leeds, 282,732 words in 1,585 materials produced by 942 students of 67 nationalities studying Arabic at pre-university and university levels in Saudi Arabia

Penn Treebank Revised: English News Text Treebank: Penn Treebank Revised: developed by LDC with funding through a gift from Google Inc., automated and manual revisions (including revised tokenization, part-of-speech and syntactic treebank annotation) of the Penn Treebank Wall Street Journal data, covering 1,203,648 word-level tokens in 49,191 sentence-level tokens across all 2,312 of the original PTB WSJ files

Turkish Web Text: TS Wikipedia: 1.6 million processed and tokenized Turkish Wikipedia pages, including part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams

Navigational Telephone Speech: The Walking Around Corpus: developed by Stony Brook University, 33 hours of navigational telephone dialogues from 72 speakers working in pairs, including the visual materials used to elicit the dialogues, data about the speakers’ relationship, spatial abilities and memory performance, and more

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Arabic Broadcast Conversation Speech Part 1: 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco)

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1: the complete set of corresponding transcripts, comprising 733,233 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources developed by LDC for the DARPA GALE program

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4: 243,038 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags

GALE Phase 3 and 4 Arabic Newswire Parallel Text: 156,775 tokens of Arabic source text and its English translation