New Corpora

New Arabic Treebank release: Arabic Treebank – Weblog: developed by LDC, Arabic weblog data with part-of-speech, morphology, gloss and syntactic tree annotation, including 243,117 source tokens before clitics were split and 308,996 tree tokens after clitics were split.

English and Spanish Blogs: NewSoMe Corpus of Opinion in Blogs: compiled at Barcelona Media, 108 English and 191 Spanish blog documents manually annotated for opinion, including topic, segment, cue, subjectivity, polarity and intensity.

Multilingual Dependency Treebanks:

2006 CoNLL Shared Task - Ten Languages: dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

2006 CoNLL Shared Task - Arabic & Czech: Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.

Arabic Handwriting: KHATT: Handwritten Arabic Text: developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology, unrestricted Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness and education levels, designed to promote research in areas such as text recognition and writer identification.

English Syllable Pronunciation: Articulation Index LSCP: developed by Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure, revisions and enhancements to a subset of Articulation Index (LDC2005S22), a corpus of 20 American English speakers pronouncing over 25,000 syllables. The enhancements include forced alignment to sound files, time alignment of syllable utterances and format conversions.

GALE Broadcast Collection: Arabic/Chinese broadcast speech, with associated transcripts, collected by LDC for the DARPA GALE program.

GALE Phase 3 Chinese Broadcast News Speech: 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology.

GALE Phase 3 Chinese Broadcast News Transcripts: the complete set of corresponding transcripts, containing 1,933,695 tokens in plain-text, tab-delimited format with UTF-8 encoding.

GALE Parallel, Word Aligned and Tagged Text: parallel, word-aligned and tagged Arabic/Chinese and English resources developed by LDC for the DARPA GALE program.

GALE Phase 4 Chinese Weblog Parallel Sentences: 231 source-translation document pairs, comprising 92,501 tokens of Chinese source text and its English translation.

GALE Phase 4 Chinese Newswire Parallel Sentences: 627 source-translation document pairs, comprising 90,434 tokens of Chinese source text and its English translation.