Treebanks are fully parsed corpora that are manually annotated for syntactic structure at the sentence level and for part-of-speech or morphological information at the token level. Every token in every sentence is annotated. Treebanks support the creation and training of parsers and taggers, work on machine translation and speech recognition, and research on joint syntactic and semantic role labeling. Enhanced versions of Arabic, Chinese and English Treebanks were developed for the BOLT program by LDC and Brandeis University.
The development of the Arabic Treebank spans the DARPA TIDES, GALE, and BOLT programs. It covers newswire, broadcast, web and chat data that have been tagged for part-of-speech and morphology and annotated for syntactic structure.
LDC and Columbia University collaborated to develop morphological annotation guidelines for Egyptian Arabic based on Columbia’s Egyptian Arabic Morphological Analyzer CALIMA. In order to address the prevalence of Romanized script (Arabizi) in Egyptian Arabic SMS and chat data, the partners produced an orthography that normalizes spelling to facilitate morphological analysis and downstream annotation.
The Chinese Treebank Project, based at Brandeis University, represents over a decade's work and likewise spans the DARPA GALE and BOLT programs. The latest version, Chinese Treebank 9.0, consists of approximately two million words of source data drawn from structured text (e.g., newswire), broadcast programming and web data. It continues to grow in the BOLT program with the addition of SMS and conversational telephone speech. The Chinese Treebank has three layers of annotation -- word segmentation, part-of-speech (POS) tagging, and phrase-structure-style syntactic parsing -- each guided by its own set of guidelines.
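To make the three annotation layers concrete, the following minimal Python sketch reads a bracketed parse in the Penn Treebank-style notation and recovers the word segmentation, the POS tags, and the phrase labels from it. The sentence is a hand-made toy example, not taken from the corpus, and the function names (`parse_bracketed`, `tagged_words`, `phrase_labels`) are hypothetical helpers written for this illustration; they are not part of any LDC or Brandeis tool. Real treebank files contain additional conventions (empty categories, function tags, file headers) that this sketch does not handle.

```python
from typing import List, Tuple


def parse_bracketed(text: str) -> tuple:
    """Parse one bracketed tree into nested (label, children) tuples.

    A leaf word is stored as a plain string inside its preterminal's children.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> tuple:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def is_preterminal(tree: tuple) -> bool:
    """A preterminal is a POS node that dominates exactly one word."""
    _, children = tree
    return len(children) == 1 and isinstance(children[0], str)


def tagged_words(tree: tuple) -> List[Tuple[str, str]]:
    """Return (word, POS) pairs in sentence order."""
    if is_preterminal(tree):
        return [(tree[1][0], tree[0])]
    pairs = []
    for child in tree[1]:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


def phrase_labels(tree: tuple) -> List[str]:
    """Return the labels of all phrasal (non-POS) nodes in preorder."""
    if is_preterminal(tree):
        return []
    labels = [tree[0]]
    for child in tree[1]:
        if isinstance(child, tuple):
            labels.extend(phrase_labels(child))
    return labels


# Toy example (assumption: invented for illustration, not corpus data).
EXAMPLE = "(IP (NP (NR 上海)) (VP (VV 发展) (NP (NN 经济))))"

tree = parse_bracketed(EXAMPLE)
pairs = tagged_words(tree)
segmentation = [word for word, _ in pairs]   # layer 1: word segmentation
pos_tags = [tag for _, tag in pairs]         # layer 2: POS tagging
phrases = phrase_labels(tree)                # layer 3: phrase structure

print(segmentation)
print(pos_tags)
print(phrases)
```

In this representation the segmentation is simply the sequence of leaves, the POS layer is the set of nodes directly above the leaves, and the syntactic layer is everything above those, which is why a single bracketed file can carry all three layers at once.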
Starting with the Penn Treebank, the English Treebank has grown to include multiple genres that were POS-tagged and syntactically annotated at LDC under the DARPA TIDES and GALE programs. Enhancements for the BOLT project include Penn Treebank-style annotation on new genres (discussion forum, chat, SMS, conversational telephone speech) and on translated data, as well as updated annotation guidelines. The annotation pipeline uses NLP tools such as POS taggers and parsers to produce a first-pass analysis that human annotators then correct, and a TAG-based approach to Treebank quality control and parser evaluation has been integrated into the pipeline.
Arabic Treebank Guidelines
Chinese Treebank Guidelines
English Treebank Guidelines
Treebank annotation is performed with various tools. Both the Arabic and English Treebanks use TreeEditor for syntactic annotation. SelectPOS was developed for Arabic part-of-speech/morphological annotation, and Emacs was tailored for English part-of-speech annotation. SAMA and CALIMA are used for Arabic morphological analysis.