BOLT Treebank

Treebanks are fully parsed corpora that are manually annotated for syntactic structure at the sentence level and for part-of-speech or morphological information at the token level. Every token in every sentence is annotated. Treebanks support the creation and training of parsers and taggers, work on machine translation and speech recognition, and research on joint syntactic and semantic role labeling. Enhanced versions of Arabic, Chinese and English Treebanks were developed for the BOLT program by LDC and Brandeis University. 

Arabic Treebank

The development of the Arabic Treebank spans the DARPA TIDES, GALE, and BOLT programs. It covers newswire, broadcast, web, and chat data that have been tagged for part of speech and morphology and annotated for syntax.

LDC and Columbia University collaborated to develop morphological annotation guidelines for Egyptian Arabic based on Columbia’s Egyptian Arabic Morphological Analyzer CALIMA. In order to address the prevalence of Romanized script (Arabizi) in Egyptian Arabic SMS and chat data, the partners produced an orthography that normalizes spelling to facilitate morphological analysis and downstream annotation. 

Chinese Treebank

The Chinese Treebank Project, based at Brandeis University, represents over a decade’s work and likewise spans the DARPA GALE and BOLT programs. The latest version, Chinese Treebank 8.0, consists of approximately 1.6 million words of source data drawn from structured text (e.g., newswire), broadcast programming, and web data. It continues to grow in the BOLT program with the addition of SMS and conversational telephone speech. The Chinese Treebank has three layers of annotation (word segmentation, part-of-speech (POS) tagging, and phrase-structure-style syntactic parsing), each guided by its own set of guidelines.
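As a rough illustration of how the three layers build on one another, the sketch below represents a toy sentence at each layer and checks that the layers agree. The sentence is invented for illustration and is not taken from the corpus; the labels (PN, VV, NR, NP, VP, IP) are meant to resemble the treebank's label set and should be checked against the guidelines. The function names are hypothetical.

```python
# Toy sentence, invented for illustration; not drawn from the Chinese Treebank.
raw = "我爱北京"

# Layer 1: word segmentation of the raw character string.
words = ["我", "爱", "北京"]

# Layer 2: one POS tag per segmented word (labels assumed, see lead-in).
tags = ["PN", "VV", "NR"]

# Layer 3: phrase structure over the tagged words, as nested (label, body) tuples.
# A preterminal has a string body (the word); a phrase has a list of child nodes.
tree = ("IP", [
    ("NP", [("PN", "我")]),
    ("VP", [("VV", "爱"), ("NP", [("NR", "北京")])]),
])


def leaves(node):
    """Return the (word, tag) pairs at the bottom of a tree, left to right."""
    label, body = node
    if isinstance(body, str):
        return [(body, label)]
    pairs = []
    for child in body:
        pairs.extend(leaves(child))
    return pairs


def layers_consistent(raw, words, tags, tree):
    """Check that segmentation, tagging, and bracketing describe the same tokens."""
    return (
        "".join(words) == raw                      # segmentation covers the raw text
        and len(words) == len(tags)                # exactly one tag per word
        and leaves(tree) == list(zip(words, tags)) # tree leaves match tagged words
    )


print(layers_consistent(raw, words, tags, tree))
```

A check of this kind only confirms that the layers line up with each other; it says nothing about whether any individual annotation decision follows the guidelines.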

English Treebank

Starting with the Penn Treebank, the English Treebank has grown to include multiple genres that were POS-tagged and syntactically annotated at LDC under the DARPA TIDES and GALE programs. Enhancements for the BOLT project include Penn Treebank-style annotation of new genres (discussion forum, chat, SMS, conversational telephone speech) and of translated data, as well as updated annotation guidelines. The annotation pipeline uses NLP tools such as POS taggers and parsers to produce the input to human annotation, and a TAG-based approach to Treebank quality control and parser evaluation has been integrated into the pipeline.
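For readers unfamiliar with Penn Treebank-style annotation, the following minimal sketch reads one bracketed tree and recovers its (word, POS tag) pairs. It handles only a simplified subset of the format: every node is assumed to carry a label, and the example sentence is invented rather than taken from the corpus. The function names are hypothetical and are not part of any LDC tool.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]          # constituent label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def tagged_tokens(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_tokens(child))
    return pairs


# Invented example sentence in a Penn Treebank-like bracketing.
example = "(S (NP (PRP I)) (VP (VBP like) (NP (NNS treebanks))) (. .))"
tree = parse_bracketed(example)
pairs = tagged_tokens(tree)
print(pairs)
```

The sketch shows why the format serves both taggers and parsers: the POS layer can be read off the preterminal nodes, while the nesting of the brackets carries the syntactic structure.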

Annotation Guidelines

Arabic Treebank Guidelines

Penn Arabic Treebank Guidelines

Arabic Treebank Morphological Analysis and POS Annotation 

Guidelines for Treebank Annotation of Speech Effects and Disfluency for the Penn Arabic Treebank

Chinese Treebank Guidelines

Chinese Segmentation Guidelines

Chinese POS-tagging Guidelines

Chinese Bracketing Guidelines 

English Treebank Guidelines

Addendum to the Penn Treebank II Style Bracketing Guidelines 

SMS/Chat Treebank Guidelines 

Supplementary Guidelines for ETTB 2.0 

Bracketing Webtext: An Addendum to Penn Treebank II Guidelines 

Bracketing Guidelines for Treebank II Style 

Annotation Tools

Treebank annotation is performed with various tools. Both the Arabic and English Treebanks use TreeEditor for syntactic annotation. SelectPOS was developed for Arabic part-of-speech/morphological annotation, and Emacs was tailored for English part-of-speech annotation. SAMA and CALIMA are used for Arabic morphological analysis.