BOLT Treebank
Treebanks are fully parsed corpora that are manually annotated for syntactic structure at the sentence level and for part-of-speech or morphological information at the token level. Every token in every sentence is annotated. Treebanks support the creation and training of parsers and taggers, work on machine translation and speech recognition, and research on joint syntactic and semantic role labeling. Enhanced versions of Arabic, Chinese and English Treebanks were developed for the BOLT program by LDC and Brandeis University.
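As a rough illustration of what sentence-level structure plus token-level tags look like in practice, the sketch below reads one Penn Treebank-style bracketed string and recovers the (word, POS) pair for each token. The example sentence and the helper names `parse_tree` and `tagged_tokens` are invented for illustration; they are not taken from any LDC release or tool. The sketch also assumes a single well-formed tree with a labeled root.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a bare word (leaf)
        label = tokens[pos]           # constituent label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                      # consume the closing ")"
        return [label] + children

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree

def tagged_tokens(tree):
    """Collect (word, POS) pairs from the preterminal nodes, left to right."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]
    pairs = []
    for child in tree[1:]:
        pairs.extend(tagged_tokens(child))
    return pairs

# Hypothetical example sentence, not drawn from a treebank.
EXAMPLE = "(S (NP (DT The) (NN parser)) (VP (VBZ works)) (. .))"

if __name__ == "__main__":
    print(tagged_tokens(parse_tree(EXAMPLE)))
```

The nested-list form keeps the syntactic bracketing, while `tagged_tokens` projects out only the token-level layer that a tagger would be trained on.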
Arabic Treebank
The Arabic Treebank was developed across the DARPA TIDES, GALE, and BOLT programs. It covers newswire, broadcast, web and chat data that has been tagged for part of speech and morphology and annotated for syntactic structure.
LDC and Columbia University collaborated to develop morphological annotation guidelines for Egyptian Arabic based on CALIMA, Columbia’s Egyptian Arabic morphological analyzer. To address the prevalence of Romanized script (Arabizi) in Egyptian Arabic SMS and chat data, the partners produced an orthography that normalizes spelling to facilitate morphological analysis and downstream annotation.
Chinese Treebank
The Chinese Treebank Project, based at Brandeis University, represents over a decade’s work and likewise spans the DARPA GALE and BOLT programs. The latest version, Chinese Treebank 9.0, consists of approximately two million words of source data drawn from structured text (e.g., newswire), broadcast programming and web data. It continues to grow in the BOLT program with the addition of SMS and conversational telephone speech. The Chinese Treebank has three layers of annotation: word segmentation, part-of-speech (POS) tagging, and phrase-structure syntactic parsing. Each layer is guided by its own set of guidelines.
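The three layers can be pictured as successive views of the same sentence. The sketch below is only an illustration under stated assumptions: the sentence is invented, the tuple-based tree encoding and the helper names `leaves` and `layers_consistent` are hypothetical, and the tree omits details that real Chinese Treebank annotation carries, such as function tags and empty categories. It checks that each layer agrees with the one beneath it.

```python
# Unsegmented input text (invented example, not from the corpus).
RAW = "张三喜欢音乐"

# Layer 1: word segmentation.
SEGMENTED = ["张三", "喜欢", "音乐"]

# Layer 2: POS tagging (NR = proper noun, VV = verb, NN = common noun).
TAGGED = [("张三", "NR"), ("喜欢", "VV"), ("音乐", "NN")]

# Layer 3: phrase structure, encoded here as (label, children-or-word) tuples.
TREE = ("IP", [
    ("NP", [("NR", "张三")]),
    ("VP", [("VV", "喜欢"),
            ("NP", [("NN", "音乐")])]),
])

def leaves(node):
    """Return the (word, POS) pairs at the bottom of the tree, left to right."""
    label, body = node
    if isinstance(body, str):
        return [(body, label)]
    pairs = []
    for child in body:
        pairs.extend(leaves(child))
    return pairs

def layers_consistent(raw, segmented, tagged, tree):
    """Check that segmentation, tagging and parse describe the same sentence."""
    return ("".join(segmented) == raw
            and [word for word, _ in tagged] == segmented
            and leaves(tree) == tagged)

if __name__ == "__main__":
    print(layers_consistent(RAW, SEGMENTED, TAGGED, TREE))
```

A consistency check of this kind reflects the dependency between layers: a change to the segmentation forces changes to the tags and the tree built on top of it.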
English Treebank
Starting with the Penn Treebank, the English Treebank has grown to include multiple genres that were POS-tagged and syntactically annotated at LDC under the DARPA TIDES and GALE programs. Enhancements for the BOLT project include updated annotation guidelines and Penn Treebank-style annotation of new genres (discussion forum, chat, SMS, conversational telephone speech) and of translated data. The annotation pipeline uses NLP tools such as POS taggers and parsers to produce the input for human annotation, and it integrates a TAG-based approach to Treebank quality control and parser evaluation.
Annotation Guidelines
Arabic Treebank Guidelines
Penn Arabic Treebank Guidelines
Arabic Treebank Morphological Analysis and POS Annotation
Guidelines for Treebank Annotation of Speech Effects and Disfluency for the Penn Arabic Treebank
Chinese Treebank Guidelines
Chinese Segmentation Guidelines
Chinese POS-tagging Guidelines
English Treebank Guidelines
Addendum to the Penn Treebank II Style Bracketing Guidelines
Supplementary Guidelines for ETTB 2.0
Bracketing Webtext: An Addendum to Penn Treebank II Guidelines
Bracketing Guidelines for Treebank II Style
Annotation Tools
Treebank annotation is performed with several tools. Both the Arabic and English Treebanks use TreeEditor for syntactic annotation. SelectPOS was developed for Arabic part-of-speech and morphological annotation, and Emacs was tailored for English part-of-speech annotation. SAMA and CALIMA are used for Arabic morphological analysis.