BOLT Word Alignment

Word alignment plays a critical role in statistical machine translation (SMT). It determines translation correspondences between parallel sentences, yielding links between individual words, phrases, and word groups. Tokens that have no match in the parallel sentence are explicitly marked, and links may be categorized by their syntactic or semantic function. Incorporating word-level alignments into the parameter estimation of SMT systems reduces alignment error rate and further improves translation quality.
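The ideas in the paragraph above can be illustrated with a minimal Python sketch. It is not the BOLT annotation format: the `Link` class, the `label` values, and the helper names are hypothetical, chosen only for illustration. The error-rate function follows the commonly used alignment error rate definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links, and P the possible gold links (taken here to include S).

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Link:
    """One alignment link between token groups (hypothetical structure)."""
    source: Tuple[int, ...]   # indices of source-side tokens
    target: Tuple[int, ...]   # indices of target-side tokens
    label: str = "unlabeled"  # illustrative category, not a BOLT tag name


def unaligned(n_tokens: int, links: List[Link], side: str) -> List[int]:
    """Return indices of tokens on `side` ('source' or 'target') with no link."""
    covered = {i for link in links for i in getattr(link, side)}
    return [i for i in range(n_tokens) if i not in covered]


def alignment_error_rate(
    predicted: Set[Tuple[int, int]],
    sure: Set[Tuple[int, int]],
    possible: Set[Tuple[int, int]],
) -> float:
    """AER over word-to-word links; sure links also count as possible."""
    possible = possible | sure
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy sentence pair: "我 喜欢 猫" / "I do like cats"
source_tokens = ["我", "喜欢", "猫"]
target_tokens = ["I", "do", "like", "cats"]
links = [
    Link((0,), (0,), "semantic"),
    Link((1,), (2,), "semantic"),
    Link((2,), (3,), "semantic"),
]

# "do" (target index 1) has no counterpart and would be marked as unaligned.
unaligned_target = unaligned(len(target_tokens), links, "target")
unaligned_source = unaligned(len(source_tokens), links, "source")

# Compare a hypothetical automatic alignment against the gold links above.
gold_sure = {(0, 0), (1, 2), (2, 3)}
auto = {(0, 0), (1, 2), (2, 2)}
aer = alignment_error_rate(auto, gold_sure, set())
```

In this toy case the automatic alignment gets two of three sure links right, so the resulting error rate is one third; a lower value means closer agreement with the manual alignment.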

Automatic word alignment has received extensive attention as a vital component of all SMT approaches. With the availability of manually word-aligned data, supervised methods such as Maximum Entropy-based models have shown promising results. To support machine translation training in the BOLT program, LDC created a large amount of word-aligned data in Chinese-English and Egyptian Arabic-English, drawn from discussion forums, text messages, and chat.

Annotation Guidelines

Chinese Word Alignment Guidelines

Chinese Tagging Guidelines

Egyptian Arabic Word Alignment Guidelines