BOLT Co-reference

Co-reference annotation, provided for BOLT by Raytheon BBN Technologies, captures the part of human language interpretation that links definite references in the text to the respective entities in the discourse. Annotators link together names, pronouns, and definite descriptions that refer to the same entity, providing crucial information for systems performing semantic interpretation. Noun phrase mentions of events are also linked to the verb phrases that describe the same event.

BOLT Translation

LDC provided translations for a large volume of BOLT Chinese and Egyptian Arabic source data, including discussion forum threads in Phase 1 and text message (SMS) and chat conversations in Phase 2. Data was selected and segmented into sentence units (discussion forum) and message units (SMS/chat) for translation. This ensured that only the appropriate data was translated and also allowed the resulting parallel text to be aligned at the sentence or message level.

BOLT Discussion Forum Data Collection

LDC collected threaded posts from online discussion forums in each of three languages: Chinese, Egyptian Arabic and English.

To create a corpus with both a high volume of data and a reasonable concentration of threads that met content and language requirements, LDC used a three-stage collection strategy: manual data scouting fed the corpus with appropriate content, a semi-supervised harvesting process augmented the corpus with larger quantities of automatically harvested data, and a human triage task selected harvested content for the downstream pipeline.

BOLT Data Collection

BOLT developed technology that enables English speakers to retrieve and understand information from informal foreign-language sources, including chat, text messaging and spoken conversations. The genres of interest to BOLT were characterized by inherent variation and inconsistency, motivating the development of new collection and annotation methods.