BOLT Co-reference
Co-reference annotation, provided for BOLT by Raytheon BBN Technologies, captures the part of human language interpretation that links definite references in the text to the respective entities in the discourse. Annotators link together names, pronouns, and definite descriptions that refer to the same entity, providing crucial information for systems performing semantic interpretation. Noun phrase mentions of events are also linked to verb phrases that describe the event.
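One way to picture the result of this linking is as a set of co-reference chains, where each chain collects every mention of the same entity or event. The sketch below is illustrative only and is not the BOLT annotation format: the `Mention` fields, the `kind` labels, the token offsets, and the `build_chains` helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    start: int  # hypothetical token offset of the mention
    kind: str   # hypothetical label: "name", "pronoun", "definite", "event_verb"


def build_chains(links):
    """Group mentions into chains from pairwise co-reference links (union-find)."""
    parent = {}

    def find(m):
        parent.setdefault(m, m)
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for m in list(parent):
        chains.setdefault(find(m), []).append(m)
    # Order mentions inside each chain by position in the text.
    return [sorted(chain, key=lambda m: m.start) for chain in chains.values()]


# Invented example mentions: one entity chain and one event chain.
name = Mention("Dr. Lee", 0, "name")
pronoun = Mention("she", 5, "pronoun")
description = Mention("the doctor", 12, "definite")
event_np = Mention("the announcement", 20, "definite")
event_vp = Mention("announced", 3, "event_verb")

links = [(name, pronoun), (pronoun, description), (event_np, event_vp)]
chains = build_chains(links)
```

Here the name, pronoun, and definite description end up in one chain, while the noun phrase "the announcement" is linked to the verb phrase that describes the same event.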
BOLT Translation
LDC provided translations for a large volume of BOLT Chinese and Egyptian Arabic source data, including discussion forum threads in Phase 1 and text message (SMS) and chat conversations in Phase 2. Data was selected and segmented into sentence units (discussion forum) and message units (SMS/chat) for translation. This ensured that only appropriate data was translated and allowed the resulting parallel text to be aligned at the sentence or message level.
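Because segmentation happens before translation, alignment can be as simple as pairing units that share an identifier. The following is a minimal sketch under that assumption; the unit IDs, the placeholder strings, and the `align_units` function are hypothetical and do not reflect LDC's actual file formats or tooling.

```python
def align_units(source_units, translated_units):
    """Pair each source unit with its translation by shared unit ID.

    Both arguments are lists of (unit_id, text) tuples. Source units
    without a translation (data not selected for translation) are skipped.
    """
    translations = dict(translated_units)
    return [
        (unit_id, text, translations[unit_id])
        for unit_id, text in source_units
        if unit_id in translations
    ]


# Placeholder content only; "m3" stands for a unit that was not selected.
source = [
    ("m1", "<source message 1>"),
    ("m2", "<source message 2>"),
    ("m3", "<source message 3>"),
]
translated = [
    ("m1", "<English translation 1>"),
    ("m2", "<English translation 2>"),
]

aligned = align_units(source, translated)
```

Each entry in `aligned` holds a unit ID, the source text, and its translation, which is the sentence-level or message-level pairing described above.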
BOLT Discussion Forum Data Collection
LDC collected threaded posts from online discussion forums in each of three languages: Chinese, Egyptian Arabic and English.
To create a corpus with both a high volume of data and a reasonable concentration of threads that met content and language requirements, LDC used a three-stage collection strategy: manual data scouting seeded the corpus with appropriate content, a semi-supervised harvesting process augmented the corpus with larger quantities of automatically harvested data, and a human triage task selected harvested content for the downstream pipeline.
BOLT Data Collection
BOLT developed technology that enables English speakers to retrieve and understand information from informal foreign language sources including chat, text messaging and spoken conversations. The genres of interest to BOLT were characterized by inherent variation and inconsistency, motivating the development of new collection and annotation methods.
Papers and Presentations
Stephanie Strassel, Alexis Mitchell, Shudong Huang
Multilingual Resources for Entity Extraction
ACL 2003: 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, July 7-12
Multilingual and Mixed-language Named Entity Recognition Workshop: Combining Statistical and Symbolic Models
Available: Paper in PDF
Annotation Tasks and Specifications
ACE 2008
ACE 2008 tasks included Local (within-document) EDR (Entity Detection and Recognition) and RDR (Relation Detection and Recognition) for English and Arabic. ACE 2008 also included a pilot task for Global (cross-document) EDR and RDR for English and Arabic.