BOLT Discussion Forum Data Collection
LDC collected threaded posts from online discussion forums in each of three languages: Chinese, Egyptian Arabic and English.
In order to create a corpus with both a high volume of data and a reasonable concentration of threads that met content and language requirements, LDC used a three-stage collection strategy: manual data scouting fed the corpus with appropriate content, a semi-supervised harvesting process augmented the corpus with larger quantities of automatically harvested data, and a human triage task selected harvested content for the downstream pipeline.
Data scouting was facilitated through a customized user interface developed by LDC for BOLT that recorded judgments about language, topic, and other properties for each scouted thread. This information informed the automatic harvesting process by helping to identify forums that were likely to contain additional useful threads. In the backend harvesting process, URLs submitted by data scouts were grouped by host site. For each site, configuration files were written for harvesting and conversion of the raw web data into a standard XML format.
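The grouping of scouted URLs by host site can be sketched as follows. This is only an illustration of the idea, not LDC's actual backend code; the example URLs and the function name are hypothetical, and the real process additionally produced per-site configuration files for harvesting and XML conversion, which are not modeled here.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_urls_by_host(urls):
    """Group scouted thread URLs by host site, so that harvesting and
    conversion configuration can be written once per site."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    return dict(by_host)

# Hypothetical scouted URLs (the real submissions are not in the source).
scouted = [
    "http://forum-a.example.com/thread/1",
    "http://forum-a.example.com/thread/2",
    "http://forum-b.example.net/viewtopic.php?t=99",
]
groups = group_urls_by_host(scouted)
```

Grouping by host before writing configuration reflects the observation in the text that harvesting and format conversion are site-specific: threads from the same forum software share page structure, so one configuration covers all of a site's threads.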
While all harvested data was made available to BOLT performers, only a small subset was selected for manual translation and annotation to create BOLT training, development and evaluation sets. It was important that the data selected for annotation met requirements for language and content; it was also highly desirable that the selected data be high-value, i.e., that it not duplicate the salient features of existing training data. For these reasons, data scouting and automatic harvesting were followed by a manual triage process. Sentence segmentation was then performed on the selected data in order to provide a stable basis for later linguistic annotation activities, including translation and syntactic analysis.
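The source does not describe the segmentation tools LDC used, but the role sentence segmentation plays here can be illustrated with a minimal rule-based sketch: splitting post text on sentence-final punctuation yields stable, addressable units to which translations and syntactic analyses can later be aligned. A real segmenter for forum text (and for Chinese or Arabic) would need language-specific rules well beyond this.

```python
import re

def segment_sentences(text):
    """Minimal illustrative segmenter: split on sentence-final
    punctuation (., !, ?) followed by whitespace.  Each resulting
    segment is a unit that downstream annotation can reference."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

segments = segment_sentences(
    "Posts were scouted. Threads were harvested! Were they triaged?"
)
```

Fixing segment boundaries before annotation begins means that translation, treebanking, and other layers all refer to the same units, which is why the text calls segmentation a "stable basis" for later work.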