BOLT SMS and Chat Data Collection

LDC collected naturally occurring SMS and Chat data in Chinese, Egyptian Arabic and English.

LDC developed a robust message collection system which integrated participant enrollment components, live collection components (real-time capture of SMS or chat messages between pairs of consented, enrolled users) and donation collection components (user contribution of archived SMS or chat messages).

A customized page within LDC’s WebAnn framework was created for users to register and agree to participate in the live collection, message donation, or both. This front end handled user authentication and connection to the generalized enrollment infrastructure. Users’ real names were not collected. Each enrolled user was assigned a unique ID and could choose a username that would identify them to other participants. Users were asked to provide scheduling and contact information (e.g. phone number for live SMS collection, user ID for live chat collection). Demographic information and other personal information could be provided at the user’s option. All personal identifying information was stored in a separate, secure database never linked to the corpus data.

For live collection of SMS, LDC’s collection system MCol chose a pair of conversation partners and initiated a session by sending a text message to the designated participants. Users responded to MCol’s invitation by texting with one another. MCol intercepted each user’s message and relayed it transparently to the other party. During live collection of chat, the MCol bot created a chat room and added the designated participants to the room, where they could exchange messages. The bot simply “sat” in the room and recorded the exchanged messages.
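
The relay step for live SMS can be illustrated with a minimal sketch. This is not MCol's actual code; the class name RelaySession, its fields, and the user IDs are hypothetical, and real message delivery (the SMS gateway) is left out. The sketch only shows the idea of recording each message before passing it on to the other party.

```python
from dataclasses import dataclass, field


@dataclass
class RelaySession:
    """Hypothetical stand-in for one live SMS session between two enrolled users."""
    user_a: str
    user_b: str
    log: list = field(default_factory=list)  # captured (sender_id, text) pairs

    def relay(self, sender: str, text: str) -> tuple:
        """Record the message, then return (recipient_id, text) for forwarding."""
        if sender == self.user_a:
            recipient = self.user_b
        elif sender == self.user_b:
            recipient = self.user_a
        else:
            raise ValueError(f"{sender!r} is not part of this session")
        self.log.append((sender, text))  # capture happens before forwarding
        return recipient, text
```

In this sketch a caller would pass the returned pair to whatever component actually sends the text; the log then holds the full two-sided conversation.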

Participants who enrolled in donation collection uploaded their existing SMS and chat messages using a simple web interface created for the project. Because of the wide range of phone systems and chat clients used by potential participants, LDC conducted surveys prior to collection to identify the most popular systems and apps among user populations (languages and countries) of interest. As a result, message donation from a range of clients and apps was supported, including:

  • SMS: iMessage, Android SMS, Symbian SMS, Viber, BlackBerry
  • Chat: WhatsApp, QQ, Google chat, Skype chat, Yahoo Messenger

LDC also developed custom parsers to process incoming message archives.

Users were instructed how to locate existing SMS and chat messages on their device, create an archive file, and export the file for upload to LDC’s collection platform. The data archive was first uploaded to a temporary holding tank, where an automated process detected its file format and selected the appropriate message parser. The parser then divided the archive into individual conversations and performed some simple sanity checks. Any personal identifying information contained in message metadata (e.g. phone numbers or usernames) was automatically removed during parsing. The parsed conversations were then presented back to the user in a simple web GUI that allowed the user to edit or remove any part of the archive they did not wish to donate. After the user was satisfied with the edited archive, they clicked a button to allow it to be uploaded to the collection database. Only conversations that the user explicitly approved were stored in the database; the original unedited archive was deleted from the temporary holding tank.
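
The donation pipeline described above (format detection, parser selection, metadata scrubbing, user approval) might be sketched as follows. Everything here is an assumption made for illustration: the two archive formats, the field names thread, sender and text, the speaker_N labels, and all function names are hypothetical and do not describe LDC's actual parsers or the formats of the supported apps.

```python
import json


def detect_format(raw: str) -> str:
    """Guess the archive format from its first character (assumed heuristic)."""
    return "json" if raw.lstrip().startswith(("[", "{")) else "text"


def parse_json(raw: str) -> list:
    """Hypothetical JSON export: a list of {thread, sender, text} records."""
    convs = {}
    for msg in json.loads(raw):
        convs.setdefault(msg["thread"], []).append((msg["sender"], msg["text"]))
    return list(convs.values())


def parse_text(raw: str) -> list:
    """Hypothetical text export: one 'thread|sender|text' record per line."""
    convs = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        thread, sender, text = line.split("|", 2)
        convs.setdefault(thread, []).append((sender, text))
    return list(convs.values())


PARSERS = {"json": parse_json, "text": parse_text}


def scrub(conv: list) -> list:
    """Replace sender identifiers (e.g. phone numbers) with neutral labels."""
    labels = {}
    scrubbed = []
    for sender, text in conv:
        label = labels.setdefault(sender, f"speaker_{len(labels) + 1}")
        scrubbed.append((label, text))
    return scrubbed


def process_archive(raw: str) -> list:
    """Detect the format, split into conversations, and scrub metadata."""
    return [scrub(conv) for conv in PARSERS[detect_format(raw)](raw)]


def approved_only(convs: list, approved: set) -> list:
    """Keep only the conversations the donor explicitly approved (by index)."""
    return [conv for i, conv in enumerate(convs) if i in approved]
```

Note that this sketch scrubs only sender metadata; it does not touch message text, which in the process described here was left to the donor's own edits and to later manual auditing.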

All collected conversations (whether live or donated) were saved to the collection database where they were subject to post-hoc automatic validation, including language identification and duplicate content detection. Conversations not in the target languages were flagged and subject to manual review. Occasional duplicated conversations were found and flagged. Single-sided conversations were also flagged for removal. Conversations that passed the validation stage were migrated to a centralized conversation and message database where they could be accessed for manual auditing, selection and segmentation.
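
Two of the automatic checks named above, duplicate content detection and single-sided conversation flagging, can be illustrated with a short sketch. The method shown (hashing whitespace- and case-normalized text) is an assumed stand-in, not the validation LDC actually ran, and the function names and flag strings are hypothetical. Language identification is omitted because it would require an external model.

```python
import hashlib


def fingerprint(conv: list) -> str:
    """Hash the conversation text after collapsing whitespace and case."""
    normalized = "\n".join(" ".join(text.split()).lower() for _, text in conv)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def validate(conversations: list) -> list:
    """Return one set of flags per conversation, in input order."""
    seen = set()
    all_flags = []
    for conv in conversations:
        flags = set()
        # A conversation with fewer than two distinct senders is single-sided.
        if len({sender for sender, _ in conv}) < 2:
            flags.add("single_sided")
        # A repeated fingerprint marks a later copy of earlier content.
        fp = fingerprint(conv)
        if fp in seen:
            flags.add("duplicate")
        seen.add(fp)
        all_flags.append(flags)
    return all_flags
```

Exact hashing of this kind catches only verbatim or near-verbatim copies; detecting partial overlap between conversations would need a different technique.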

After collection and automated processing, all messages were manually audited by project staff who had received special training in the protection of participants’ privacy. Manual auditing was necessary for two reasons: to determine which conversations were suitable for downstream translation and annotation tasks, and to exclude any messages or conversations that contained personal identifying information or sensitive content that had not already been redacted by the participant. Auditors used a web-based GUI to flag unacceptable content at the level of the individual message or the entire conversation, and to split or merge messages to create message units of appropriate size and semantic integrity for translation and downstream annotation.
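
The split and merge operations available to auditors can be pictured as simple list edits. The sketch below is illustrative only: the function names are hypothetical, conversations are assumed to be lists of (sender, text) pairs, and the rule that only consecutive messages from the same sender may be merged is an assumption not stated in the description above.

```python
def merge_messages(conv: list, i: int) -> list:
    """Merge message i with message i + 1 into one message unit."""
    (s1, t1), (s2, t2) = conv[i], conv[i + 1]
    if s1 != s2:
        # Assumed constraint: a unit never mixes text from two senders.
        raise ValueError("can only merge consecutive messages from one sender")
    return conv[:i] + [(s1, f"{t1} {t2}")] + conv[i + 2:]


def split_message(conv: list, i: int, offset: int) -> list:
    """Split message i into two units at the given character offset."""
    sender, text = conv[i]
    left, right = text[:offset].strip(), text[offset:].strip()
    if not left or not right:
        raise ValueError("split must leave text on both sides")
    return conv[:i] + [(sender, left), (sender, right)] + conv[i + 1:]
```

Both functions return a new list and leave the input unchanged, so an auditing tool built this way could keep the original messages alongside the edited units.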

SMS/Chat Data collected for BOLT