NPS Internet Chatroom Conversations, Release 1.0
consists of 10,567 English posts (45,068 tokens) gathered from
age-specific chat rooms of various online chat services in October
and November 2006. Each file is a text recording from one of these
chat rooms for a short period on a particular day. Users should be
aware that some of the conversations in this corpus feature subjects
and language that some people may find offensive or objectionable,
including discussions of a sexual nature. This corpus was developed
by researchers at the Department of Computer Science, Naval Postgraduate School, Monterey,
California.
Although much work has been accomplished in Natural Language Processing (NLP)
in traditional written and spoken language domains, relatively little has been
done in the newer computer-mediated communication (CMC) domains enabled
by the Internet, such as text-based chat. One factor inhibiting research in
this area has been the dearth of annotated CMC corpora available to the broader
research community, despite the increasing use of CMC in a variety of areas
and applications. NPS Internet Chatroom Conversations is one of the first text-based
chat corpora tagged with lexical and discourse information. This corpus might
be used to develop stochastic NLP applications that perform tasks such as conversation
thread topic detection, author profiling, entity identification, and social
network analysis.
Each post is annotated with a chat dialog-act tag, and individual tokens within
each post are annotated with part-of-speech tags. 3,507 tokenized posts were
automatically tagged using a part-of-speech tagger trained on the Penn
Treebank corpora, combined with a regular expression that identified privacy-masked
user names and emoticons. Similarly, simple regular expression matching was
used to assign an initial chat dialog-act tag to each post in this subset.
This initial tagging was verified by hand (with corrections made where necessary).
The remaining 7,060 posts were POS-tagged using a POS tagger that was trained
on the newly hand-tagged chat data and the Penn Treebank corpora. Dialog-act
tagging on the remaining posts was accomplished using a back-propagation neural
network trained on 21 features of the initial dialog-act-labeled posts. The
tagging of this second group of posts was also manually verified (and corrected
where necessary). Ultimately, all of the 10,567 privacy-masked posts, representing
45,068 tokens, were annotated with manually verified part-of-speech and dialog-act
information.
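
The regular-expression step in the first tagging pass can be illustrated with a minimal Python sketch. The two patterns, the labels "USER" and "EMOTICON", and the function name pretag are assumptions made for illustration only; the expressions actually used to build the corpus are not given here. The sketch labels tokens that look like privacy-masked user names or common emoticons and leaves every other token for a part-of-speech tagger.

```python
import re

# Hypothetical patterns for illustration; the corpus authors' actual
# regular expressions are not reproduced in this description.
# Masked user names look like "10-19-20sUser7" (date, room, "User", integer).
MASKED_USER = re.compile(r"\d{2}-\d{2}-[A-Za-z0-9]+?User\d+")
# A deliberately small emoticon pattern: eyes, optional nose, mouth.
EMOTICON = re.compile(r"[:;=8][-o*']?[)(\]\[dDpP/\\|]")


def pretag(tokens):
    """Label tokens a regex can settle; leave the rest as None for a POS tagger."""
    labeled = []
    for tok in tokens:
        if MASKED_USER.fullmatch(tok):
            labeled.append((tok, "USER"))
        elif EMOTICON.fullmatch(tok):
            labeled.append((tok, "EMOTICON"))
        else:
            labeled.append((tok, None))
    return labeled


if __name__ == "__main__":
    for token, label in pretag(["10-19-20sUser7", "hi", ":)"]):
        print(token, label)
```

In a fuller pipeline, tokens left as None would be passed to a statistical tagger and the combined output checked by hand, as described above.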
Filenames consist of date, target age group, and number of posts. For example,
the file 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat
room on October 19, 2006. The posts have been privacy-masked in two ways. First,
all usernames have been changed to generic names of the form "UserN",
where N is a unique integer consistently used for each respective poster across
all files. The posts were then read by humans to remove other personally identifiable
information. Within each file, usernames are prepended with the date and chat room
portions of the filename. So in the above filename example, UserN becomes 10-19-20sUserN.
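
The naming convention above can be unpacked mechanically. The following Python sketch is one possible way to do so; the function names parse_filename and masked_user are hypothetical, and the year default of 2006 is taken from the collection period stated earlier, since the filename itself carries no year.

```python
import re
from datetime import date

# Pattern for names such as "10-19-20s_706posts.xml":
# month-day-room, an underscore, the post count, then "posts.xml".
FILENAME = re.compile(
    r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<room>[A-Za-z0-9]+)"
    r"_(?P<count>\d+)posts\.xml"
)


def parse_filename(name, year=2006):
    """Split a corpus filename into date, chat room, post count, and user prefix."""
    m = FILENAME.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized corpus filename: {name!r}")
    return {
        "date": date(year, int(m["month"]), int(m["day"])),
        "room": m["room"],
        "posts": int(m["count"]),
        # Usernames inside the file are prefixed with the date and room.
        "user_prefix": f'{m["month"]}-{m["day"]}-{m["room"]}',
    }


def masked_user(name, n):
    """Build the in-file form of UserN for the given corpus filename."""
    return f'{parse_filename(name)["user_prefix"]}User{n}'


if __name__ == "__main__":
    print(parse_filename("10-19-20s_706posts.xml"))
    print(masked_user("10-19-20s_706posts.xml", 5))
```

Stripping the prefix from an in-file username recovers the corpus-wide UserN identifier, which is what allows a poster to be followed across files.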
Portions © 2010 Trustees of the University of Pennsylvania