2000 Communicator Dialogue Act Tagged was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004T15, ISBN 1-58563-305-4.
This corpus is an addendum to the 2000 Communicator Evaluation
corpus produced by the LDC in 2002. This addendum contains
annotations on the transcriptions of the system and user
utterances as taken from the logfiles of the 2000 Communicator Evaluation.
Dialogue Act annotations are provided for system utterances in
the dialogues. The dialogue act tags follow the DATE (Dialogue
Act Tagging for Evaluation) scheme. In addition, both
system and user utterances are tagged for named entities.
For further description of the 2000 Communicator Evaluation
corpus, please refer to the main publication from 2002 (LDC2002S56).
The complete Dialogue Act annotated corpus is available as a single XML text
file totalling approximately 16 MB.
The total number of dialogues is 648.
There are 314,223 words (tokens) and 1,403,985 unique words.
Each dialogue is segmented into system and user turns. The total
number of turns for the entire corpus is 24,728 (13,013
system turns and 11,715 user turns).
Except for one system, the logfiles contain no utterance segmentation
within turns, so the number of utterances is the same as the number of
turns. Utterance segmentation is instead reflected in the dialogue act
segmentation. The total number of tagged dialogue acts is 22,701
with 61 unique tags. There are a total of 275,938 words in the system
utterances and a total of 38,285 words in the user utterances.
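
The turn and word totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures given in this description; the variable names are illustrative and nothing here reads the corpus itself.

```python
# Figures copied from the corpus description; nothing is read from the XML file.
system_turns, user_turns = 13_013, 11_715
system_words, user_words = 275_938, 38_285

total_turns = system_turns + user_turns
total_words = system_words + user_words

print(total_turns)  # 24728, the stated number of turns
print(total_words)  # 314223, the stated number of word tokens

# Share of all word tokens that occur in system utterances.
system_share = system_words / total_words
print(f"{system_share:.1%}")  # 87.8%
```

Both sums agree with the stated totals, and most of the word tokens come from system utterances.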
Dialogue act tagging was done automatically by pattern matching
against human-labeled utterances used by the nine participating
Communicator systems. Named entity tagging followed the same
methodology.
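
As a rough illustration of tagging by pattern matching against labeled utterances, the sketch below pairs regular expressions with labels and returns the first match. Everything in it is an assumption made for illustration: the patterns, the label strings, the `city` group, and the function name `tag_utterance` are hypothetical and are not taken from the corpus, the DATE tag set, or the tools actually used.

```python
import re

# Hypothetical templates: each pairs a regex for a known system prompt with an
# illustrative label. Neither the patterns nor the labels come from the corpus.
TEMPLATES = [
    (re.compile(r"^what city are you (leaving|departing) from\??$", re.I),
     "request_info:orig_city"),
    (re.compile(r"^(ok|okay)[,.]?\s+from (?P<city>[a-z ]+)\.?$", re.I),
     "implicit_confirm:orig_city"),
    (re.compile(r"^welcome to .+$", re.I),
     "opening"),
]

def tag_utterance(text):
    """Return (label, entities) for the first matching template, else (None, {})."""
    cleaned = text.strip()
    for pattern, label in TEMPLATES:
        match = pattern.match(cleaned)
        if match:
            # Named groups stand in for named entities captured by the template.
            return label, match.groupdict()
    return None, {}

print(tag_utterance("What city are you leaving from?"))
print(tag_utterance("Okay, from Boston."))
print(tag_utterance("I want to fly to Denver"))
```

A template approach of this kind suits system utterances, which tend to be generated from a fixed set of prompts; an utterance that matches no template is simply left untagged here.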
This research was funded by DARPA under contract MDA972-99-3-0003.
There are no updates available at this time.
Portions © 2004 Trustees of the University of Pennsylvania