The MDE RT-03 Training Data Text and Annotations corpus was
produced by the Linguistic Data Consortium (LDC) under catalog number
LDC2004T12 and ISBN 1-58563-301-1.
This data was originally created to support the DARPA EARS
(Efficient, Affordable, Reusable Speech-to-Text) Program in
Metadata Extraction (MDE). The goal of EARS MDE is to enable
technology that can take raw Speech-to-Text output and refine it
into forms that are of more use to humans and to downstream
automatic processes.
The data in this release consists of English Conversational Telephone
Speech (CTS) and Broadcast News (BN) transcripts and annotations. The
corresponding speech data is available as MDE RT-03 Training Data Speech.
There are 633 files, totalling approximately 747 MB with a total
of 764,978 tokens. The transcripts and annotations cover
approximately 20 hours of Broadcast News and over 40 hours of
Conversational Telephone Speech data. The annotated data was
originally developed to support the DARPA EARS Metadata Extraction
(MDE) Program, and was distributed as training data for the RT-03F
evaluation.
The CTS data was drawn from the Switchboard-1 Release 2 corpus.
The BN speech data was drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct sources:
American Broadcasting Company
National Broadcasting Company
Public Radio International
Cable News Network
The transcripts within this corpus have been annotated for various kinds of
metadata. The goal of MDE is to enable technology that can take raw
Speech-to-Text output and refine it into forms that are of more use to
humans and to downstream automatic processes. In simple terms, this means
the creation of automatic transcripts that are maximally readable. To this
end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE,
annotators identify four types of fillers: filled pauses like "uh" and
"um," discourse markers like "you know," asides and parentheticals, and
editing terms like "sorry" and "I mean." Edit disfluencies are also
identified; the full extent of the disfluency (or string of adjacent
disfluencies) and interruption points are tagged. Annotators further
identify SUs (alternately semantic units, sense units, syntactic units,
slash units or sentence units); that is, units within the discourse that
function to express a complete thought or idea on the part of a speaker.
As with disfluency annotation, the goal of SU labeling is to improve
transcript readability, here by creating a transcript in which information
is presented in small, structured, coherent chunks rather than long turns
or stories. There are four types of sentence-level SUs: statements,
questions, backchannels and incomplete SUs. To enhance inter-annotator
consistency, the annotation task also identifies a number of sub-sentence
SU boundaries (coordination and clausal SUs). The docs directory contains the complete set
of SimpleMDE annotation guidelines used to create this data.
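To make the readability goal concrete, the following Python sketch shows how annotations of this kind could be used to clean up a transcript: fillers and the edited-out part of a disfluency are dropped, and the remaining words are broken at SU boundaries. The `Token` class, its field names, the label strings and the punctuation mapping are all hypothetical and chosen for illustration only; they are not the corpus's actual tag set or file format, and the example utterance is made up.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative punctuation per sentence-level SU type (an assumption, not part
# of the SimpleMDE guidelines).
SU_PUNCT = {"statement": ".", "question": "?", "backchannel": ".", "incomplete": "--"}


@dataclass
class Token:
    text: str
    filler: Optional[str] = None   # e.g. "filled_pause", "discourse_marker"
    in_edit: bool = False          # True inside the edited-out part of a disfluency
    su_end: Optional[str] = None   # SU type if an SU boundary follows this token


def readable(tokens: List[Token]) -> List[str]:
    """Drop fillers and edited-out words, then split the rest at SU boundaries."""
    units, current = [], []
    for tok in tokens:
        if tok.filler is None and not tok.in_edit:
            current.append(tok.text)
        if tok.su_end is not None:
            if current:
                units.append(" ".join(current) + SU_PUNCT[tok.su_end])
            current = []
    if current:  # trailing words with no SU boundary marked
        units.append(" ".join(current))
    return units


# Made-up utterance: "uh i want to i need to go you know to boston is that okay"
EXAMPLE = [
    Token("uh", filler="filled_pause"),
    Token("i", in_edit=True), Token("want", in_edit=True), Token("to", in_edit=True),
    Token("i"), Token("need"), Token("to"), Token("go"),
    Token("you", filler="discourse_marker"), Token("know", filler="discourse_marker"),
    Token("to"), Token("boston", su_end="statement"),
    Token("is"), Token("that"), Token("okay", su_end="question"),
]

for unit in readable(EXAMPLE):
    print(unit)
```

The sketch deliberately ignores interruption points and sub-sentence SU boundaries; the annotation guidelines in the docs directory remain the authority on how those are defined.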
The data appears in two formats. The AG Atlas (ag.xml) format represents the
native annotation format, and utilizes the Annotation Graph Library. This data is
best explored using the LDC MDE Toolkit, which is freely available at http://www.ldc.upenn.edu/Projects/MDE/Tools.
The data is also provided in the RTTM format, developed by NIST to support
the EARS Program. The RTTM format labels each token in the reference
transcript according to the properties it displays: lexeme vs. non-lexeme;
edit, filler, SU, etc.
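As a rough illustration of working with token-level labels like these, the Python sketch below reads RTTM-style records and tallies them by type. It assumes nine whitespace-separated fields per line (type, file, channel, start time, duration, orthography, subtype, speaker, confidence), `<NA>` for missing values, and `;;` for comment lines; that layout is an assumption to be checked against the format description shipped with the data. The sample lines and subtype values are invented, not taken from the corpus.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class RttmRecord(NamedTuple):
    kind: str                 # e.g. LEXEME, NON-LEX, FILLER, EDIT, SU
    file: str
    channel: str
    tbeg: Optional[float]     # start time in seconds
    tdur: Optional[float]     # duration in seconds
    ortho: Optional[str]      # orthographic form, if any
    stype: Optional[str]      # subtype
    speaker: Optional[str]
    conf: Optional[float]


def _na(value: str, cast=str):
    """Map the '<NA>' placeholder to None, otherwise convert the value."""
    return None if value == "<NA>" else cast(value)


def parse_rttm(text: str) -> List[RttmRecord]:
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):   # skip blanks and comments
            continue
        parts = line.split()
        if len(parts) != 9:
            raise ValueError(f"expected 9 fields, got {len(parts)}: {line!r}")
        kind, file, chan, tbeg, tdur, ortho, stype, spkr, conf = parts
        records.append(RttmRecord(
            kind, file, chan,
            _na(tbeg, float), _na(tdur, float),
            _na(ortho), _na(stype), _na(spkr), _na(conf, float),
        ))
    return records


# Invented sample for illustration only.
SAMPLE = """\
;; made-up sample, not taken from the corpus
SPEAKER sw_0000 1 0.00 2.10 <NA> <NA> A <NA>
SU sw_0000 1 0.00 2.10 <NA> statement A <NA>
FILLER sw_0000 1 0.00 0.25 <NA> filled_pause A <NA>
LEXEME sw_0000 1 0.00 0.25 uh fp A <NA>
LEXEME sw_0000 1 0.30 0.20 i lex A <NA>
NON-LEX sw_0000 1 0.55 0.30 <NA> laugh A <NA>
"""

records = parse_rttm(SAMPLE)
counts = Counter(r.kind for r in records)
words = [r.ortho for r in records if r.kind == "LEXEME"]
print(counts)
print(words)
```

Keeping each record as a named tuple makes it easy to filter by type, for example to separate lexemes from the metadata objects (fillers, edits, SUs) that span them.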
General information about the EARS MDE annotation effort,
including free annotation tools and annotation guidelines,
can be found on LDC's EARS MDE Project Page.
There are no updates available at this time.
Portions (c) 1998 American Broadcasting
Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public
Radio International, (c) 1997 National Cable Satellite Corporation, (c)
2004 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.