Introduction
This corpus was created and distributed by the Linguistic Data
Consortium (LDC) to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Effective,
Affordable, Reusable Speech-to-Text) Program. The data was previously
released to the EARS MDE community as LDC2004E31.
The goal of MDE is to enable technology that can take raw Speech-to-Text
output and refine it into forms that are of more use to humans and to
downstream automatic processes. In simple terms, this means the
creation of automatic transcripts that are maximally readable. This
readability might be achieved in a number of ways: flagging non-content
words like filled pauses and discourse markers for optional removal; marking
sections of disfluent speech; and placing boundaries at natural
breakpoints in the flow of speech so that each sentence or other
meaningful unit of speech might be presented on a separate line within
the resulting transcript. Natural capitalization, punctuation and
standardized spelling, plus sensible conventions for representing
speaker turns and identity, are further elements in the readable
transcript. LDC has defined a SimpleMDE annotation task specification
and has annotated English telephone and broadcast news data to
provide training data for MDE.
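As a rough illustration of the readability goal described above, the following Python sketch renders a token stream as a cleaned transcript. It is not the SimpleMDE format or any LDC tool: the `Token` class, the label names (`filled_pause`, `discourse_marker`, `disfluent`) and the `ends_unit` flag are hypothetical stand-ins for the kinds of metadata the annotation provides.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One transcript token with hypothetical MDE-style metadata."""
    text: str
    kind: str = "word"       # assumed labels: word, filled_pause, discourse_marker, disfluent
    ends_unit: bool = False  # True if a sentence-like unit ends after this token


# Token kinds flagged for optional removal (an assumption for this sketch).
REMOVABLE = {"filled_pause", "discourse_marker", "disfluent"}


def render_readable(tokens, drop=REMOVABLE):
    """Return one string per sentence-like unit, omitting tokens whose kind is in `drop`."""
    lines, current = [], []
    for tok in tokens:
        if tok.kind not in drop:
            current.append(tok.text)
        if tok.ends_unit:
            # Close the unit even if its last token was dropped.
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no explicit boundary
        lines.append(" ".join(current))
    return lines


example = [
    Token("uh", "filled_pause"),
    Token("you know", "discourse_marker"),
    Token("we", "disfluent"),
    Token("went", "disfluent"),
    Token("we"),
    Token("drove"),
    Token("home", ends_unit=True),
    Token("it"),
    Token("was"),
    Token("late", ends_unit=True),
]

for line in render_readable(example):
    print(line)
```

Because removal is optional, passing `drop=set()` keeps every token and only applies the line breaks at unit boundaries.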
In this release, some original annotations contained in LDC2004E31 have
been re-mapped to new MDE elements to support better annotation
consistency. In particular, the mapping affects Discourse Responses
(DR), Discourse Markers (DM) and Backchannel SUs (BC). A description of
the original mapping proposed by ICSI appears in 3) below, with complete
documentation of the mapping rules contained in the
docs/drmap-discussion directory. The scripts used to apply the
mapping can be found in the docs/scripts/drmap directory.
Samples
For an example of the data in this corpus, please review the following XML samples.
Content Copyright
Portions © 2004 Trustees of the University of Pennsylvania, © 2003 American Broadcasting Company, © 2003 National Broadcasting Company, © 2003 Public Radio International, © 2003 Cable News Network, Inc. All Rights Reserved, © 2003 National Cable Satellite Corporation
The World is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.