English CTS Treebank with Structural Metadata, Linguistic Data Consortium
(LDC) catalog number LDC2009T01 and ISBN 1-58563-476-X, consists of metadata
and syntactic structure annotations for 144 English telephone conversations,
or 140,000 words, from data used in the EARS
(Effective, Affordable, Reusable Speech-to-Text) program. English CTS Treebank
with Structural Metadata was created to support EARS work in English.
It applies EARS metadata extraction annotations and Penn Treebank methods to
conversations from Switchboard-1
Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol
(released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73).
The purpose of the EARS program was to develop robust speech recognition technology
to address a range of languages and speaking styles. LDC provided conversational
and broadcast speech and transcripts, annotations, lexicons and texts for
language modeling in each of the EARS languages (Arabic, Chinese, English).
LDC also supported a metadata extraction
(MDE) research evaluation, the goal of which was to enable technology to
take raw speech-to-text (STT) output and to refine it into forms of more use to
humans and to downstream automatic processes. In simple terms, this means the
creation of automatic transcripts that are maximally readable. This readability
might be achieved in a number of ways: removing non-content words like filled
pauses and discourse markers from the text; removing sections of disfluent speech;
and creating boundaries between natural breakpoints in the flow of speech so
that each sentence or other meaningful unit of speech might be presented on
a separate line within the resulting transcript. Natural capitalization, punctuation
and standardized spelling, plus sensible conventions for representing speaker
turns and identity are further elements in the readable transcript. Some of
the data developed by LDC for the MDE task is available in the LDC Catalog:
MDE Training Data Speech (LDC2005S16) and RT-04
MDE Training Data Text/Annotations (LDC2005T24).
The telephone speech used in English CTS Treebank with Structural Metadata was
drawn from Switchboard-1
Release 2 (LDC97S62) and from data collected for EARS under the Fisher Protocol
(released in EARS as LDC2004E16, LDC2004E29 and LDC2005E73). The speech for
all files was recorded on two channels with a sampling rate of 8000 Hz and was
encoded in ulaw format.
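As a rough illustration of what this encoding implies for users of the audio, the sketch below decodes two-channel, 8-bit mu-law samples to 16-bit linear PCM using the standard G.711 expansion. It assumes a headerless byte payload with the two channels interleaved sample by sample; any file header would have to be removed first, and the actual file layout in the corpus is not described here. The function names are hypothetical.

```python
def ulaw_byte_to_pcm16(byte):
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80                   # top bit carries the sign
    exponent = (u >> 4) & 0x07        # 3-bit segment number
    mantissa = u & 0x0F               # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(payload):
    """Split interleaved two-channel mu-law bytes into two PCM sample lists.

    Assumption: samples alternate channel A, channel B, channel A, ...
    """
    channel_a = [ulaw_byte_to_pcm16(b) for b in payload[0::2]]
    channel_b = [ulaw_byte_to_pcm16(b) for b in payload[1::2]]
    return channel_a, channel_b
```

At 8000 Hz, each channel list holds 8000 samples per second of conversation, one list per side of the call.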
The Fisher data was carefully transcribed by LDC staff using the RT-04
Transcription Specification, Version 3.1; for the Switchboard data, transcripts
developed at the Institute for Signal and Information Processing at Mississippi
State University were used.
Structural Metadata Annotation
The transcribed data was annotated according to SimpleMDE
V6.2, an annotation task defined by LDC that consisted of the following
elements: Edit Disfluencies (repetitions, revisions, restarts and complex disfluencies),
Fillers (e.g., filled pauses and discourse markers) and SUs, or syntactic/semantic
units. Each of these elements is described below:
- Edit Disfluencies: Edit disfluencies, or speech repairs,
occur when speakers correct or alter their utterances or abandon them entirely
and start over. Edit disfluencies have a more complex internal structure than
fillers, consisting of the original utterance (reparandum), an interruption
point, an optional editing phase and a correction. There are four types of
disfluencies annotated in SimpleMDE: repetitions; revisions; restarts; and
complex disfluencies, which consist of multiple or nested edits. In SimpleMDE,
annotators labeled only the deletable region (DELREG) of the disfluency, which
corresponds to the reparandum. In cases where the reparandum contained multiple
disfluent utterances, annotators identified the maximal extent of the disfluent
portion, starting with the left edge of the first disfluency and continuing
to the right edge (IP) of the final disfluency.
- Fillers: While the term filler has traditionally been synonymous
with filled pause, SimpleMDE uses the term to encompass a broad set of vocalized
space-fillers: filled pauses (FPs), discourse markers (DMs), explicit editing
terms (EETs) and asides/parentheticals (A/Ps). Excepting the last category,
fillers can be understood as words that do not alter the propositional content
of the material into which they are inserted. For example, FPs include nonlexemes,
such as um or ah, that speakers use to indicate hesitation
or to maintain control of a conversation. A DM is a word or phrase that functions
primarily as a structuring unit of spoken language, such as actually,
now, anyway, see, basically, so,
I mean, well, let's see, you know,
like, you see. DMs often signal the speaker's intention
to mark a boundary in discourse, like a change in speaker or the beginning
of a new topic. There is no exhaustive list of DMs for a given language due
to their wide range of functions, colloquial variations, and the difficulty
of defining them precisely. Furthermore, words that are used as discourse
markers can be used for other purposes. EETs occur during an edit disfluency
and consist of an overt statement (e.g., I'm sorry) from the
speaker recognizing the disfluency. Asides and parentheticals (A/Ps) are different
from the other filler types in that they convey semantic information in the
form of a short side comment before returning to the main topic. The comment may
be either on a new topic (asides) or on the same topic as the larger utterance
(parentheticals). Both break up the stream of discourse and are often accompanied
by noticeable prosodic features.
- Syntactic Units: One of the goals of MDE annotation is the
identification of all units within the discourse that function to express
a complete thought or idea on the part of the speaker. Within MDE these elements
are called SUs (Syntactic, Semantic or Slash Units). As with disfluency annotation,
the goal of SU labeling is to improve transcript readability by presenting
information in small, structured, coherent chunks. There are four sentence-level
SUs. Statements are complete SUs that function as a declarative statement
and are marked with /.; questions are complete SUs that function as an interrogative
and are marked with /?. Backchannels are an open class of words uttered by
the non-dominant speaker to indicate engagement in the conversation and are
marked with /@. Incomplete SUs occur when an utterance does not constitute
a grammatically complete sentence, phrase or continuer, and does not express
a complete thought; these are marked with /-. To enhance inter-annotator consistency,
there are also sentence-internal clausal and coordinating SUs (/, and /&).
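The SU markers described above can be used to break a transcript into typed units. The following is a minimal sketch, assuming the markers appear as separate whitespace-delimited tokens in a plain-text transcript; the actual layout of the annotation files may differ, and the function name and the "unterminated" label are hypothetical.

```python
# Sentence-level SU markers and the unit type each one closes.
SENTENCE_LEVEL = {
    "/.": "statement",
    "/?": "question",
    "/@": "backchannel",
    "/-": "incomplete",
}
# Sentence-internal markers (clausal and coordinating) do not end a unit.
SENTENCE_INTERNAL = {"/,", "/&"}


def split_into_sus(text):
    """Return a list of (su_type, words) pairs, one per sentence-level SU."""
    units = []
    current = []
    for token in text.split():
        if token in SENTENCE_LEVEL:
            units.append((SENTENCE_LEVEL[token], " ".join(current)))
            current = []
        elif token in SENTENCE_INTERNAL:
            continue  # drop the internal marker, keep building the same unit
        else:
            current.append(token)
    if current:
        # Trailing words with no closing marker (hypothetical label).
        units.append(("unterminated", " ".join(current)))
    return units
```

Printing each returned unit on its own line gives the kind of one-unit-per-line transcript that the MDE task aims for.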
Parsing and Treebank Annotation
The existing MDE annotations were converted from RTTM format into a format
appropriate for the automatic parser, enabling the generation of accurate parses
in a form that would require as little hand modification by the Treebank team
as possible. RTTM is a format developed by NIST (National Institute of Standards
and Technology) for the EARS program that labels each token in the reference
transcript according to the properties it displays (e.g., lexeme versus non-lexeme,
edit, filler, SU). The initial parse trees were produced using an entropy-based parser, which was trained on Switchboard transcripts supplemented
with Wall Street Journal data (with a 4:1 ratio). These parses served as the
starting point for a manual process which corrected the initial pass for each
To provide high quality parses, scripts were used to separate the edited material
from the fluent part of each SU prior to parsing it using the MDE annotations.
The edits were then parsed and reinserted into the tree for presentation to
the annotators. Some important issues are listed below:
- Words in each Syntactic Unit were tokenized using LDC's scripts.
- All of the punctuation provided in the markup was maintained in the SU for
parsing because it was likely to enhance parse accuracy and was expected to
appear in the final tree annotations.
- For parsing complex edits, contiguous edits were concatenated into one unit
for parsing. In a few cases, edits occur across SUs in MDE annotations.
- Special treatment was required in the scripts for regions unannotated for
MDE, complex edits, and SUs that consisted solely of edited material.
- The string "EDITED" was used as the non-terminal tag for edit regions
inserted into the fluent parse trees. Additionally, a terminal node for the
IP, (DISFL-IP +), was added at the end of the edits in an attempt to make the
tree follow the conventions used in the Switchboard Treebank.
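The separate-then-reinsert procedure described above can be sketched roughly as follows. This is not LDC's script: it assumes each SU arrives as a list of (word, is_edit) pairs derived from the MDE annotations, it skips the parsing step entirely, and it emits a flat bracketed string rather than a real parse tree. The function names are hypothetical; only the EDITED tag and the (DISFL-IP +) node come from the description above.

```python
from itertools import groupby


def split_fluent_and_edits(tokens):
    """Separate edited material from the fluent part of one SU.

    tokens: list of (word, is_edit) pairs.
    Returns the fluent words plus a list of (position, edit_words), where
    position is the index in the fluent words before which the edit occurred.
    Contiguous edit tokens are concatenated into a single unit.
    """
    fluent = []
    edits = []
    for is_edit, group in groupby(tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if is_edit:
            edits.append((len(fluent), words))
        else:
            fluent.extend(words)
    return fluent, edits


def reinsert_edits(fluent, edits):
    """Put each edit unit back in place, wrapped in an EDITED node."""
    edits_by_position = dict(edits)
    pieces = []
    for i in range(len(fluent) + 1):
        if i in edits_by_position:
            edited = " ".join(edits_by_position[i])
            # The interruption point node closes every edit region.
            pieces.append("(EDITED " + edited + " (DISFL-IP +))")
        if i < len(fluent):
            pieces.append(fluent[i])
    return " ".join(pieces)
```

In a fuller pipeline, the fluent words and each edit unit would be sent to the parser separately between these two steps, so that the disfluent material does not degrade the parse of the fluent part.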
Manual treebank annotation was performed in accordance with existing treebank
guidelines for conversational telephone speech as well as in accordance with
revised general guidelines for treebanking.
For an example of the data in this corpus, please listen to this audio sample (wav) and view its parse tree (PDF). Note that the opening greeting of the conversation has been omitted in the parse tree. Only the discussion on holidays is present.
Portions © 2004-2005, 2009 Trustees of the University of Pennsylvania