Introduction
NXT Switchboard Annotations brings together in a single XML format,
NITE XML, the multiple layers of annotation performed on
a transcript subset from Switchboard-1
Release 2 (LDC97S62). NXT Switchboard Annotations was developed in a collaboration
among researchers from Edinburgh University, Stanford University and the University
of Washington.
The original Switchboard corpus is a collection of spontaneous telephone conversations
between previously unacquainted speakers of American English on a variety of
topics chosen from a pre-determined list. A subset of one million words from
those conversations was annotated for syntactic structure and disfluencies as
part of the Penn Treebank project.
Phonetic transcripts were generated by the International
Computer Science Institute, University of California Berkeley and later
corrected by the Institute for Signal Information Processing, Mississippi State
University. The Penn Treebank transcripts provided the basis for the NXT Switchboard
corpus, and the noun phrases from that subset were annotated for animacy. The
Treebank transcript was then aligned with the corresponding subset from the
corrected Mississippi
State (MS-State) transcript in order to provide word timing information.
Focus/contrast and prosodic annotations, as well as phone/syllable alignment,
were added next. The previous annotations of dialog acts
and prosody were converted to NITE XML. Lastly, hand annotations for markables
were added to provide information about their animacy and information structure,
including coreferential links.
NXT Annotation
NXT is an open source toolkit that
enables multiple linguistic annotations to be assembled into a unified database.
It uses a stand-off XML data format that consists of several XML files that
point to each other. The NXT format provides a data model that describes how
the various annotations for a corpus relate to one another. For that reason,
it does not impose any particular linguistic theory or any particular markup
structure. Instead, users define their annotations in a "metadata"
file that expresses their contents and how they relate to each other in terms
of the graph structure for the corpus annotations overall. The relationships
that can be defined in the data model draw annotations together into a set of
intersecting trees, but also allow arbitrary links between annotations over
the top of this structure, giving a representation that is highly expressive,
easier to process than arbitrary graphs and structured in a way that helps data
users. NXT's other core component is a query language designed specifically
for working with data conforming to this data model. Together, the data model
and query language allow annotations to be treated as one coherent set containing
both structural and timing information.
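
The stand-off idea described above can be illustrated with a minimal Python sketch. It is not NXT code and does not use the NXT schema: the element names (`word`, `nt`, `child`), the `href` syntax and the toy sentence are all assumptions made for illustration. It shows a syntax document pointing into a separate terminals document, so that a nonterminal's text and time span can be recovered by following the pointers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified documents. Real NXT files use their own element
# names, namespaces and href syntax; these strings are illustrative only.
TERMINALS_XML = """
<terminals>
  <word id="w1" start="0.00" end="0.21">i</word>
  <word id="w2" start="0.21" end="0.48">like</word>
  <word id="w3" start="0.48" end="0.90">jazz</word>
</terminals>
"""

SYNTAX_XML = """
<syntax>
  <nt id="s1" cat="S">
    <nt id="np1" cat="NP"><child href="terminals#w1"/></nt>
    <nt id="vp1" cat="VP">
      <child href="terminals#w2"/>
      <nt id="np2" cat="NP"><child href="terminals#w3"/></nt>
    </nt>
  </nt>
</syntax>
"""


def index_by_id(root):
    """Map every id attribute in a document to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def leaves(node, docs):
    """Yield the terminals under node, following stand-off pointers."""
    for child in node:
        if child.tag == "child":
            doc_name, _, target_id = child.get("href").partition("#")
            yield docs[doc_name][target_id]
        else:
            yield from leaves(child, docs)


docs = {"terminals": index_by_id(ET.fromstring(TERMINALS_XML))}
syntax = index_by_id(ET.fromstring(SYNTAX_XML))


def span(node_id):
    """Return (text, start, end) for a nonterminal, derived from its terminals."""
    words = list(leaves(syntax[node_id], docs))
    text = " ".join(w.text for w in words)
    return text, float(words[0].get("start")), float(words[-1].get("end"))


print(span("vp1"))
```

In this sketch the timing lives only on the terminals, and the nonterminal inherits its span from them, which is one simple way that structural and timing information can be combined.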
The data in NXT Switchboard Annotations
was converted from the Penn Treebank bracketed format, in which the Switchboard
corpus was originally distributed, using an XML-based tool for syntactic query
that comes with a ready-made Switchboard converter. Conversion was performed
using a set of XSL stylesheets to extract each of the multiple XML files associated
with one dialogue. The data was divided into separate XML files representing
the orthographic transcription, syntax, turn structure, disfluencies, and movement
(the relationship between traces and their sources). Transcription consists
of a flat list of terminals: words, punctuation, traces, and so on. Syntax starts
with a flat list of parses and works down through nonterminals, grounding in
terminals (which are in the transcription file, but are referenced by pointers
that indicate they are to be treated as if they were part of the tree itself).
Turn structure is simply a flat list of turns that themselves contain parses
as children, again via pointers into the syntax file. Yet another file couples
reparanda and repairs into disfluencies by pointing to the appropriate nonterminals
using named roles. A movement file similarly links sources with their target
traces. While this representation may seem awkward, it has advantages over the
original arrangement. First, it places the information in a single tree structure,
with co-indexing for the crossing links that are sometimes required for disfluency
and movement. Secondly, it facilitates querying the crossing structures, since
they are treated on a par with other structures within the data. Although this
ease is not particularly important for the initial, syntactic data, it is crucial
for a correct understanding of discourse phenomena such as coreference. Third,
separating the tags into their various types makes it easier to add data using
external processes (part-of-speech taggers, named entity recognizers, and the
like). Fourth, different people can change different data files at the same
time without conflict, as long as neither edits the files they point to and both
are able to lock complete paths of files pointing to the data they are revising.
Last, a data set can be loaded in whole or in part, speeding up some processing.
The NITE XML Toolkit itself treats the data seamlessly no matter whether it
is in one file or many.
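
As a rough illustration of the named-role pointers described above, the following Python sketch resolves a disfluency into its reparandum and repair. The element names, attribute names, `href` syntax and example words are hypothetical and much simpler than the real NXT files; in particular, the nonterminals here carry their text directly instead of pointing on to a transcription file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-ins for a syntax file and a disfluency file.
SYNTAX_XML = """
<syntax>
  <nt id="n1" cat="EDITED">i want</nt>
  <nt id="n2" cat="S">i need a ticket</nt>
</syntax>
"""

DISFLUENCY_XML = """
<disfluencies>
  <disfluency id="d1">
    <pointer role="reparandum" href="syntax#n1"/>
    <pointer role="repair" href="syntax#n2"/>
  </disfluency>
</disfluencies>
"""


def resolve_roles(disfluency, targets):
    """Map each role name to the nonterminal that its pointer refers to."""
    roles = {}
    for pointer in disfluency.findall("pointer"):
        _, _, target_id = pointer.get("href").partition("#")
        roles[pointer.get("role")] = targets[target_id]
    return roles


# Index the syntax document by id so pointers can be looked up directly.
nonterminals = {nt.get("id"): nt for nt in ET.fromstring(SYNTAX_XML).iter("nt")}
disfluency = ET.fromstring(DISFLUENCY_XML).find("disfluency")
roles = resolve_roles(disfluency, nonterminals)

for role, node in roles.items():
    print(role, node.get("cat"), node.text)
```

Because the disfluency file only points at nonterminals, it could be edited or regenerated in this sketch without touching the syntax document, which mirrors the separation of annotation layers described in the text.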
Licensing
This corpus is made available to
LDC not-for-profit members and all nonmembers under the
Creative Commons Attribution-Noncommercial Share Alike 3.0 license. NXT
Switchboard Annotations is available to LDC's for-profit members under the terms
of their For-Profit Membership Agreements.
Samples
For an example of the data in this corpus, please consult the Getting Started section of the provider's web site.
Content Copyright
Portions © 1992, 1993, 1997,
1999, 2009 Trustees of the University of Pennsylvania