Introduction
The Switchboard-1 Telephone Speech Corpus (LDC97S62) was
originally collected by Texas Instruments in 1990-91, under DARPA
sponsorship. The first release of the corpus was published by
NIST and distributed by the LDC in 1992-93. Since that release, a
number of corrections have been made to the data files as
presented on the original CD-ROM set, and all copies of the
first pressing have been distributed.
Switchboard is a collection of about 2,400 two-sided telephone
conversations among 543 speakers (302 male, 241 female) from all areas
of the United States. A computer-driven robot operator system
handled the calls, giving the caller appropriate recorded prompts,
selecting and dialing another person (the callee) to take part in a
conversation, introducing a topic for discussion and recording the
speech from the two subjects into separate channels until the
conversation was finished. About 70 topics were provided, of which
about 50 were used frequently. Selection of topics and callees was
constrained so that: (1) no two speakers would converse together more
than once and (2) no one spoke more than once on a given topic.
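The two selection constraints above can be checked mechanically against a call table. The following is a minimal sketch, not the collection system's actual logic: the `(caller_id, callee_id, topic_id)` tuple layout, the function name `find_violations`, and the ID values are all assumptions made for illustration.

```python
def find_violations(calls):
    """Check a list of calls against the two selection constraints.

    `calls` is an iterable of (caller_id, callee_id, topic_id) tuples;
    this layout is hypothetical, not the corpus's table format.
    Returns (repeated_pairs, repeated_speaker_topics).
    """
    seen_pairs = set()
    seen_speaker_topics = set()
    repeated_pairs = []
    repeated_speaker_topics = []

    for caller, callee, topic in calls:
        # Constraint 1: no two speakers converse together more than once.
        # A frozenset makes the pair order-independent.
        pair = frozenset((caller, callee))
        if pair in seen_pairs:
            repeated_pairs.append((caller, callee))
        seen_pairs.add(pair)

        # Constraint 2: no one speaks more than once on a given topic.
        for speaker in (caller, callee):
            key = (speaker, topic)
            if key in seen_speaker_topics:
                repeated_speaker_topics.append(key)
            seen_speaker_topics.add(key)

    return repeated_pairs, repeated_speaker_topics


# Made-up IDs: the third call repeats a speaker pair, and the fourth
# has speaker 1003 talking about topic 302 a second time.
example_calls = [
    (1001, 1002, 301),
    (1001, 1003, 302),
    (1002, 1001, 303),
    (1003, 1004, 302),
]
pairs, speaker_topics = find_violations(example_calls)
```

With the made-up calls above, `pairs` holds the one repeated pairing and `speaker_topics` holds the one repeated speaker/topic combination; an empty result for both lists would mean the table satisfies both constraints.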
Data
In this release, assembled and published by the LDC, all known
errors affecting the original publication of speech files were
corrected. In addition, modifications have been made to the contents
of the NIST Sphere headers of all speech files, to identify each file
as being part of the new release and to make the usage of the
sample_count header field consistent with standard Sphere usage.
(In particular, the sample_count field should reflect the number of
samples on each channel in the file. In the initial release, this
field was improperly set to the total number of samples in both
channels of the file; this has been corrected in the new release.)
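The sample_count correction amounts to dividing the old value by the number of channels. The sketch below only illustrates that arithmetic; the function name and the example figure are hypothetical, and it does not read or write Sphere headers.

```python
def per_channel_sample_count(total_samples, channel_count=2):
    """Convert a total-across-channels sample count to a per-channel count.

    Switchboard speech files are two-channel, so the default is 2.
    """
    if channel_count <= 0:
        raise ValueError("channel_count must be positive")
    if total_samples % channel_count != 0:
        # Interleaved channels should always carry equal sample counts.
        raise ValueError("total_samples is not a multiple of channel_count")
    return total_samples // channel_count


# Hypothetical value as the initial release would have stored it:
# 4,800,000 samples counted across both channels.
old_header_value = 4_800_000
corrected_value = per_channel_sample_count(old_header_value)
```

Software that computes call duration from sample_count would have reported double the true length for files from the initial release, which is one practical reason to confirm which release a given file comes from.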
Since the 1997 release, the Switchboard transcripts have been
carefully revised at ISIP, and additional
problems have been discovered and patched. Three speech files,
part of the original release, were inadvertently left off the
1997 revision. After corpus users noted some problems in the
original speaker attribution table, LDC audited the problem calls
and corrected the attributions. The latest version of ISIP
transcriptions, the ISIP update of the ICSI phonetic
transcriptions, and corrected word alignments are all available at http://www.ece.msstate.edu/research/isip/projects/switchboard/.
The LDC makes the transcript summaries available via HTTP.
Researchers have used SWB-1 data for various
annotation projects including discourse annotation/speech acts,
part-of-speech tagging and parsing, up-to-date orthographic
transcriptions, and phonetic transcriptions. This summary
documents which files have been used for the various annotations. In
addition to the index of these file characteristics, there is also
a table detailing speaker attributes.
Updates
03/26/2013: Three previously missing files (sw02289.sph, sw04361.sph, sw04379.sph) were added to this release. File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpus obtained after the above date already include this update.
09/29/2011: Added a file list, available through online docs, to reflect its release on DVD. Also, an updated readme reflects these changes.
11/12/2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory http://www.ldc.upenn.edu/Catalog/docs/LDC97S62/ or as a single compressed tar file at:
ftp://ftp.ldc.upenn.edu/pub/ldc/public_data/swb1_corrected_tables.tar.gz
09/2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tagset of approximately 60 basic dialog act tags and combinations. The discourse tag-set used is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tag-set and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label-set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency-pairs as well as some more form-based models. This corpus contains labels for 1155 5-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is now available online at:
ftp://ftp.ldc.upenn.edu/pub/ldc/public_data/swb1_dialogact_annot.tar.gz
Content Copyright
Portions © 1992, 1993, 1997 Trustees of the University of Pennsylvania