Introduction
The Switchboard-2 Phase III Audio corpus was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2002S06, ISBN 1-58563-222-8. This
release contains speech data files ONLY, along with documentation describing
speaker information (sex, age, education, city and state where raised), call
information (date, time, call duration, Personal Identification Numbers,
topic), and audit information (channel quality, background noise). The data
files are not compressed.
The Switchboard-2 Phase III collection focused primarily on the American
South. The collection commenced on October 21, 1997 and was completed on
January 1, 1998. The project's goal was to recruit native speakers of English
in the American South, balanced by gender, each of whom would take part in ten
or more five- to six-minute conversations on a variety of telephone (land
line) handsets.
Data
The speech data was collected for research, development, and evaluation of
automatic systems for speech-to-text conversion, talker identification,
language identification, and speech signal detection.
During the collection period, the LDC collected a total of 2,728 calls, or
5,456 sides, from 640 participants (292 Male, 348 Female), under varied
environmental conditions.
Each speech file consists of a 1,024-byte ASCII-formatted Sphere header,
followed by two-channel interleaved mu-law sample data. The mu-law samples
represent the actual digital data transmission from the telephone service
provider (MCI), as captured separately for each side of the telephone
conversation by the LDC's telephone collection platform. The header also
indicates the caller_pin, callee_pin, and topic_id.
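The file layout described above can be sketched in Python. This is a minimal illustration, not an official LDC tool. It assumes the usual NIST SPHERE header conventions (a "NIST_1A" first line, a header-size line, then "name -type value" lines ending with "end_head") and assumes that the interleaved data alternates one 8-bit mu-law sample per channel. The function names are hypothetical, and the description does not say which channel carries the caller and which the callee.

```python
HEADER_SIZE = 1024  # header length stated in the corpus description


def parse_sphere_header(raw):
    """Parse 'name -type value' lines from an ASCII SPHERE header."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    fields = {}
    # Skip the magic line and the header-size line.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # blank or malformed line
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            # String fields ("-sN") are kept as text.
            fields[name] = value
    return fields


def split_channels(samples):
    """Split interleaved two-channel 8-bit samples into two byte strings."""
    return samples[0::2], samples[1::2]


def read_speech_file(path):
    """Return (header fields, first channel, second channel) for one file."""
    with open(path, "rb") as f:
        header = parse_sphere_header(f.read(HEADER_SIZE))
        first, second = split_channels(f.read())
    return header, first, second
```

Under these assumptions, the header fields named above (caller_pin, callee_pin, topic_id) would be available as keys of the returned dictionary, and each channel holds one side of the conversation as raw mu-law bytes that still need decoding before analysis.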
The speech files are named according to the following pattern:
sw_NNNNN.sph
where the five-digit string "NNNNN" represents the conversation-id; this
string is used to identify all speech files and to identify the calls in the
associated database tables that provide information about the calls and
participants (i.e. callstat.tbl, master.tbl).
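As a small illustration of the naming pattern, the conversation-id can be pulled out of a file name with a regular expression. The helper name below is hypothetical. The id is returned as a string so that any leading zeros survive for lookups in the tables.

```python
import re

# Matches names of the form sw_NNNNN.sph, where NNNNN is five digits.
_SPEECH_FILE = re.compile(r"sw_(\d{5})\.sph")


def conversation_id(filename):
    """Return the five-digit conversation-id, or None if the name does not match."""
    match = _SPEECH_FILE.fullmatch(filename)
    return match.group(1) if match else None
```

For example, conversation_id("sw_01234.sph") gives "01234", while a name with four digits or a different extension gives None.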
Other documentation files included in the publication are:
| 0readme.1st | Field information for all database tables |
| swb_callaudit.tbl | Audit results for each channel |
| swb_callaudit.txt | Document describing audit table |
| swb_callstats.tbl | Information about recorded calls |
| swb_callstats.txt | Document describing callstats table |
| swb_callsubjects.tbl | Demographic information |
| swb_callsubjects.txt | Document describing callsubjects table |
| topics.txt | List of proposed call topics |
There are a total of 2,657 data files (approximately 222 hours of audio).
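As a rough consistency check on these totals, the average duration per file can be computed from the two figures given. This is only a sketch using the rounded numbers quoted in this description, and the function name is hypothetical.

```python
DATA_FILES = 2657   # number of data files stated above
TOTAL_HOURS = 222   # approximate hours of audio stated above


def average_minutes_per_file(total_hours, n_files):
    """Average duration of one file in minutes."""
    return total_hours * 60 / n_files


avg = average_minutes_per_file(TOTAL_HOURS, DATA_FILES)
print(f"{avg:.2f} minutes per file")
```

The result comes out at roughly five minutes per file, which is in line with the five- to six-minute conversation length targeted by the collection.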
Updates
No updates are available at this time.
Content Copyright
Portions © 1997-2002 Trustees of the University of Pennsylvania