Introduction
WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT
Acoustic-Phonetic Continuous Speech Corpus (TIMIT,
LDC93S1). TIMIT contains wideband speech recordings (i.e.,
sampled at 16 kHz) of 630 speakers of American English from eight
major dialect regions, each reading ten phonetically rich
sentences. The TIMIT speech corpus was completed in 1993 and was
intended for acoustic-phonetic studies as well as for the development
and evaluation of automatic speech recognition (ASR) systems. Since
then, five TIMIT derivatives have been developed:
FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT. The FFMTIMIT
(LDC96S32) corpus (Free-Field Microphone TIMIT) consists of
the original TIMIT database as recorded by a free-field
microphone. NTIMIT (LDC93S2) (Network TIMIT) serves as a telephone
bandwidth adjunct to TIMIT, containing its speech files transmitted
over a telephone handset and the NYNEX telephone network, subject to a
large variety of channel conditions. For the cellular bandwidth
speech corpus CTIMIT (LDC96S30), the original TIMIT recordings were
passed through cellular telephone circuits. The HTIMIT
(LDC98S67) corpus (Handset TIMIT) offers a TIMIT subset of 192
male and 192 female speakers through different telephone handsets
for the study of telephone transducer effects on speech. For the
single-channel telephone corpus STC-TIMIT (LDC2008S03), the TIMIT
recordings were sent through a real
and, in contrast to NTIMIT, single telephone channel.
While some of these derivative TIMIT corpora consist of wideband
speech, others are telephony corpora representing narrowband
speech, i.e., sampled at 8 kHz and containing frequency components
from about 300 Hz to 3.4 kHz. Until now, no real-world wideband
telephony speech corpus has been publicly available. With
wideband speech codecs such as G.722, G.722.1, G.722.2
(i.e., Adaptive Multi-Rate Wideband, AMR-WB), and G.711.1,
wideband telephony speech transmission is already feasible,
including in an increasing number of mobile networks. Hence,
a wideband telephone bandwidth adjunct to TIMIT is desirable for a
wide range of scientific investigations, as well as development
and evaluation of systems, e.g., Interactive Voice Response (IVR)
systems. WTIMIT 1.0 (Wideband Mobile TIMIT) contains the
recordings of the original TIMIT speech files after transmission
over a real 3G AMR-WB mobile network.
WTIMIT 1.0 is organized according to the original TIMIT
corpus. The training subset consists of 4620 speech files, while
the test subset contains 1680 speech files. The speech format of
the WTIMIT corpus is raw (i.e., no header information) and
specified as follows:
- 16 kHz sampling rate
- 16 bit, 1-channel linear PCM sampling format
- little-endian byte order
- signed
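Because the files carry no header, the format parameters listed above have to be supplied by the reader. The following is a minimal sketch, using only the Python standard library, of how such a file might be loaded and wrapped in a WAV container. The function names and the file paths in the usage note are hypothetical and are not part of the corpus.

```python
import array
import sys
import wave

SAMPLE_RATE_HZ = 16000  # taken from the format description above


def read_raw_pcm(path):
    """Read headerless 16-bit signed little-endian mono PCM samples."""
    samples = array.array("h")  # "h" = signed 16-bit integer
    with open(path, "rb") as f:
        # Raises ValueError if the byte count is not a multiple of 2.
        samples.frombytes(f.read())
    if sys.byteorder == "big":
        samples.byteswap()  # file data is little-endian
    return samples


def write_wav(samples, path, sample_rate=SAMPLE_RATE_HZ):
    """Write samples as a mono 16-bit WAV file (WAV data is little-endian)."""
    data = array.array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(data.tobytes())
```

For example, `write_wav(read_raw_pcm("utterance.raw"), "utterance.wav")` would convert a single file, where both paths are placeholders.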
Data
Data preparation was conducted by converting the original TIMIT speech files into
raw data (i.e., dropping the first 1024 bytes of header information) and concatenating
them into 11 signal chunks of at most 30 minutes duration each. To allow
precise de-concatenation after transmission, and to make it possible to examine
codec influence and channel distortion, each signal chunk is preceded by a
4 s calibration tone, which comprises 2 s of a 1 kHz sine wave followed by
2 s of a linear sweep from 0 to 8 kHz. The prepared speech chunks were stored on
a laptop PC and then transmitted over T-Mobile's AMR-WB-capable
3G mobile network in The Hague, The Netherlands.
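The preparation steps can be illustrated with a short sketch. The tone durations, frequencies, and the 1024-byte header length come from the description above. The amplitude, the zero starting phase, and the function names are assumptions made for illustration and do not necessarily match the signal actually used.

```python
import math

FS_HZ = 16000  # TIMIT sampling rate


def strip_header(file_bytes, header_len=1024):
    """Drop the fixed-length header of a TIMIT speech file."""
    return file_bytes[header_len:]


def calibration_tone(fs=FS_HZ, amplitude=0.5):
    """Return 2 s of 1 kHz sine followed by 2 s of a linear 0-8 kHz sweep.

    Samples are floats in [-amplitude, amplitude]; the amplitude value
    is an assumption, not a documented property of the corpus.
    """
    n = 2 * fs  # samples per 2 s segment
    sine = [amplitude * math.sin(2 * math.pi * 1000.0 * i / fs)
            for i in range(n)]
    # Linear sweep: instantaneous frequency f(t) = rate * t, so the
    # phase is the integral 2*pi * rate * t^2 / 2.
    rate = 8000.0 / 2.0  # Hz per second
    sweep = [amplitude * math.sin(2 * math.pi * 0.5 * rate * (i / fs) ** 2)
             for i in range(n)]
    return sine + sweep
```

A chunk would then be the calibration tone followed by the header-stripped speech files in sequence, after converting both to the same sample representation.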
At the sending end, the speech chunks were played back by a
laptop PC. Via an IEEE 1394 link (FireWire), the data was
transmitted digitally to an external DAC (digital-to-analog
converter) of type RME Fireface 400. The analog signal was then
fed electrically into the microphone input of the transmitting
Nokia 6220 mobile phone. For this purpose, an audio quality test
cable for Nokia mobile phones was used. Prior to the actual
transmission, the output attenuation of the DAC was adjusted so
as to prevent analog saturation at the input circuit of the phone
while ensuring optimal dynamic range. Furthermore, a call to the
phone at the receiving end, a second mobile phone of type Nokia
6220, was established for each speech chunk separately. Using the
field test monitoring software of the phones, we confirmed that
they were situated in different network cells at all times during
transmission; moreover, we verified that the intended speech codec,
AMR-WB at a constant data rate of 12.65 kbit/s,
was being employed. Note that this bitrate is by far the most
widely used one. Furthermore, the internal microphone equalization
of the transmitting mobile phone was switched off.
At the receiving end, the analog headphone output of the
receiving mobile phone was connected electrically to an ADC
(analog-to-digital converter) of type RME Fireface 400. The analog
input gain of the latter device was adjusted once initially to
exploit the dynamic range of the ADC. Sampling was performed at a
rate of 48 kHz, the native sampling rate of the ADC, and with 16
bit precision. The digital speech signals were transferred to a
laptop PC again via an IEEE 1394 link and recorded onto a hard
drive. The transmitted speech chunks were decimated from 48 kHz to
16 kHz sampling rate using a high-quality lowpass filter. Finally,
they were de-concatenated by maximizing the cross-correlation
between them and the original speech files. We followed the
de-concatenation methodology of STC-TIMIT, as described
in "STC-TIMIT:
Generation of a Single-channel Telephone Corpus", in order to
ensure a precise sample alignment to the TIMIT speech
files. Hence, utterances in WTIMIT 1.0 can be considered to be
time-aligned with those of TIMIT with an average precision of
0.0625 ms (one sample at 16 kHz). In principle, TIMIT's original label files
(*.TXT, *.WRD, *.PHN) are valid for WTIMIT as well. However,
the channel was found to frequently introduce misalignments of
about 10 to 20 ms, mainly during speech pauses. Parts of the
affected speech files are therefore slightly misaligned against
the original label information. These channel effects may be
related to the packet switching domain in the UMTS Core
Network. Depending on the traffic load in the network, packets are
buffered and queued, which results in a variable packet delay
(jitter).
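The two post-processing steps described above, decimation by a factor of 3 and alignment by cross-correlation, can be sketched as follows. This is an illustrative pure-Python version under stated assumptions and not the procedure actually used for the corpus: the Hamming-windowed sinc filter, its 121 taps, the 7.5 kHz cutoff, and all function names are hypothetical choices.

```python
import math


def lowpass_taps(num_taps=121, cutoff_hz=7500.0, fs_hz=48000.0):
    """Design a Hamming-windowed sinc lowpass (odd tap count assumed)."""
    fc = cutoff_hz / fs_hz  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def decimate_by_3(signal, taps):
    """Lowpass filter and keep every third output (valid part only)."""
    n_taps = len(taps)
    out = []
    for start in range(0, len(signal) - n_taps + 1, 3):
        acc = 0.0
        for k in range(n_taps):
            acc += taps[k] * signal[start + k]
        out.append(acc)
    return out


def best_lag(received, reference, max_lag):
    """Return the offset in 0..max_lag maximizing the cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(received) - lag)
        if n <= 0:
            break
        score = sum(received[lag + i] * reference[i] for i in range(n))
        if score > best_score:
            best, best_score = lag, score
    return best
```

In this sketch, `best_lag` returns the position in the received 16 kHz signal at which an original utterance starts, so each lag step corresponds to one sample, i.e., 1/16000 s = 0.0625 ms. The brute-force search is quadratic in cost; an FFT-based correlation would be the usual choice for chunks of realistic length.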
If you have any problems, questions or suggestions concerning
WTIMIT, please send a brief email to Tim Fingscheidt (Technische
Universität Braunschweig, Braunschweig, Germany):
fingscheidt@ifn.ing.tu-bs.de.
Samples
Please examine the following samples for an example of the data in this corpus (raw audio has been converted to wav for purposes of demonstration):
Acknowledgement
The authors would like to thank Mr. Dirk Kistowski-Cames,
Deutsche Telekom AG, Bonn, Germany, for providing general project
support and SIM cards, and Mr. Petri Lang, T-Mobile NL, The Hague,
The Netherlands, for local support and SIM cards. Thanks also to
Mr. Panu Nevala, Nokia, Oulu, Finland, for providing the prepared
mobile phones, which are in that form not available on the market.
This work was funded by the German Research Foundation (DFG) under grant no. FI 1494/2-1.
Content Copyright
Portions © 2009, 2010 Tim Fingscheidt, © 1993, 2010 Trustees of the University of Pennsylvania