LDC93S3A - Complete Resource Management Corpus 2.0
LDC93S3B - Part 1 (RM1) of the Resource Management Corpus 2.0
LDC93S3C - Part 2 (RM2) of the Resource Management Corpus 2.0
The DARPA Resource Management Continuous Speech Corpora (RM) consist
of digitized and transcribed speech for use in designing and
evaluating continuous speech recognition systems. There are two main
sections, often referred to as RM1 and RM2. RM1 contains three
sections: Speaker-Dependent (SD) training data, Speaker-Independent
(SI) training data, and test and evaluation data. RM2 adds a further,
larger SD data set, including test material.
All RM material consists of read sentences modeled after a naval
resource management task. The complete corpus contains over 25,000
utterances from more than 160 speakers representing a variety of
American dialects. The material was recorded at 16 kHz, with 16-bit
resolution, using a Sennheiser HMD-414 headset microphone. All discs
conform to the ISO-9660 data format.
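
As a rough guide to storage requirements, the recording parameters above imply a fixed raw data rate. The sketch below assumes single-channel, uncompressed samples (neither is stated here), and the helper name raw_size_bytes is hypothetical.

```python
# Recording parameters stated in the corpus description.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16

# Assumption: one channel of uncompressed audio (not stated in the description).
CHANNELS = 1

bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS


def raw_size_bytes(duration_seconds: float) -> int:
    """Estimate the raw sample payload for an utterance of the given length."""
    return round(duration_seconds * bytes_per_second)


print(bytes_per_second)     # 32000
print(raw_size_bytes(3.0))  # 96000
```

Under these assumptions a three-second utterance takes about 96 kB of sample data, excluding any file header.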
Resource Management SD and SI Training and Test Data (RM1)
The Speaker-Dependent (SD) Training Data contains 12
subjects, each reading a set of 600 "training sentences," two "dialect"
sentences and ten "rapid adaptation" sentences, for a total of 7,344
recorded sentence utterances. The 600 sentences designated as
training cover 97% of the lexical items in the corpus.
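
The SD utterance total follows directly from the per-speaker counts given above. A minimal check, using only those figures (the variable names are illustrative):

```python
# Counts taken from the SD training data description.
SD_SPEAKERS = 12
TRAINING_SENTENCES = 600
DIALECT_SENTENCES = 2
RAPID_ADAPTATION_SENTENCES = 10

# Each subject reads every sentence in all three sets.
sentences_per_speaker = (
    TRAINING_SENTENCES + DIALECT_SENTENCES + RAPID_ADAPTATION_SENTENCES
)
sd_total_utterances = SD_SPEAKERS * sentences_per_speaker

print(sentences_per_speaker)  # 612
print(sd_total_utterances)    # 7344
```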
The Speaker-Independent (SI) Training Data contains
80 speakers, each reading two "dialect" sentences plus 40 sentences from
the Resource Management text corpus, for a total of 3,360 recorded
sentence utterances. Any given sentence from a set of 1,600 Resource
Management sentence texts was recorded by two subjects, while no
sentence was read twice by the same subject.
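
The SI figures can be cross-checked two ways: by speaker and by sentence text. The sketch below uses only the counts stated above; the variable names are illustrative.

```python
# Counts taken from the SI training data description.
SI_SPEAKERS = 80
DIALECT_SENTENCES = 2
RM_SENTENCES_PER_SPEAKER = 40
RM_SENTENCE_TEXTS = 1_600
READERS_PER_TEXT = 2

# Total utterances: every speaker reads the dialect and the RM sentences.
si_total_utterances = SI_SPEAKERS * (DIALECT_SENTENCES + RM_SENTENCES_PER_SPEAKER)

# RM-sentence readings counted per speaker and per text must agree.
rm_readings_by_speaker = SI_SPEAKERS * RM_SENTENCES_PER_SPEAKER
rm_readings_by_text = RM_SENTENCE_TEXTS * READERS_PER_TEXT

print(si_total_utterances)     # 3360
print(rm_readings_by_speaker)  # 3200
print(rm_readings_by_text)     # 3200
```

The two reading counts match, which is consistent with each of the 1,600 texts being read by exactly two of the 80 speakers.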
RM1 contains all SD and SI system test material used in
five DARPA benchmark tests conducted in March and October of 1987, June
1988, and February and October 1989, along with scoring and diagnostic
software and documentation for those tests. Documentation is also
provided outlining use of the Resource Management training and test
material at CMU in development of the SPHINX system. Example output
and scored results for state-of-the-art speaker-dependent and
speaker-independent systems (i.e. the BBN BYBLOS and CMU SPHINX
systems) for the October 1989 benchmark tests are included.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This set forms a speaker-dependent extension to the Resource
Management (RM1) corpus. The corpus consists of a total of 10,508
sentence utterances (two male and two female speakers each speaking 2,652
sentence texts). These include the 600 "standard" Resource Management
speaker-dependent training sentences, two dialect calibration sentences,
ten rapid adaptation sentences, 1,800 newly-generated extended training
sentences, 120 newly-generated development-test sentences and 120
newly-generated evaluation-test sentences. The evaluation-test
material on this disc was used as the test set for the June 1990 DARPA
SLS Resource Management Benchmark Tests (see the Proceedings).
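
The per-speaker count of 2,652 sentence texts can be rebuilt from the six sets listed above. Note that four speakers times 2,652 texts gives 10,608, which is 100 more than the stated total of 10,508; the description does not say whether this reflects a typographical error or utterances missing from the release. A minimal check, with illustrative names:

```python
# Sentence sets read by each RM2 speaker, as listed in the description.
rm2_sentence_texts = {
    "standard SD training": 600,
    "dialect calibration": 2,
    "rapid adaptation": 10,
    "extended training": 1_800,
    "development test": 120,
    "evaluation test": 120,
}
RM2_SPEAKERS = 4       # two male and two female speakers
STATED_TOTAL = 10_508  # total utterance count given in the description

texts_per_speaker = sum(rm2_sentence_texts.values())
computed_total = RM2_SPEAKERS * texts_per_speaker
difference = computed_total - STATED_TOTAL

print(texts_per_speaker)  # 2652
print(computed_total)     # 10608
print(difference)         # 100
```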
The RM2 corpus was recorded at Texas Instruments. The NIST speech
recognition scoring software originally distributed on the RM1 "Test"
Disc was adapted for RM2 sentences and is included in this publication.