Introduction
2007 NIST Language Recognition Evaluation Test Set consists of 66 hours
of conversational telephone speech segments in the following languages and dialects:
Arabic, Bengali, Chinese (Cantonese), Mandarin Chinese (Mainland, Taiwan), Chinese
(Min), English (American, Indian), Farsi, German, Hindustani (Hindi, Urdu),
Korean, Russian, Spanish (Caribbean, non-Caribbean), Tamil, Thai and Vietnamese.
The goal of the NIST (National Institute of Standards and Technology) Language
Recognition Evaluation (LRE) is to establish the baseline of current performance
capability for language recognition of conversational telephone speech and to
lay the groundwork for further research efforts in the field. NIST conducted
three previous language recognition evaluations, in 1996, 2003 and 2005.
The most significant differences between those evaluations and the 2007 task
were the increased number of languages and dialects, the greater emphasis on
a basic detection task for evaluation and the variety of evaluation conditions.
Thus, in 2007, given a segment of speech and a language of interest to be detected
(i.e., a target language), the task was to decide whether that target language
was in fact spoken in the given telephone speech segment (yes or no), based
on an automated analysis of the data contained in the segment. Further information
regarding this evaluation can be found in the evaluation plan, which is included in
the documentation for this release.
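The yes/no detection task described above can be sketched as follows. This is a minimal illustration and not the official NIST scoring: the `Trial` record, the score threshold and the helper names are hypothetical, and the evaluation plan remains the authority on how systems were actually scored.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One detection trial: a segment paired with a target language (hypothetical layout)."""
    segment_id: str       # illustrative segment identifier
    target_language: str  # language the system is asked to detect
    true_language: str    # language actually spoken, taken from an answer key
    score: float          # system confidence; assumed here that higher means "more likely target"


def decide(trial, threshold=0.0):
    """Return True ("yes") if the system claims the target language is spoken."""
    return trial.score >= threshold


def error_rates(trials, threshold=0.0):
    """Return (miss rate, false-alarm rate) over a list of trials."""
    misses = false_alarms = targets = nontargets = 0
    for t in trials:
        is_target = t.true_language == t.target_language
        said_yes = decide(t, threshold)
        if is_target:
            targets += 1
            misses += not said_yes        # target present, system said "no"
        else:
            nontargets += 1
            false_alarms += said_yes      # target absent, system said "yes"
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / nontargets if nontargets else 0.0
    return p_miss, p_fa
```

Each trial pairs one segment with one target language, so a single segment can appear in many trials, once per language of interest.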
The training data for LRE 2007 consists of the following:
- 2003 NIST Language Recognition Evaluation, LDC2006S31. This material consists
of: (1) approximately 46 hours of conversational telephone speech segments
in the target languages and dialects; and (2) the 1996 LRE test data (conversational
telephone speech in Arabic (Egyptian colloquial), English (General American,
Southern American), Farsi, French, German, Hindi, Japanese, Korean, Mandarin
Chinese (Mainland, Taiwan), Spanish (Caribbean, non-Caribbean), Tamil and
Vietnamese).
- 2005 NIST Language Recognition Evaluation, LDC2008S05. This release consists
of approximately 44 hours of conversational telephone speech in English (American,
Indian), Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish
(Mexican) and Tamil.
- Supplemental training data to be released by LDC in late 2009: 2007 NIST Language
Recognition Evaluation Supplemental Training Data, LDC2009S05.
Data
Each speech file in the test data is one side of a "4-wire" telephone
conversation stored in 8-bit, 8-kHz mu-law format. There are 7530 speech
files in SPHERE (.sph) format for a total of 66 hours of speech. The speech
data was compiled from LDC's CALLFRIEND, Fisher Spanish and Mixer 3 corpora
and from data collected by Oregon Health and Science University, Beaverton,
Oregon.
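To make the sample format concrete, the sketch below expands 8-bit mu-law codes to linear samples using the standard G.711 mu-law expansion and derives a duration from the payload size. It assumes the audio payload has already been separated from the SPHERE header, holds a single channel with one byte per sample, and has no additional compression; the function names are illustrative and are not part of any corpus tooling.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # 8-kHz sampling, as stated in the corpus description
BIAS = 0x84            # standard G.711 mu-law bias (132)


def mulaw_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (16-bit range)."""
    code = ~code & 0xFF              # mu-law bytes are stored bit-inverted
    sign = code & 0x80               # top bit: sign
    exponent = (code >> 4) & 0x07    # next three bits: segment number
    mantissa = code & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_payload(payload: bytes) -> List[int]:
    """Decode a raw mu-law payload (header already removed) to linear samples."""
    return [mulaw_to_linear(b) for b in payload]


def duration_seconds(payload: bytes) -> float:
    """Duration implied by the payload size: one byte per sample at 8 kHz."""
    return len(payload) / SAMPLE_RATE_HZ
```

At one byte per sample and 8000 samples per second, each second of audio occupies 8000 bytes of payload under these assumptions.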
The test segments contain three nominal durations of speech: 3 seconds, 10
seconds and 30 seconds. Actual speech durations vary, but were constrained to
be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
Non-speech portions were retained in each segment so that it contains a
continuous sample of the source recording. Therefore, the
test segments may be significantly longer than the speech duration, depending
on how much non-speech was included. Unlike previous evaluations, the nominal
duration for each test segment was not identified.
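Because the nominal duration is not identified for each test segment, a user who wants to group segments by condition has to estimate it. The sketch below maps a measured speech duration onto the three nominal classes using the ranges given above. It assumes the speech duration has already been measured (for example by a speech activity detector, which is not shown) and that the range endpoints are inclusive; the function and constant names are hypothetical.

```python
# Nominal duration (seconds) -> allowed range of actual speech duration (seconds),
# taken from the ranges stated in the corpus description.
NOMINAL_RANGES = {
    3: (2.0, 4.0),
    10: (7.0, 13.0),
    30: (23.0, 35.0),
}


def nominal_duration(speech_seconds):
    """Return the nominal class (3, 10 or 30) for a speech duration, or None if it fits none."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:  # endpoints assumed inclusive
            return nominal
    return None
```

The input should be the amount of speech, not the file length, since a segment can be considerably longer than the speech it contains.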
Samples
For an example of the data in this corpus, please listen to this audio sample.
Content Copyright
Portions © 2005 Oregon Health and Science University, © 1996, 2006,
2009 Trustees of the University of Pennsylvania