2007 NIST Language Recognition Evaluation Supplemental Training Set
| Item Name: | 2007 NIST Language Recognition Evaluation Supplemental Training Set |
| Authors: | Alvin Martin, Audrey Le, Dave Graff, Jan van Santen |
| LDC Catalog No.: | LDC2009S05 |
| ISBN: | 1-58563-530-8 |
| Release Date: | Nov 20, 2009 |
| Data Type: | speech |
| Sample Rate: | 8000 Hz |
| Sampling Format: | 8-bit u-law |
| Data Source(s): | telephone speech |
| Project(s): | NIST LRE |
| Application(s): | language identification |
| Language(s): | Bengali, Cantonese, Egyptian Arabic, Min Nan Chinese, Russian, Spanish, Taiwan Mandarin, Tamil, Thai, Urdu, Wu Chinese |
| Language ID(s): | arz, ben, cmn, nan, rus, spa, tam, tha, urd, wuu, yue |
| Distribution: | 1 DVD, Web Download |
| Member fee: | $0 for 2009 members |
| Non-member Fee: | US $1500.00 |
| Reduced-License Fee: | US $750.00 |
| Extra-Copy Fee: | US $200.00 |
| Non-member License: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Alvin Martin, et al. 2009. 2007 NIST Language Recognition Evaluation Supplemental Training Set. Linguistic Data Consortium, Philadelphia. |
Introduction
2007 NIST Language Recognition Evaluation Supplemental Training Set
consists of 118 hours of conversational telephone speech segments in the following
languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese,
Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu
and Tamil.
The goal of the NIST (National Institute
of Standards and Technology) Language
Recognition Evaluation (LRE) is to establish the baseline of current performance
capability for language recognition of conversational telephone speech and to
lay the groundwork for further research efforts in the field. NIST conducted
three previous language recognition evaluations, in 1996,
2003 and 2005.
The most significant differences between those evaluations and the 2007 task
were the increased number of languages and dialects, the greater emphasis on
a basic detection task for evaluation and the variety of evaluation conditions.
Thus, in 2007, given a segment of speech and a language of interest to be detected
(i.e., a target language), the task was to decide whether that target language
was in fact spoken in the given telephone speech segment (yes or no), based
on an automated analysis of the data contained in the segment.
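The detection task described above can be sketched as one independent yes/no decision per target language. This is a minimal illustration only: the score values, the `detect` helper and the zero threshold are hypothetical and are not taken from the NIST evaluation plan or from any particular recognition system.

```python
# Hypothetical per-language scores that some recognizer produced for ONE
# speech segment (higher means "more likely spoken"). Values are made up.
segment_scores = {"rus": 1.7, "spa": -0.4, "tha": -2.1}


def detect(scores, target, threshold=0.0):
    """Return True ("yes") if the target language is judged to be spoken.

    The threshold is an assumed tuning parameter, not an official value.
    """
    return scores[target] >= threshold


# Each target language is a separate trial on the same segment.
decisions = {lang: detect(segment_scores, lang) for lang in segment_scores}

for lang, decision in decisions.items():
    print(lang, "yes" if decision else "no")
```

Treating each target as its own trial mirrors the framing of the task: the system is not asked to pick the single best language, only to accept or reject the one language of interest.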
The supplemental training material in this release consists of the following:
- Approximately 53 hours of conversational telephone speech segments in Arabic
(Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian,
Thai and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND and
Mixer collections.
- Approximately 65 hours of full telephone conversations in Mandarin Chinese
(Taiwan), Spanish (Mexican) and Tamil. This material was collected by Oregon
Health and Science University (OHSU), Beaverton, Oregon. The test segments
used in the 2005
NIST Language Recognition Evaluation were derived from these full conversations.
In addition to the supplemental material contained in this release, the training
data for the 2007 NIST Language Recognition Evaluation consisted of the test sets
from previous evaluations, namely the 2003 NIST Language Recognition Evaluation
and the 2005 NIST Language Recognition Evaluation.
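The catalog entry lists the audio as 8-bit u-law sampled at 8000 Hz. As a rough illustration of what that encoding implies, the sketch below expands u-law bytes to 16-bit linear samples using the standard G.711 expansion and computes a duration from the byte count. It assumes raw, headerless, single-channel u-law bytes; the file container and channel layout used in the release are not described here, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # from the catalog entry
ULAW_BIAS = 0x84       # bias constant used by G.711 u-law


def ulaw_to_linear(byte):
    """Expand one 8-bit u-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # u-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07       # 3-bit segment number
    mantissa = u & 0x0F              # 4-bit step within the segment
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(raw):
    """Decode a bytes object of headerless u-law audio (an assumption)."""
    return [ulaw_to_linear(b) for b in raw]


def duration_seconds(raw, channels=1):
    """One byte per sample per channel at 8000 samples per second."""
    return len(raw) / (SAMPLE_RATE_HZ * channels)


# Illustrative input: one second of u-law "silence" (code 0xFF decodes to 0).
example = bytes([0xFF]) * SAMPLE_RATE_HZ
samples = decode_ulaw(example)
print(len(samples), duration_seconds(example))
```

At one byte per sample, a single channel of this audio occupies 8000 bytes per second, which is a convenient sanity check when estimating storage for a corpus of this size.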
Samples
For an example of the data in this corpus, please listen to this sample of the Egyptian Arabic data.
Content Copyright
Portions © 2005 Oregon Health and Science University, © 1996, 2006,
2009 Trustees of the University of Pennsylvania