


2007 NIST Language Recognition Evaluation Test Set

Item Name: 2007 NIST Language Recognition Evaluation Test Set
Authors: Alvin Martin, Audrey Le
LDC Catalog No.: LDC2009S04
ISBN: 1-58563-529-4
Release Date: Oct 20, 2009
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: u-law
Data Source(s): telephone speech
Project(s): NIST LRE
Application(s): language identification
Language(s): Arabic, Bengali, Cantonese, English, Hindi, Korean, Mandarin Chinese, Russian, Spanish, Tamil, Thai, Vietnamese
Language ID(s): arb, ben, cmn, czo, eng, hin, kor, rus, spa, tam, tha, vie
Distribution: 1 DVD
Member fee: $0 for 2009 members
Non-member Fee: US$1500.00
Reduced-License Fee: N/A
Extra-Copy Fee: US$200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Alvin Martin, Audrey Le. 2009. 2007 NIST Language Recognition Evaluation Test Set. Linguistic Data Consortium, Philadelphia.

Introduction

2007 NIST Language Recognition Evaluation Test Set consists of 66 hours of conversational telephone speech segments in the following languages and dialects: Arabic, Bengali, Chinese (Cantonese), Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American, Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian, Spanish (Caribbean, non-Caribbean), Tamil, Thai and Vietnamese.

The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. In 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.
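The yes/no detection task described above can be illustrated with a short sketch. It assumes a hypothetical system that assigns each trial (one test segment paired with one target language) a numeric score, and it turns that score into a decision by thresholding. The names `error_rates` and `threshold`, and the use of plain miss and false-alarm rates, are illustrative assumptions; the official performance measure is defined in the evaluation plan and is not reproduced here.

```python
def error_rates(trials, threshold=0.0):
    """Return (miss rate, false-alarm rate) for a list of detection trials.

    Each trial is a (score, is_target) pair: `score` is a hypothetical
    system output for one segment and one target language, and `is_target`
    says whether that language is actually spoken in the segment.
    """
    misses = false_alarms = targets = nontargets = 0
    for score, is_target in trials:
        decision = score >= threshold  # "yes" if the score clears the threshold
        if is_target:
            targets += 1
            if not decision:
                misses += 1
        else:
            nontargets += 1
            if decision:
                false_alarms += 1
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / nontargets if nontargets else 0.0
    return p_miss, p_fa


# Made-up scores for four trials, for illustration only.
example_trials = [(1.2, True), (-0.3, True), (0.4, False), (-2.0, False)]
p_miss, p_fa = error_rates(example_trials)
print(f"miss rate = {p_miss:.2f}, false-alarm rate = {p_fa:.2f}")
```

Raising the threshold trades false alarms for misses, which is why detection systems are usually compared across a range of operating points rather than at a single one.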

The training data for LRE 2007 consists of the following:

  • 2003 NIST Language Recognition Evaluation, LDC2006S31. This material consists of: (1) approximately 46 hours of conversational telephone speech segments in the target languages and dialects; and (2) the 1996 LRE test data (conversational telephone speech in Arabic (Egyptian colloquial), English (General American, Southern American), Farsi, French, German, Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Caribbean, non-Caribbean), Tamil and Vietnamese).
  • 2005 NIST Language Recognition Evaluation, LDC2008S05. This release consists of approximately 44 hours of conversational telephone speech in English (American, Indian), Hindi, Japanese, Korean, Mandarin Chinese (Mainland, Taiwan), Spanish (Mexican) and Tamil.
  • 2007 NIST Language Recognition Evaluation Supplemental Training Data, LDC2009S05. This supplemental training data is to be released by LDC in late 2009.

Data

Each speech file in the test data is one side of a "4-wire" telephone conversation, stored in 8-bit, 8-kHz mu-law format. There are 7530 speech files in SPHERE (.sph) format, for a total of 66 hours of speech. The speech data was compiled from LDC's CALLFRIEND, Fisher Spanish and Mixer 3 corpora and from data collected by Oregon Health and Science University, Beaverton, Oregon.
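As a rough illustration of the file format described above, the following sketch parses a SPHERE header and expands 8-bit mu-law bytes to 16-bit linear samples using the standard G.711 expansion. It assumes an uncompressed, single-channel mu-law file; SPHERE files can also carry compressed sample data, which this sketch does not handle. The function names and the synthetic `demo` bytes are illustrative and are not taken from the corpus or its documentation.

```python
import io


def read_sphere_header(stream):
    """Parse a NIST SPHERE header; return (fields, header_size_in_bytes)."""
    magic = stream.readline().decode("ascii").strip()
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(stream.readline().decode("ascii").strip())
    fields = {}
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("header ended before end_head")
        line = raw.decode("ascii").strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blank and comment lines
            continue
        name, kind, value = line.split(None, 2)
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return fields, header_size


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF  # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Synthetic stand-in for a .sph file: a padded 1024-byte header plus 4 samples.
demo_header = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s4 ulaw\n"
    b"sample_count -i 4\n"
    b"end_head\n"
).ljust(1024, b" ")
demo = demo_header + bytes([0xFF, 0x7F, 0x00, 0x80])

stream = io.BytesIO(demo)
fields, header_size = read_sphere_header(stream)
stream.seek(header_size)  # sample data starts right after the header
samples = [ulaw_to_linear(b) for b in stream.read()]
print(fields["sample_rate"], fields["sample_coding"], samples)
```

For a real file, the same functions would be applied to a stream opened with `open(path, "rb")`, after checking that `sample_coding` in the header is plain `ulaw`.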

The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively. Non-speech portions were retained so that each segment is a continuous sample of the source recording. The test segments may therefore be significantly longer than the speech duration, depending on how much non-speech was included. Unlike previous evaluations, the nominal duration for each test segment was not identified.
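Because the nominal duration is not identified, a user who wants to group segments by duration has to estimate the amount of speech first. The sketch below maps an estimated speech duration to one of the three nominal categories using the ranges quoted above. Treating the range endpoints as inclusive is an assumption, as are the names `NOMINAL_RANGES` and `nominal_duration`. The input must be the duration of speech only (for example, from a speech activity detector), not the length of the file.

```python
# Nominal duration in seconds -> (minimum, maximum) speech duration in seconds.
# Endpoints are treated as inclusive, which is an assumption of this sketch.
NOMINAL_RANGES = {
    3: (2.0, 4.0),
    10: (7.0, 13.0),
    30: (23.0, 35.0),
}


def nominal_duration(speech_seconds):
    """Return 3, 10 or 30 for an estimated speech duration, or None if the
    estimate falls outside all three ranges."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


for estimate in (3.5, 12.0, 23.0, 5.0):
    print(estimate, "->", nominal_duration(estimate))
```

An estimate that falls between the ranges (such as 5 seconds) most likely indicates an error in the speech estimate, so the function returns `None` rather than guessing the nearest category.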

Samples

For an example of the data in this corpus, please listen to this audio sample.

Content Copyright

Portions © 2005 Oregon Health and Science University, © 1996, 2006, 2009 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.