LDC98S69 - Speech data
LDC98T26 - Transcripts
This release of HUB5 Mandarin training data consists of 42 calls
derived from the CALLFRIEND Mandarin Chinese Mainland Dialect
(Language ID) collection. The
transcribed data is intended as additional training data in support of
the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), also sponsored by the U.S. Department of Defense. Each
transcript covers a contiguous 5-30 minute segment taken from a
recorded conversation lasting up to 30 minutes.
Speakers were solicited by the LDC to participate in this
telephone speech collection effort via the internet, publications
(advertisements) and personal contacts. A total of 200 call
originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC. Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project. The participants were made aware that their
telephone call would be recorded, as were the call recipients. The
call was allowed only if both parties agreed to being recorded. Each
caller was allowed to talk for up to 30 minutes. Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call). Each caller was allowed to
place only one telephone call. Callers were given no
guidelines concerning what they should talk about. Once a caller was
recruited to participate, he/she was given a free choice of whom to
call. Most participants called family members or close friends. All
calls originated in North America and were placed to locations
within North America.
HUB5 Mandarin speech and transcript data may be
obtained by emailing firstname.lastname@example.org.
There are no updates at this time.