The DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program is
developing robust speech recognition technology to address a range of languages
and speaking styles and "produce powerful new speech-to-text (automatic transcription)
technology whose outputs are substantially richer and much more accurate than
currently possible." LDC provides conversational and broadcast speech,
transcripts, annotations, pronouncing lexicons, and language modeling texts
in each of the EARS languages.
Careful Transcription
LDC creates high-accuracy verbatim transcripts with rich markup for use as
Speech-to-Text devtest and evaluation data. The following documents
provide additional information about careful transcription for STT research:
Metadata Extraction
LDC is creating annotated resources to support a metadata extraction (MDE)
research evaluation. The goal of MDE is to enable technology that can take the raw
STT output and refine it into forms that are of more use to humans and to
downstream automatic processes. The following documents provide additional
information about annotation in support of MDE research:
The table below summarizes the English, Chinese, and Arabic resources relevant to EARS, categorized by language and resource type.
| Language | Resource type | Resource |
|---|---|---|
| English | Broadcast Speech | 1997 HUB4 English Evaluation Speech and Transcripts - 3 hours of radio and TV news stories |
| English | Careful Transcriptions | 1996 English Broadcast News Transcripts (Hub-4) - 104 hours of transcribed broadcasts |
| English | Quick Transcriptions | TDT Pilot Study Corpus - 16,000 news stories; TDT4 Multilanguage Text Version 1.1: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21) |
| English | Metadata Annotations | EARS-only training data releases (contact LDC) |
| English | Conversational Speech | SWITCHBOARD-1 Release 2 - 2,400 two-sided telephone conversations |
| English | Careful Transcriptions | CALLHOME American English Transcripts - 18.3 hours of transcribed speech |
| English | Pronouncing Lexicon | CALLHOME American English Lexicon - 90,988 lexical entries |
| English | Language Modeling Text | North American News Text Corpus - texts formatted with TIPSTER |
| Chinese | Broadcast Speech | 1997 HUB-4 Broadcast News Evaluation Non English Test - 1 hour each of Spanish and Mandarin news broadcasts |
| Chinese | Careful Transcriptions | 1997 Mandarin Broadcast News Transcripts - transcripts for 30 hours of news broadcasts |
| Chinese | Quick Transcriptions | TDT2 Multilanguage Text Version 4.0 - transcripts for recorded Mandarin news broadcasts |
| Chinese | Conversational Speech | CALLFRIEND Mandarin Chinese-Mainland - 60 Mandarin telephone conversations of 5-30 minutes each |
| Chinese | Careful Transcriptions | CALLHOME Mandarin Chinese Transcripts - 120 transcribed Mandarin telephone conversations |
| Chinese | Pronouncing Lexicon | CALLHOME Mandarin Chinese Lexicon - 44,405 words with phonological, morphological and frequency information |
| Chinese | Language Modeling Text | Mandarin Chinese News Text - 250 million GB-encoded text characters |
| Arabic | Arabic Dialects Transcription Guidelines | Levantine Arabic Transcription Guidelines - (under development) |
| Arabic | Broadcast Speech | EARS-only training data releases (contact LDC) |
| Arabic | Careful Transcriptions | Callhome Egyptian Arabic Transcripts Supplement - transcripts for CallHome Egyptian Arabic Speech Supplement |
| Arabic | Quick Transcriptions | TDT 3 Arabic Text: contact LDC for a pre-release copy (LDC catalog no.: LDC2002E32) |
| Arabic | Conversational Speech | CALLHOME Egyptian Arabic Speech - 120 Egyptian Colloquial Arabic telephone conversations |
| Arabic | Careful Transcriptions | CALLHOME Egyptian Arabic Transcripts - transcripts for 120 Egyptian Colloquial Arabic telephone conversations |
| Arabic | Pronouncing Lexicon | Egyptian Colloquial Arabic Lexicon - electronic pronunciation dictionary of Egyptian Colloquial Arabic |
| Arabic | Language Modeling Text | Arabic Newswire Part 1 - 2,337 Arabic text data files tagged using TIPSTER |
(c) 1996-1999 Linguistic Data Consortium,
University of Pennsylvania. All Rights
Reserved.
Please send technical questions to online-service@ldc.upenn.edu,
Member sales questions to ldc@ldc.upenn.edu.
Last modified: Wed Jun 26 09:44:41 2002