LDC Publications Relevant to GALE
Approved GALE sites may request copies of any corpora listed below. Your site's designated data contact person should email LDC's membership group at ldc@ldc.upenn.edu, requesting data by Catalog ID and Title.
In addition to resources created specifically for the program, GALE sites are eligible to receive general LDC publications as well as selected e-corpora created for other sponsored programs. General LDC publications are also available to all LDC members, and many can be obtained by non-members through individual corpus licenses. Click on a Catalog ID for more information about general release corpora. E-corpora are available only to GALE sites and to members of specified user communities. Most e-corpora will be released as general LDC publications in the future.
(updated 1.27.2009)
Corpora Created for the GALE Program
General LDC Releases Relevant to GALE
Catalog ID | Title | Status |
LDC2002T01 | Multiple-Translation Chinese Corpus | LDC Publication |
LDC2002S06 | Switchboard-2 Phase III Audio | LDC Publication |
LDC2002S10 | 1998 HUB5 English Evaluation | LDC Publication |
LDC2002S13 | 2001 HUB5 English Evaluation | LDC Publication |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation | LDC Publication |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | LDC Publication |
LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | LDC Publication |
LDC2002S22 | 1997 HUB5 Arabic Evaluation | LDC Publication |
LDC99L22 | Egyptian Colloquial Arabic Lexicon | LDC Publication |
LDC2002T39 | 1997 HUB5 Arabic Transcripts | LDC Publication |
LDC2002S37 | Callhome Egyptian Arabic Speech Supplement | LDC Publication |
LDC2002T38 | Callhome Egyptian Arabic Transcripts Supplement | LDC Publication |
LDC2002S24 | 1997 HUB5 German Evaluation | LDC Publication |
LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | LDC Publication |
LDC2002S25 | 1997 HUB5 Spanish Evaluation | LDC Publication |
LDC2003T05 | English Gigaword | LDC Publication |
LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | LDC Publication |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts | LDC Publication |
LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | LDC Publication |
LDC2003T03 | 1997 HUB5 German Transcripts | LDC Publication |
LDC2003T04 | 1997 HUB5 Spanish Transcripts | LDC Publication |
LDC2003T02 | 1998 HUB5 English Transcripts | LDC Publication |
LDC2003T09 | Chinese Gigaword | LDC Publication |
LDC2003T12 | Arabic Gigaword | LDC Publication |
LDC2003T11 | ACE-2 Version 1.0 | LDC Publication |
LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | LDC Publication |
LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | LDC Publication |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | LDC Publication |
LDC2004T05 | Chinese Treebank Version 4.0 | LDC Publication |
LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | LDC Publication |
LDC2004S08 | MDE RT-03 Training Data Speech | LDC Publication |
LDC2004T12 | MDE RT-03 Training Data Text and Annotations | LDC Publication |
LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | LDC Publication |
LDC2004T08 | Hong Kong Parallel Text | LDC Publication |
LDC2004T17 | Arabic News Translation Text Part 1 | LDC Publication |
LDC2004S07 | Switchboard Cellular Part 2 Audio | LDC Publication |
LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | LDC Publication |
LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | LDC Publication |
LDC2004S13 | Fisher English Training Speech Part 1 Speech | LDC Publication |
LDC2004T19 | Fisher English Training Speech Part 1, Transcripts | LDC Publication |
LDC2005T01 | Chinese Treebank 5.0 | LDC Publication |
LDC2005S07 | Levantine Arabic QT Training Data Set 3 Speech | LDC Publication |
LDC2005T03 | Levantine Arabic QT Training Data Set 3 Transcripts | LDC Publication |
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data V1.0 | LDC Publication |
LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | LDC Publication |
LDC2005T09 | ACE 2004 Multilingual Training Corpus | LDC Publication |
LDC2005T06 | Chinese News Translation Text Part 1 | LDC Publication |
LDC2005S13 | Fisher English Training Part 2, Speech | LDC Publication |
LDC2005T19 | Fisher English Training Part 2, Transcripts | LDC Publication |
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | LDC Publication |
LDC2005T16 | TDT4 Multilingual Text and Annotations | LDC Publication |
LDC2005T10 | Chinese English News Magazine Parallel Text | LDC Publication |
LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | LDC Publication |
LDC2005T12 | English Gigaword Second Edition | LDC Publication |
LDC2005T14 | Chinese Gigaword Second Edition | LDC Publication |
LDC2005T23 | Chinese Proposition Bank 1.0 | LDC Publication |
LDC97S66 | 1996 English Broadcast News Dev and Eval (Hub-4) | LDC Publication |
LDC97S44 | 1996 English Broadcast News Speech (Hub-4) | LDC Publication |
LDC97T22 | 1996 English Broadcast News Transcripts (Hub-4) | LDC Publication |
LDC98S71 | 1997 English Broadcast News Speech (Hub-4) | LDC Publication |
LDC98T28 | 1997 English Broadcast News Transcripts (Hub-4) | LDC Publication |
LDC2001S91 | 1997 HUB-4 Broadcast News Evaluation Non English Test Material | LDC Publication |
LDC98S73 | 1997 Mandarin Broadcast News Speech (Hub-4NE) | LDC Publication |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (Hub-4NE) | LDC Publication |
LDC93T1 | ACL/DCI | LDC Publication |
LDC99L23 | American English Spoken Lexicon | LDC Publication |
LDC2001T55 | Arabic Newswire Part 1 | LDC Publication |
LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | LDC Publication |
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | LDC Publication |
LDC96S47 | CALLFRIEND American English-Southern Dialect | LDC Publication |
LDC96S49 | CALLFRIEND Egyptian Arabic | LDC Publication |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | LDC Publication |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | LDC Publication |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | LDC Publication |
LDC97S42 | CALLHOME American English Speech | LDC Publication |
LDC97T14 | CALLHOME American English Transcripts | LDC Publication |
LDC97S45 | CALLHOME Egyptian Arabic Speech | LDC Publication |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | LDC Publication |
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | LDC Publication |
LDC96S34 | CALLHOME Mandarin Chinese Speech | LDC Publication |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | LDC Publication |
LDC96L14 | CELEX2 | LDC Publication |
LDC2001T11 | Chinese Treebank Version 2.0 | LDC Publication |
LDC95T11 | European Language Newspaper Text | LDC Publication |
LDC2000T50 | Hong Kong Hansards Parallel Text | LDC Publication |
LDC2000T47 | Hong Kong Laws Parallel Text | LDC Publication |
LDC2000T46 | Hong Kong News Parallel Text | LDC Publication |
LDC98S69 | Hub-5 Mandarin Telephone Speech Corpus | LDC Publication |
LDC98T26 | Hub-5 Mandarin Transcripts | LDC Publication |
LDC95T13 | Mandarin Chinese News Text | LDC Publication |
LDC2001T02 | Message Understanding Conference (MUC) 7 | LDC Publication |
LDC95T21 | North American News Text Corpus | LDC Publication |
LDC98T30 | North American News Text Supplement | LDC Publication |
LDC97S62 | SWITCHBOARD-1 Release 2 | LDC Publication |
LDC2001S13 | Switchboard Cellular Part 1 Audio | LDC Publication |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | LDC Publication |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | LDC Publication |
LDC98S75 | Switchboard-2 Phase I | LDC Publication |
LDC99S79 | Switchboard-2 Phase II | LDC Publication |
LDC98T25 | TDT Pilot Study Corpus | LDC Publication |
LDC2000S92 | TDT2 Careful Transcription Audio | LDC Publication |
LDC2000T44 | TDT2 Careful Transcription Text | LDC Publication |
LDC99S84 | TDT2 English Audio | LDC Publication |
LDC99T37 | TDT2 English Text, Version 2 | LDC Publication |
LDC2001S93 | TDT2 Mandarin Audio Corpus | LDC Publication |
LDC99T38 | TDT2 Mandarin Text | LDC Publication |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | LDC Publication |
LDC2001S94 | TDT3 English Audio | LDC Publication |
LDC2001S95 | TDT3 Mandarin Audio | LDC Publication |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | LDC Publication |
LDC93T3A | TIPSTER Complete | LDC Publication |
LDC2000T52 | TREC Mandarin | LDC Publication |
LDC2000T51 | TREC Spanish | LDC Publication |
LDC98S72 | Taiwanese Putonghua Speech and Transcripts | LDC Publication |
LDC99T42 | Treebank-3 | LDC Publication |
LDC94T4B-1 | UN Parallel Text (English) | LDC Publication |
LDC94T4B-3 | UN Parallel Text (Spanish) | LDC Publication |
LDC2004S10 | Santa Barbara Corpus of Spoken American English III | LDC Publication |
LDC2004T14 | Proposition Bank I | LDC Publication |
LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | LDC Publication |
LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | LDC Publication |
LDC2005S16 | MDE RT04 Training Data Speech | LDC Publication |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part-IV | LDC Publication |
LDC2005T01U01 | Chinese Treebank 5.1 | LDC Publication |
LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocal. + syntactic analysis) | LDC Publication |
LDC2005T08 | Discourse Graphbank | LDC Publication |
LDC2005T13 | CCGbank | LDC Publication |
LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis) | LDC Publication |
LDC2005T24 | MDE RT-04 Training Data Text/Annotations | LDC Publication |
LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | LDC Publication |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | LDC Publication |
E-Corpora Made for Other Programs Available to GALE
Catalog ID | Title | Status |
LDC2005E18 | ACE 2005 Multilingual Training Data V4.0 | ACE E-corpus |
LDC2005E25 | ACE 2005 Consistency Analysis - Training Data | ACE E-corpus |
LDC2004E39 | ACE 2004 English Consistency Analysis Data | ACE E-corpus |
LDC2004E18 | RT-04F STT Multilingual Speech Development Data V1.1 Re-release | EARS E-corpus |
LDC2004E16 | RT-04 MDE DevTest Set #1 Version 1.2 | EARS E-corpus |
LDC2004E10 | RT-04F STT Multilingual Speech Development Data - Supplement | EARS E-corpus |
LDC2004E19 | RT-04F STT Multilingual Transcripts Development Data V1.2 | EARS E-corpus |
LDC2004E24 | RT-04 MDE Annotation Consistency Study | EARS E-corpus |
LDC2004E29 | RT-04 MDE DevTest Set #2 V1.2 | EARS E-corpus |
LDC2004E33 | EARS MDE Diarization Scoring Package | EARS E-corpus |
LDC2004E65 | Levantine Arabic QT Training Data Set 2 Speech | EARS E-corpus |
LDC2004E66 | Levantine Arabic QT Training Data Set 2 Transcripts V1.2 | EARS E-corpus |
LDC2004E67 | RT-04F STT Chinese CTS Development Data Speech | EARS E-corpus |
LDC2004E68 | RT-04F STT Chinese CTS Development Data Transcripts | EARS E-corpus |
LDC2004E47 | RT-04 MDE Non-English Pilot Corpus V1.0 | EARS E-corpus |
LDC2004E28 | RT-04 STT Transcription Consistency Study | EARS E-corpus |
LDC2005E26 | Switchboard-1 Quick Transcripts from BBN/WordWave | EARS E-corpus |
LDC2005E73 | EARS RT04 Evaluation Transcripts and MDE Annotations | EARS E-corpus |
LDC2005E74 | EARS RT04 Evaluation Audio | EARS E-corpus |
LDC2002E14 | Chinese English Translation Lexicon Version 3-beta | TIDES E-corpus |
LDC2002E17 | English Translation of Chinese Treebank Version 1 beta | TIDES E-corpus |
LDC2002E19 | Hong Kong Hansard Parallel Text Version 2 beta | TIDES E-corpus |
LDC2002E16 | Hong Kong News Parallel Text Version 2 beta | TIDES E-corpus |
LDC2002E15 | UN Arabic English Parallel Text Version 1 beta | TIDES E-corpus |
LDC2002E18 | Xinhua Chinese English Parallel News Text Version 1 beta | TIDES E-corpus |
LDC2002E50 | Name-Annotated TDT Corpus Supplement for ACE | TIDES E-corpus |
LDC2002E54 | Multiple-Translation Arabic Corpus | TIDES E-corpus |
LDC2002E53 | Multiple-Translation Chinese Corpus 2.0 | TIDES E-corpus |
LDC2002E58 | Sinorama Chinese English Parallel Text | TIDES E-corpus |
LDC2003E01 | Chinese <-> English Name Entity Lists Version 1.0 beta | TIDES E-corpus |
LDC2003E05 | Arabic Translation Corpus Part 1 | TIDES E-corpus |
LDC2003E06 | Chinese Treebank 3.0 | TIDES E-corpus |
LDC2003E07 | Chinese Treebank English Parallel Corpus | TIDES E-corpus |
LDC2003E09 | Arabic News Translation Corpus Part 2 | TIDES E-corpus |
LDC2003E08 | Chinese News Translation Corpus Part 1 | TIDES E-corpus |
LDC2003E15 | HARD GovDocs | TIDES E-corpus |
LDC2003E25 | Hong Kong News Parallel Text | TIDES E-corpus |
LDC2004E07 | Arabic News Translation Corpus Part 3 | TIDES E-corpus |
LDC2004E08 | Arabic English Parallel News Text Part 1 | TIDES E-corpus |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | TIDES E-corpus |
LDC2004E09 | Hong Kong Hansard Parallel Text | TIDES E-corpus |
LDC2004E11 | Arabic News Translation Corpus Part 4 | TIDES E-corpus |
LDC2004E13 | UN Arabic English Parallel Text | TIDES E-corpus |
LDC2004E12 | UN Chinese English Parallel Text | TIDES E-corpus |
LDC2004E38 | ACE 2003 Evaluation Data (for 2004 DevTest) | TIDES E-corpus |
LDC2005E47 | Chinese English News Magazine Parallel Text | TIDES E-corpus |
LDC2005E11 | Arabic Treebank: Part 3 v.1.0 (POS + Syntactic Analysis of total corpus) | TIDES E-corpus |
LDC2005E12 | 2005 MSE Arabic-English Clusters V1.2 | TIDES E-corpus |
LDC2005E14 | MSE 2005 Sample Summary Topic | TIDES E-corpus |
LDC2005E13 | 2005 MSE Arabic-English Summaries V1.2 | TIDES E-corpus |
LDC2004E71 | ATB Part 3 (a) v.1.1 | TIDES E-corpus |
LDC2005E46 | Arabic Treebank English Translation | TIDES E-corpus |
LDC2002E27 | Chinese English Translation Dictionary v3.0 | TIDES E-corpus |
LDC2004E46 | DUC 2004 Arabic-English Summaries | TIDES E-corpus |
LDC2004E42 | HARD 2004 Reference Annotations | TIDES E-corpus |
LDC2004E41 | TDT5 Multilanguage Text Corpus | TIDES E-corpus |
LDC2004E45 | TDT5-2004 Reference Annotations - Version 3.0 | TIDES E-corpus |
LDC2005E16 | HARD Annotations 2003 | TIDES E-corpus |
LDC2005E17 | HARD Annotations 2004 | TIDES E-corpus |