Approved GALE sites may request copies of any corpora listed below. Your site's designated data contact person should email LDC's membership group at ldc@ldc.upenn.edu [1], requesting data by Catalog ID and Title.
In addition to resources created specifically for the program, GALE sites are eligible to receive general LDC publications as well as selected e-corpora created for other sponsored programs. General LDC publications are also available to all LDC members; many are also available to non-members through individual corpus licenses. Click on a Catalog ID for more information about general release corpora. E-corpora are available only to GALE sites and to members of specified user communities. Most e-corpora will be made available as general LDC publications in the future.
(updated 1.27.2009)
Corpora Created for the GALE Program
General LDC Releases Relevant to GALE
Catalog ID | Title | Status |
LDC2002T01 [2] | Multiple-Translation Chinese Corpus | LDC Publication |
LDC2002S06 [3] | Switchboard-2 Phase III Audio | LDC Publication |
LDC2002S10 [4] | 1998 HUB5 English Evaluation | LDC Publication |
LDC2002S13 [5] | 2001 HUB5 English Evaluation | LDC Publication |
LDC2002S12 [6] | 2001 HUB5 Mandarin Evaluation | LDC Publication |
LDC2002S11 [7] | 1997 HUB4 English Evaluation Speech and Transcripts | LDC Publication |
LDC2002L27 [8] | Chinese-English Translation Lexicon Version 3.0 | LDC Publication |
LDC2002S22 [9] | 1997 HUB5 Arabic Evaluation | LDC Publication |
LDC99L22 [10] | Egyptian Colloquial Arabic Lexicon | LDC Publication |
LDC2002T39 [11] | 1997 HUB5 Arabic Transcripts | LDC Publication |
LDC2002S37 [12] | Callhome Egyptian Arabic Speech Supplement | LDC Publication |
LDC2002T38 [13] | Callhome Egyptian Arabic Transcripts Supplement | LDC Publication |
LDC2002S24 [14] | 1997 HUB5 German Evaluation | LDC Publication |
LDC2002L49 [15] | Buckwalter Arabic Morphological Analyzer Version 1.0 | LDC Publication |
LDC2002S25 [16] | 1997 HUB5 Spanish Evaluation | LDC Publication |
LDC2003T05 [17] | English Gigaword | LDC Publication |
LDC2003T06 [18] | Arabic Treebank: Part 1 v 2.0 | LDC Publication |
LDC2003T01 [19] | 2001 HUB5 Mandarin Transcripts | LDC Publication |
LDC2003T07 [20] | Arabic Treebank: Part 1 - 10K-word English Translation | LDC Publication |
LDC2003T03 [21] | 1997 HUB5 German Transcripts | LDC Publication |
LDC2003T04 [22] | 1997 HUB5 Spanish Transcripts | LDC Publication |
LDC2003T02 [23] | 1998 HUB5 English Transcripts | LDC Publication |
LDC2003T09 [24] | Chinese Gigaword | LDC Publication |
LDC2003T12 [25] | Arabic Gigaword | LDC Publication |
LDC2003T11 [26] | ACE-2 Version 1.0 | LDC Publication |
LDC2003T17 [27] | Multiple-Translation Chinese (MTC) Part 2 | LDC Publication |
LDC2003T18 [28] | Multiple-Translation Arabic (MTA) Part 1 | LDC Publication |
LDC2004T09 [29] | TIDES Extraction (ACE) 2003 Multilingual Training Data | LDC Publication |
LDC2004T05 [30] | Chinese Treebank Version 4.0 | LDC Publication |
LDC2004T11 [31] | Arabic Treebank: Part 3 v 1.0 | LDC Publication |
LDC2004S08 [32] | MDE RT-03 Training Data Speech | LDC Publication |
LDC2004T12 [33] | MDE RT-03 Training Data Text and Annotations | LDC Publication |
LDC2004T07 [34] | Multiple-Translation Chinese (MTC) Part 3 | LDC Publication |
LDC2004T08 [35] | Hong Kong Parallel Text | LDC Publication |
LDC2004T17 [36] | Arabic News Translation Text Part 1 | LDC Publication |
LDC2004S07 [37] | Switchboard Cellular Part 2 Audio | LDC Publication |
LDC2004S11 [38] | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | LDC Publication |
LDC2004L02 [39] | Buckwalter Arabic Morphological Analyzer Version 2.0 | LDC Publication |
LDC2004S13 [40] | Fisher English Training Speech Part 1 Speech | LDC Publication |
LDC2004T19 [41] | Fisher English Training Speech Part 1, Transcripts | LDC Publication |
LDC2005T01 [42] | Chinese Treebank 5.0 | LDC Publication |
LDC2005S07 [43] | Levantine Arabic QT Training Data Set 3 Speech | LDC Publication |
LDC2005T03 [44] | Levantine Arabic QT Training Data Set 3 Transcripts | LDC Publication |
LDC2005T07 [45] | ACE Time Normalization (TERN) 2004 English Training Data V1.0 | LDC Publication |
LDC2005T05 [42] | Multiple-Translation Arabic (MTA) Part 2 | LDC Publication |
LDC2005T09 [46] | ACE 2004 Multilingual Training Corpus | LDC Publication |
LDC2005T06 [47] | Chinese News Translation Text Part 1 | LDC Publication |
LDC2005S13 [48] | Fisher English Training Part 2, Speech | LDC Publication |
LDC2005T19 [49] | Fisher English Training Part 2, Transcripts | LDC Publication |
LDC2005S11 [50] | TDT4 Multilingual Broadcast News Speech Corpus | LDC Publication |
LDC2005T16 [51] | TDT4 Multilingual Text and Annotations | LDC Publication |
LDC2005T10 [52] | Chinese English News Magazine Parallel Text | LDC Publication |
LDC2005S14 [53] | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | LDC Publication |
LDC2005T12 [54] | English Gigaword Second Edition | LDC Publication |
LDC2005T14 [55] | Chinese Gigaword Second Edition | LDC Publication |
LDC2005T23 [56] | Chinese Proposition Bank 1.0 | LDC Publication |
LDC97S66 [57] | 1996 English Broadcast News Dev and Eval (Hub-4) | LDC Publication |
LDC97S44 [58] | 1996 English Broadcast News Speech (Hub-4) | LDC Publication |
LDC97T22 [59] | 1996 English Broadcast News Transcripts (Hub-4) | LDC Publication |
LDC98S71 [60] | 1997 English Broadcast News Speech (Hub-4) | LDC Publication |
LDC98T28 [61] | 1997 English Broadcast News Transcripts (Hub-4) | LDC Publication |
LDC2001S91 [62] | 1997 HUB-4 Broadcast News Evaluation Non English Test Material | LDC Publication |
LDC98S73 [63] | 1997 Mandarin Broadcast News Speech (Hub-4NE) | LDC Publication |
LDC98T24 [64] | 1997 Mandarin Broadcast News Transcripts (Hub-4NE) | LDC Publication |
LDC93T1 [65] | ACL/DCI | LDC Publication |
LDC99L23 [66] | American English Spoken Lexicon | LDC Publication |
LDC2001T55 [67] | Arabic Newswire Part 1 | LDC Publication |
LDC2000T43 [68] | BLLIP 1987-89 WSJ Corpus Release 1 | LDC Publication |
LDC96S46 [69] | CALLFRIEND American English-Non-Southern Dialect | LDC Publication |
LDC96S47 [70] | CALLFRIEND American English-Southern Dialect | LDC Publication |
LDC96S49 [71] | CALLFRIEND Egyptian Arabic | LDC Publication |
LDC96S55 [72] | CALLFRIEND Mandarin Chinese-Mainland Dialect | LDC Publication |
LDC96S56 [73] | CALLFRIEND Mandarin Chinese-Taiwan Dialect | LDC Publication |
LDC97L20 [74] | CALLHOME American English Lexicon (PRONLEX) | LDC Publication |
LDC97S42 [75] | CALLHOME American English Speech | LDC Publication |
LDC97T14 [76] | CALLHOME American English Transcripts | LDC Publication |
LDC97S45 [77] | CALLHOME Egyptian Arabic Speech | LDC Publication |
LDC97T19 [78] | CALLHOME Egyptian Arabic Transcripts | LDC Publication |
LDC96L15 [79] | CALLHOME Mandarin Chinese Lexicon | LDC Publication |
LDC96S34 [80] | CALLHOME Mandarin Chinese Speech | LDC Publication |
LDC96T16 [81] | CALLHOME Mandarin Chinese Transcripts | LDC Publication |
LDC96L14 [82] | CELEX2 | LDC Publication |
LDC2001T11 [83] | Chinese Treebank Version 2.0 | LDC Publication |
LDC95T11 [84] | European Language Newspaper Text | LDC Publication |
LDC2000T50 [85] | Hong Kong Hansards Parallel Text | LDC Publication |
LDC2000T47 [86] | Hong Kong Laws Parallel Text | LDC Publication |
LDC2000T46 [87] | Hong Kong News Parallel Text | LDC Publication |
LDC98S69 [88] | Hub-5 Mandarin Telephone Speech Corpus | LDC Publication |
LDC98T26 [89] | Hub-5 Mandarin Transcripts | LDC Publication |
LDC95T13 [90] | Mandarin Chinese News Text | LDC Publication |
LDC2001T02 [91] | Message Understanding Conference (MUC) 7 | LDC Publication |
LDC95T21 [92] | North American News Text Corpus | LDC Publication |
LDC98T30 [93] | North American News Text Supplement | LDC Publication |
LDC97S62 [94] | SWITCHBOARD-1 Release 2 | LDC Publication |
LDC2001S13 [95] | Switchboard Cellular Part 1 Audio | LDC Publication |
LDC2001S15 [96] | Switchboard Cellular Part 1 Transcribed Audio | LDC Publication |
LDC2001T14 [97] | Switchboard Cellular Part 1 Transcription | LDC Publication |
LDC98S75 [98] | Switchboard-2 Phase 1 | LDC Publication |
LDC99S79 [99] | Switchboard-2 Phase II | LDC Publication |
LDC98T25 [100] | TDT Pilot Study Corpus | LDC Publication |
LDC2000S92 [101] | TDT2 Careful Transcription Audio | LDC Publication |
LDC2000T44 [102] | TDT2 Careful Transcription Text | LDC Publication |
LDC99S84 [103] | TDT2 English Audio | LDC Publication |
LDC99T37 [104] | TDT2 English Text, Version 2 | LDC Publication |
LDC2001S93 [105] | TDT2 Mandarin Audio Corpus | LDC Publication |
LDC99T38 [106] | TDT2 Mandarin Text | LDC Publication |
LDC2001T57 [107] | TDT2 Multilanguage Text Version 4.0 | LDC Publication |
LDC2001S94 [108] | TDT3 English Audio | LDC Publication |
LDC2001S95 [109] | TDT3 Mandarin Audio | LDC Publication |
LDC2001T58 [110] | TDT3 Multilanguage Text Version 2.0 | LDC Publication |
LDC93T3A [111] | TIPSTER Complete | LDC Publication |
LDC2000T52 [112] | TREC Mandarin | LDC Publication |
LDC2000T51 [113] | TREC Spanish | LDC Publication |
LDC98S72 [114] | Taiwanese Putonghua Speech and Transcripts | LDC Publication |
LDC99T42 [115] | Treebank-3 | LDC Publication |
LDC94T4B-1 [116] | UN Parallel Text (English) | LDC Publication |
LDC94T4B-3 [117] | UN Parallel Text (Spanish) | LDC Publication |
LDC2004S10 [118] | Santa Barbara Corpus of Spoken American English III | LDC Publication |
LDC2004T14 [119] | Proposition Bank I | LDC Publication |
LDC2004T23 [120] | Prague Arabic Dependency Treebank 1.0 | LDC Publication |
LDC2005S15 [121] | HKUST Mandarin Telephone Speech, Part 1 | LDC Publication |
LDC2005S16 [122] | MDE RT04 Training Data Speech | LDC Publication |
LDC2005S25 [123] | Santa Barbara Corpus of Spoken American English Part-IV | LDC Publication |
LDC2005T01U01 [124] | Chinese Treebank 5.1 | LDC Publication |
LDC2005T02 [125] | Arabic Treebank: Part 1 v 3.0 (POS with full vocal. + syntactic analysis) | LDC Publication |
LDC2005T08 [126] | Discourse Graphbank | LDC Publication |
LDC2005T13 [127] | CCGbank | LDC Publication |
LDC2005T20 [128] | Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis) | LDC Publication |
LDC2005T24 [129] | MDE RT-04 Training Data Text/Annotations | LDC Publication |
LDC2005T32 [130] | HKUST Mandarin Telephone Transcript Data, Part 1 | LDC Publication |
LDC2005T33 [131] | BBN Pronoun Coreference and Entity Type Corpus | LDC Publication |
E-Corpora Made for Other Programs Available to GALE
Catalog ID | Title | Status |
LDC2005E18 | ACE 2005 Multilingual Training Data V4.0 | ACE E-corpus |
LDC2005E25 | ACE 2005 Consistency Analysis - Training Data | ACE E-corpus |
LDC2004E39 | ACE 2004 English Consistency Analysis Data | ACE E-corpus |
LDC2004E18 | RT-04F STT Multilingual Speech Development Data V1.1 Re-release | EARS E-corpus |
LDC2004E16 | RT-04 MDE DevTest Set #1 Version 1.2 | EARS E-corpus |
LDC2004E10 | RT-04F STT Multilingual Speech Development Data - Supplement | EARS E-corpus |
LDC2004E19 | RT-04F STT Multilingual Transcripts Development Data V1.2 | EARS E-corpus |
LDC2004E24 | RT-04 MDE Annotation Consistency Study | EARS E-corpus |
LDC2004E29 | RT-04 MDE DevTest Set #2 V1.2 | EARS E-corpus |
LDC2004E33 | EARS MDE Diarization Scoring Package | EARS E-corpus |
LDC2004E65 | Levantine Arabic QT Training Data Set 2 Speech | EARS E-corpus |
LDC2004E66 | Levantine Arabic QT Training Data Set 2 Transcripts V1.2 | EARS E-corpus |
LDC2004E67 | RT-04F STT Chinese CTS Development Data Speech | EARS E-corpus |
LDC2004E68 | RT-04F STT Chinese CTS Development Data Transcripts | EARS E-corpus |
LDC2004E47 | RT-04 MDE Non-English Pilot Corpus V1.0 | EARS E-corpus |
LDC2004E28 | RT-04 STT Transcription Consistency Study | EARS E-corpus |
LDC2005E26 | Switchboard-1 Quick Transcripts from BBN/WordWave | EARS E-corpus |
LDC2005E73 | EARS RT04 Evaluation Transcripts and MDE Annotations | EARS E-corpus |
LDC2005E74 | EARS RT04 Evaluation Audio | EARS E-corpus |
LDC2002E14 | Chinese English Translation Lexicon Version 3-beta | TIDES E-corpus |
LDC2002E17 | English Translation of Chinese Treebank Version 1 beta | TIDES E-corpus |
LDC2002E19 | Hong Kong Hansard Parallel Text Version 2 beta | TIDES E-corpus |
LDC2002E16 | Hong Kong News Parallel Text Version 2 beta | TIDES E-corpus |
LDC2002E15 | UN Arabic English Parallel Text Version 1 beta | TIDES E-corpus |
LDC2002E18 | Xinhua Chinese English Parallel News Text Version 1 beta | TIDES E-corpus |
LDC2002E50 | Name-Annotated TDT Corpus Supplement for ACE | TIDES E-corpus |
LDC2002E54 | Multiple-Translation Arabic Corpus | TIDES E-corpus |
LDC2002E53 | Multiple-Translation Chinese Corpus 2.0 | TIDES E-corpus |
LDC2002E58 | Sinorama Chinese English Parallel Text | TIDES E-corpus |
LDC2003E01 | Chinese <-> English Name Entity Lists Version 1.0 beta | TIDES E-corpus |
LDC2003E05 | Arabic Translation Corpus Part 1 | TIDES E-corpus |
LDC2003E06 | Chinese Treebank 3.0 | TIDES E-corpus |
LDC2003E07 | Chinese Treebank English Parallel Corpus | TIDES E-corpus |
LDC2003E09 | Arabic News Translation Corpus Part 2 | TIDES E-corpus |
LDC2003E08 | Chinese News Translation Corpus Part 1 | TIDES E-corpus |
LDC2003E15 | HARD GovDocs | TIDES E-corpus |
LDC2003E25 | Hong Kong News Parallel Text | TIDES E-corpus |
LDC2004E07 | Arabic News Translation Corpus Part 3 | TIDES E-corpus |
LDC2004E08 | Arabic English Parallel News Text Part 1 | TIDES E-corpus |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | TIDES E-corpus |
LDC2004E09 | Hong Kong Hansard Parallel Text | TIDES E-corpus |
LDC2004E11 | Arabic News Translation Corpus Part 4 | TIDES E-corpus |
LDC2004E13 | UN Arabic English Parallel Text | TIDES E-corpus |
LDC2004E12 | UN Chinese English Parallel Text | TIDES E-corpus |
LDC2004E38 | ACE 2003 Evaluation Data (for 2004 DevTest) | TIDES E-corpus |
LDC2005E47 | Chinese English News Magazine Parallel Text | TIDES E-corpus |
LDC2005E11 | Arabic Treebank: Part 3 v.1.0 (POS + Syntactic Analysis of total corpus) | TIDES E-corpus |
LDC2005E12 | 2005 MSE Arabic-English Clusters V1.2 | TIDES E-corpus |
LDC2005E14 | MSE 2005 Sample Summary Topic | TIDES E-corpus |
LDC2005E13 | 2005 MSE Arabic-English Summaries V1.2 | TIDES E-corpus |
LDC2004E71 | ATB Part 3 (a) v.1.1 | TIDES E-corpus |
LDC2005E46 | Arabic Treebank English Translation | TIDES E-corpus |
LDC2002E27 | Chinese English Translation Dictionary v3.0 | TIDES E-corpus |
LDC2004E46 | DUC 2004 Arabic-English Summaries | TIDES E-corpus |
LDC2004E42 | HARD 2004 Reference Annotations | TIDES E-corpus |
LDC2004E41 | TDT5 Multilanguage Text Corpus | TIDES E-corpus |
LDC2004E45 | TDT5-2004 Reference Annotations - Version 3.0 | TIDES E-corpus |
LDC2005E16 | HARD Annotations 2003 | TIDES E-corpus |
LDC2005E17 | HARD Annotations 2004 | TIDES E-corpus |