Certain LDC data sets are governed by corpus-specific license agreements that supersede the LDC membership agreements and the LDC User Agreement for Non-Members [1]. These agreements must therefore be signed by all licensees (members and nonmembers). Below is a list of the corpora to which such licenses apply, with links to the agreements.
Fax all completed user agreements to +1.215.573.2175 or scan and email them to the Membership Office [2].
- 2024
- 2023
- 2022
- 2021
- 2020
- 2019
- 2018
- 2017
- 2016
- 2015
- 2014
- 2013
- 2012
- 2011
- 2010
- 2009
- 2008
- 2007
- 2006
- 2005
- 2004
- 2002
- 2001
- 2000
- 1999
- 1998
- 1997
- 1996
- 1995
- 1994
- 1993
2024
LDC2024S04 [3] BabyEars Affective Vocalizations
Member/Nonmember [4]
LDC2024S02 [5] Second Language University Speech Intelligibility Corpus
Member/Nonmember [6]
2023
LDC2023L01 [7] Moroccan Arabic - English Lexical Database
Member/Nonmember [8]
LDC2023S05 [9] Samrómur Queries Icelandic Speech 1.0
For-Profit Member [10], Not-For-Profit Member [11], Nonmember [12]
2022
LDC2022S08 [13] MASRI Synthetic
Member/Nonmember [14]
LDC2022S11 [15] Samrómur Children Icelandic Speech 1.0
For-Profit Member [16], Not-For-Profit Member [17], Nonmember [18]
LDC2022S05 [19] Samrómur Icelandic Speech 1.0
For-Profit Member [20], Not-For-Profit Member [21], Nonmember [22]
LDC2022S07 [23] Second DIHARD Challenge Evaluation - SEEDLingS
For-Profit Member [24], Not-For-Profit Member [25], Nonmember [26]
LDC2022S03 [27] Spoken Digits in Hindi and Indian English
Member/Nonmember [28]
2021
LDC2021S01 [29] Althingi Parliamentary Speech
For-Profit Member [30], Not-For-Profit Member [31], Nonmember [32]
LDC2021S02 [33] Columbia Games Corpus
Member/Nonmember [34]
LDC2021S06 [35] Ethnobotanical Research and Language Documentation of Nahuatl
Member/Nonmember [36]
LDC2021S05 [37] MyST Children's Conversational Speech
Member/Nonmember [38]
LDC2021S11 [39] Second DIHARD Challenge Development - SEEDLingS
For-Profit Member [40], Not-For-Profit Member [41], Nonmember [42]
LDC2021S04 [43] The SSNCE Database of Tamil Dysarthric Speech
Member/Nonmember [44]
2020
LDC2020T23 [45] Corpus of Law, Academic, and News
Member/Nonmember [46]
LDC2020L01 [47] Database of Word Level Statistics – Mandarin
Member/Nonmember [48]
LDC2020T06 [49] EVALution
Member/Nonmember [50]
LDC2020S02 [51] IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b
Member [52], Nonmember [53]
LDC2020S07 [54] IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
Member [55], Nonmember [56]
LDC2020S10 [57] IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b
Member [58], Nonmember [59]
LDC2020T16 [60] Penn Parsed Corpora of Historical English
Member/Nonmember [61]
LDC2020T12 [62] SemTransCNC
Member/Nonmember [63]
2019
LDC2019S22 [64] IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b
Member [65], Nonmember [66]
LDC2019S16 [67] IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c
Member [68], Nonmember [69]
LDC2019S13 [70] First DIHARD Challenge Evaluation - SEEDLingS
LDC2019S10 [71] First DIHARD Challenge Development – SEEDLingS
Member [72], Nonmember [73]
LDC2019S11 [74] USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
Member [75], Nonmember [76]
LDC2019S08 [77] IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c
Member [78], Nonmember [79]
LDC2019S03 [80] IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b
Member [81], Nonmember [82]
LDC2019S01 [83] SRI Speech-Based Collaborative Learning Corpus
Member/Nonmember [84]
2018
LDC2018S01 [85] DIRHA English WSJ Audio
Member/Nonmember [86]
LDC2018T05 [87] H2, E2, ERK1 Children's Writing
Member/Nonmember [88]
LDC2018S07 [89] IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
Member [90], Nonmember [91]
LDC2018S13 [92] IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a
Member [93], Nonmember [94]
LDC2018S16 [95] IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a
Member [96], Nonmember [97]
LDC2018S02 [98] IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e
Member [99], Nonmember [100]
LDC2018S17 [101] Nautilus Speaker Characterization
Member/Nonmember [102]
LDC2018T13 [103] TRAD Arabic-French Parallel Text -- Newsgroup
Member [104], Nonmember [105]
LDC2018T21 [106] TRAD Arabic-French Parallel Text -- Newswire
Member [107], Nonmember [108]
LDC2018T02 [109] TRAD Chinese-French Parallel Text -- Blog
Member [110], Nonmember [111]
LDC2018T17 [112] TRAD Chinese-French Parallel Text -- Broadcast News
Member [113], Nonmember [114]
2017
LDC2017S21 [115] ASpIRE Development and Development Test Sets
Member [116], Nonmember [117]
LDC2017T03 [118] First-Year Law Students' Court Memoranda
Member/Nonmember [119]
LDC2017S03 [120] IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b
Member [121], Nonmember [122]
LDC2017S22 [123] IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
Member [124], Nonmember [125]
LDC2017S08 [126] IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a
Member [127], Nonmember [128]
LDC2017S05 [129] IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
Member [130], Nonmember [131]
LDC2017S13 [132] IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
Member [133], Nonmember [134]
LDC2017S01 [135] IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7
Member [136], Nonmember [137]
LDC2017S19 [138] IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e
Member [139], Nonmember [140]
LDC2017S17 [141] Vehicle City Voices - Part 1
Member/Nonmember [142]
2016
LDC2016T22 [143] A Corpus of Chinese-English Parallel Sentences Extracted from Patents
For-Profit Members/For-Profit Nonmembers [144]
LDC2016S04 [145] CHM150
Member/Nonmember [146]
LDC2016S05 [147] Digital Archive of Southern Speech - NLP Version
For-Profit Member [148]
LDC2016T01 [149] H1 Children's Writing
Member/Nonmember [150]
LDC2016S06 [151] IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a
Member [152], Nonmember [153]
LDC2016S02 [154] IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
Member [155], Nonmember [156]
LDC2016S08 [157] IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b
Member [158], Nonmember [159]
LDC2016S12 [160] IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
Member [161], Nonmember [162]
LDC2016S09 [163] IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
Member [164], Nonmember [165]
LDC2016S13 [166] IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g
Member [167], Nonmember [168]
LDC2016S10 [169] IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
Member [170], Nonmember [171]
LDC2016T24 [172] JANA: A Human-Human Dialogues Corpus for Egyptian Dialect
For-Profit Members/For-Profit Nonmembers [173]
2015
LDC2015S10 [174] Arabic Learner Corpus
Member/Nonmember [175]
LDC2015T03 [176] Avocado Research Email Collection
Member/Nonmember [177]
LDC2015S05 [178] Mandarin Chinese Phonetic Segmentation and Tone
Member [179]
2014
LDC2014T24 [180] Boulder Lies and Truth
Member/Nonmember [181]
LDC2014T06 [182] ETS Corpus of Non-Native Written English
Member/Nonmember [183]
LDC2014S02 [184] King Saud University Arabic Speech Database
Member/Nonmember [185]
LDC2014S03 [186] Multi-Channel WSJ Audio
Member/Nonmember [187]
LDC2014S08 [188] United Nations Proceedings Speech
Member/Nonmember [189]
LDC2014S04 [190] USC-SFI MALACH Interviews and Transcripts Czech
Member [191], Nonmember [192]
2013
LDC2013T06 [193] 1993-2007 United Nations Parallel Text
Member/Nonmember [194]
LDC2013S09 [195] CSC Deceptive Speech
Member/Nonmember [196]
2012
LDC2012T03 [197] 2009 CoNLL Shared Task Part 1
Member/Nonmember [198]
LDC2012T11 [199] American English Nickname Collection
Member/Nonmember [200]
LDC2012S03 [201] Digital Archive of Southern Speech
For-Profit Member [202]
LDC2012S05 [203] USC-SFI MALACH Interviews and Transcripts English
Member [204], Nonmember [205]
2011
LDC2011T04 [206] Indian Language Part-of-Speech Tagset: Sanskrit
Member/Nonmember [207]
2010
LDC2010T06 [208] Chinese Web 5-gram Version 1
Member/Nonmember [209]
LDC2010T16 [210] Indian Language Part-of-Speech Tagset: Bengali
LDC2010T24 [211] Indian Language Part-of-Speech Tagset: Hindi
Member/Nonmember [212]
LDC2010L01 [213] LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
Member [214]
2009
LDC2009V01 [215] Audiovisual Database of Spoken American English
Member/Nonmember [216]
LDC2009T04 [217] BioProp Version 1.0
Member/Nonmember [218]
LDC2009S01 [219] CSLU: Numbers Version 1.3
LDC2009S03 [220] CSLU: S4X Release 1.2
Member/Nonmember [221]
LDC2009T08 [222] Japanese Web N-gram Version 1
Member/Nonmember [223]
LDC2009T25 [224] Web 1T 5-gram, 10 European Languages Version 1
Member/Nonmember [225]
2008
LDC2008T13 [226] BLLIP North American News Text, Complete
Member [227]
LDC2008T14 [228] BLLIP North American News Text, General Release
Member/Nonmember [229]
LDC2008S06 [230] CSLU: Alphadigit Version 1.3
LDC2008S07 [231] CSLU: ISOLET Spoken Letter Database Version 1.3
LDC2008S02 [232] CSLU: National Cellular Telephone Speech Release 2.3
LDC2008S01 [233] CSLU: Portland Cellular Telephone Speech Version 1.3
Member/Nonmember [221]
LDC2008T22 [234] Czech Academic Corpus 2.0
Member/Nonmember [235]
LDC2008L02 [236] Hindi WordNet
Member/Nonmember [237]
LDC2008T01 [238] Hungarian-English Parallel Text, Version 1.0
Member/Nonmember [239]
LDC2008T15 [240] North American News Text, Complete
Member [241]
LDC2008T16 [242] North American News Text, General Release
Nonmember [243]
2007
LDC2007T22 [244] 2001 Topic Annotated Enron Email Data Set
Member/Nonmember [245]
LDC2007S08 [246] CSLU: Foreign Accented English Release 1.2
LDC2007S18 [247] CSLU: Kids' Speech Version 1.1
LDC2007S13 [248] CSLU: Apple Words and Phrases
LDC2007S05 [249] CSLU: Yes/No Version 1.2
Member/Nonmember [221]
LDC2007S09 [250] Mandarin Affective Speech
Member/Nonmember [251]
LDC2007T19 [252] MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)
Member [253]
LDC2007S15 [254] Nationwide Speech Project
Member/Nonmember [255]
2006
LDC2006S15 [256] CSLU: Spelled and Spoken Words
LDC2006S14 [257] CSLU: Stories v 1.2
LDC2006S35 [258] CSLU: Multilanguage Telephone Speech Version 1.2
LDC2006S39 [259] CSLU: Names Release 1.3
LDC2006S26 [260] CSLU: Speaker Recognition Version 1.1
LDC2006S16 [261] CSLU: Spoltech Brazilian Portuguese Version 1.0
LDC2006S01 [262] CSLU: Voices
Member/Nonmember [221]
LDC2006T03 [263] Korean Propbank
Member/Nonmember [264]
LDC2006T09 [265] Korean Treebank Annotations Version 2.0
Member/Nonmember [266]
LDC2006S13 [267] N4 NATO Native and Non-Native Speech
Member/Nonmember [268]
LDC2006T01 [269] Prague Dependency Treebank 2.0
Member/Nonmember [270]
LDC2006S30 [271] Speech Controlled Computing
Member/Nonmember [272]
LDC2006T13 [273] Web 1T 5-gram Version 1
Member/Nonmember [274]
2005
LDC2005T35 [275] American National Corpus (ANC) Second Release
Member/Nonmember: Open Portion [276], Restricted Portion [277]
2004
LDC2004L02 [278] Buckwalter Arabic Morphological Analyzer Version 2.0
Member [279]
LDC2004T23 [280] Prague Arabic Dependency Treebank 1.0
Member/Nonmember [281]
LDC2004T25 [282] Prague Czech-English Dependency Treebank 1.0
Member/Nonmember [283]
2002
LDC2002S11 [284] 1997 HUB4 English Evaluation Speech and Transcripts
Member/Nonmember [285]
LDC2002T26 [286] Korean English Treebank Annotations
Member/Nonmember [287]
2001
LDC2001T62 [288] CETEMpublico
Member/Nonmember [289]
2000
LDC2000S86 [290] 1998 HUB4 Broadcast News Evaluation English Test Material
Member [291]
LDC2000T43 [292] BLLIP 1987-89 WSJ Corpus Release 1
Member/Nonmember [293]
LDC2000T52 [294] TREC Mandarin
Member/Nonmember [295]
LDC2000T51 [296] TREC Spanish
Member/Nonmember [297]
1999
LDC99L22 [298] Egyptian Colloquial Arabic Lexicon
For-Profit Member [299], Not-For-Profit Member [300], Nonmember [301]
LDC99T34 [302] Japanese Business News Text Supplement
Member [303]
LDC99S82 [304] USC Marketplace Broadcast News Speech
LDC99T36 [305] USC Marketplace Broadcast News Transcripts
Member/Nonmember [285]
1998
LDC98T31 [306] 1996 CSR HUB4 Language Model
Member [307]
LDC98S73 [308] 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24 [309] 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
Member [310]
LDC98L21 [311] COMLEX English Syntax Lexicon
Member/Nonmember [312]
LDC98T30 [313] North American News Text Supplement
Member [314]
LDC98T25 [315] TDT Pilot Study Corpus
Member/Nonmember [316]
1997
LDC97S66 [317] 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 [318] 1996 English Broadcast News Speech (HUB4)
LDC97T22 [319] 1996 English Broadcast News Transcripts (HUB4)
Member [320]
LDC97L20 [321] CALLHOME American English Lexicon (PRONLEX)
LDC97L18 [322] CALLHOME German Lexicon
Member [323], Nonmember [324]
LDC97S63 [325] The CMU Kids Corpus
Member/Nonmember [326]
1996
LDC96L17 [327] CALLHOME Japanese Lexicon
LDC96L15 [328] CALLHOME Mandarin Chinese Lexicon
LDC96L16 [329] CALLHOME Spanish Lexicon
Member [323], Nonmember [324]
LDC96L14 [330] CELEX2
Member/Nonmember [331]
LDC96S33 [332] CSR-IV HUB3
Member [333]
LDC96S31 [334] CSR-IV HUB4
Member/Nonmember [335]
LDC96T10 [336] Message Understanding Conference (MUC) 6 Additional News Text
Member/Nonmember [337]
1995
LDC95T6 [338] CSR-III Text
Member [339]
LDC95T11 [340] European Language Newspaper Text
Member [341]
LDC95T8 [342] Japanese Business News Text
Member [303]
LDC95S28 [343] LATINO-40 Spanish Read News
Member/Nonmember [344]
LDC95T13 [345] Mandarin Chinese News Text
Member/Nonmember [346]
LDC95T21 [347] North American News Text Corpus
Member [348]
LDC95T9 [349] Spanish News Text
Member [344]
1994
LDC94T5 [350] ECI Multilingual Text
Member/Nonmember [351]
LDC94T4A [352] UN Parallel Text (Complete)
LDC94T4B-1 [353] UN Parallel Text (English)
LDC94T4B-2 [354] UN Parallel Text (French)
LDC94T4B-3 [355] UN Parallel Text (Spanish)
Member/Nonmember [356]
1993
LDC93T1 [357] ACL/DCI
Member/Nonmember [358]
LDC93T3A [359] TIPSTER Complete
LDC93T3B [360] TIPSTER Volume 1
LDC93T3C [361] TIPSTER Volume 2
LDC93T3D [362] TIPSTER Volume 3
Member/Nonmember [363]