Obtaining Data | Using Data | Providing Data | Creating Data
About LDC | Members | Catalog | Projects | Papers | LDC Online | Search | Contact Us | UPenn | Home


What's New! What's Free!


LDC’s 20th Anniversary ~ concluding a year of celebration
Spring 2013 LDC Data Scholarship recipients ~ student recipients
Publications pipeline ~ planned publications for this year
Invitation to Join for Membership Year 2013 ~ join for 2013
RTE update for Penn Discourse Treebank ~ available for download
LDC 20th Anniversary Workshop Podcasts ~ listen to staff interviews; podcasts available through LDC blog
2012 User Survey Results ~ now available
Language Resource Wiki ~ meta-resource on language resources
LDC Providing Guidelines ~ enhanced guidelines for submitting corpora for publication by LDC
LDC Data Sheets ~ concise descriptions of LDC projects, operations, and technical capabilities
What's New Archive

New Corpora

1993-2007 United Nations Parallel Text ~ approximately 673K raw text documents and 520K word alignment documents in the official languages of the UN
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web ~ 158K tokens of word aligned Chinese and English parallel text enriched with linguistic tags
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 ~ 123 hours of Arabic broadcast conversation speech
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 ~ 752K transcribed Arabic broadcast conversation data tokens
NIST 2012 Open Machine Translation (OpenMT) Evaluation ~ 222 Chinese newswire and web data documents with corresponding source and reference files
New Corpora Archive

Employment at the LDC

ACL Anthology ~ A Digital Archive of Research Papers in Computational Linguistics

OLAC ~ Open Language Archives Community

Linguistic Resources
Linguistic Data Consortium

The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.


LDC is supported in part by grant IRI-9528587 from the Information and Intelligent Systems division and grant 9982201 from the Human Computer Interaction Program of the National Science Foundation. LDC's corpus creation efforts are powered in part by Academic Equipment Grant 7826-990 237-US from Sun Microsystems.


Contact ldc@ldc.upenn.edu
Last modified: Thursday, 21-Mar-2013 11:49:20 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.