
Linguistic Resources  
What's New! What's Free! Archive


Free Corpora


Free TalkBank Corpora. TalkBank is an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. The LDC distributes grant-covered copies of the following TalkBank corpora:



Free copies of all of the above corpora are still available; shipping and handling fees apply for data on disc. TalkBank also funded the distribution of 50 free copies of the American National Corpus (ANC) Second Release and 100 free copies of the SLX Corpus of Classic Sociolinguistic Interviews.


Additional Free Corpora - shipping and handling fees apply for data on disc

American English Nickname Collection ~a compilation of 331K American English nicknames to given name mappings

Asian Spoken Language Sampler ~a variety of speech and transcript samples from LDC's Asian language publications

Buckwalter Arabic Morphological Analyzer Version 1.0 ~Arabic-English prefix, suffix, and stem lexicons supplemented by three morphological compatibility tables

Catalan TimeBank 1.0 ~210 Catalan documents annotated with temporal and event information

English Web Treebank ~50,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure; first 50 copies available at no cost

FactBank 1.0 ~news text with event mentions annotated with degree of factuality

Indian Language Part-of-Speech Tagset: Bengali ~100K words of manually annotated Bengali text

Indian Language Part-of-Speech Tagset: Hindi ~98K words of manually annotated Hindi text

Indian Language Part-of-Speech Tagset: Sanskrit ~57K words of manually annotated Sanskrit text

LDC Spoken Language Sampler ~a variety of speech, transcript, and lexicon samples from LDC's publications

Malto Speech and Transcripts ~8 hours of transcribed Malto speech data from 27 speakers

ModeS TimeBank 1.0 ~Modern Spanish text annotated with TimeML and SpatialML mark-up

Manually Annotated Sub-Corpus First Release ~80K words of spoken and written American English with various annotations

OntoNotes 1.0 ~English and Chinese news text material with Treebank, PropBank, word sense, and coreference annotation

OntoNotes 2.0 ~English and Chinese news text and broadcast news material with Treebank, PropBank, word sense, and coreference annotation

OntoNotes 3.0 ~English and Chinese news text, broadcast news, and broadcast conversation, and Arabic news text material with Treebank, PropBank, word sense, and coreference annotation

OntoNotes 4.0 ~English and Chinese news text, broadcast news, and broadcast conversation, and Arabic news text material with Treebank, PropBank, word sense, and coreference annotation

SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages ~120K words extracted from the OntoNotes corpus and formatted for the SemEval task

Timebank 1.2 ~news text annotated with temporal information, adding events, times and temporal links between events and times

Unified Linguistic Annotation Text Collection ~effort to create a unified framework for different layers of annotation

Free Web 1T 5-gram Copies Available - LDC would like to thank Google for its kind sponsorship of nearly 200 free copies of the Web 1T 5-gram data for university researchers. To date, all copies have been claimed. The data is available for licensing at the regular Non-member Fee.


Free Software


XTrans Toolkit - tool to support transcription tasks in multiple languages on multiple platforms

Tools for converting SPHERE speech files to other formats. Nearly all LDC speech corpora are published with speech files in NIST SPHERE format. LDC provides two programs that will convert SPHERE files to other formats:
  • sph_convert_2.1 For converting many files at once. Suitable for Windows systems. Makes batch conversion at the corpus level simpler, but provides less flexibility and control.
  • sph2pipe_v2.5 For converting one file at a time. Provides more flexibility and control, and is suitable for use on all operating systems.

For further information on these tools, please visit LDC's Using page and scroll down to the section entitled "Speech (Digital Audio) Files".
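
Because sph2pipe converts one file at a time, converting a whole corpus with it usually means wrapping it in a small script. The following is a minimal sketch of such a wrapper, not an LDC-provided tool. It assumes that sph2pipe is installed and on the PATH, that `-f wav` selects WAV output in your version (check the tool's usage message), and that the speech files end in `.sph`. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Tuple


def build_sph2pipe_command(src: Path, dst: Path) -> List[str]:
    """Return the argument list for converting one SPHERE file to WAV."""
    # Assumption: "-f wav" requests WAV output from sph2pipe.
    return ["sph2pipe", "-f", "wav", str(src), str(dst)]


def plan_conversions(corpus_dir: Path, out_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every .sph file under corpus_dir with a .wav path under out_dir."""
    pairs = []
    # Assumption: speech files use a lowercase .sph extension.
    for src in sorted(corpus_dir.rglob("*.sph")):
        # Mirror the corpus directory layout in the output directory.
        dst = out_dir / src.relative_to(corpus_dir).with_suffix(".wav")
        pairs.append((src, dst))
    return pairs


def convert_corpus(corpus_dir: Path, out_dir: Path) -> int:
    """Run sph2pipe once per file and return the number of files converted."""
    if shutil.which("sph2pipe") is None:
        raise RuntimeError("sph2pipe was not found on PATH")
    count = 0
    for src, dst in plan_conversions(corpus_dir, out_dir):
        dst.parent.mkdir(parents=True, exist_ok=True)
        # check=True stops the batch at the first file that fails to convert.
        subprocess.run(build_sph2pipe_command(src, dst), check=True)
        count += 1
    return count
```

Calling `convert_corpus(Path("corpus"), Path("wav"))` would then write one WAV file per SPHERE file, keeping the corpus directory layout. Both directory names are placeholders.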

ESPS Software - signal processing programs that can be used for the analysis, manipulation and labeling of speech.

Annotation Graph Toolkit (AGTK) - software infrastructure for linguistic annotation.

Transcriber - tool for segmenting, labeling and transcribing speech.

Champollion - parallel text sentence alignment tool designed to work with as many language pairs as possible.


Press Releases


15th Anniversary Monthly Spotlight Archive - as part of our 15th Anniversary celebration in 2007, we highlighted one aspect of the LDC in our monthly newsletters. These features provided our members and data users with a glimpse of the broad range of LDC’s research activities.

Conference Attendance by LDC - recent publisher displays and conference participation by LDC.

Etc. - recent collaborations and grant awards plus other announcements.

Membership Mailbag Archive - to address the questions that our data users have asked, we introduced our Membership Mailbag series of newsletter articles in May 2008. This periodic series answers frequently asked questions about LDC data, the LDC Intranet, and the benefits of an LDC membership.

Member Surveys - LDC conducted two end-of-year surveys to obtain feedback on satisfaction levels with LDC Membership and data releases as well as our corpus catalog, and to gather suggestions on future publications.

Milestones and Celebrations - information on our landmark corpora distributions and events to celebrate our 10th and 15th anniversary years.

Use of LDC Corpora by Students - ways LDC corpora have been used for student research and for teaching purposes at university summer school programs.






Contact ldc@ldc.upenn.edu
Last modified: Friday, 18-Jan-2013 15:58:02 EST
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.