
Linguistic Resources  
What's New! What's Free! Archive


Free Corpora/Software


Free TalkBank Corpora. TalkBank is an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. The LDC distributes grant-covered copies of the following TalkBank corpora:



Free copies of all of the above corpora are still available; a US$30 shipping and handling fee applies for data on disc.


Additional Free Corpora

FactBank 1.0 ~news text with event mentions annotated with degree of factuality

Unified Linguistic Annotation Text Collection ~effort to create a unified framework for different layers of annotation

Timebank 1.2 ~news text annotated with temporal information, adding events, times, and temporal links between events and times

Free Web 1T 5-gram Copies Available - LDC would like to thank Google for its kind sponsorship of nearly 200 free copies of the Web 1T 5-gram data for university researchers. To date, all copies have been claimed. The data is available for licensing at the regular Non-member Fee.

Buckwalter Arabic Morphological Analyzer Version 1.0 ~Arabic-English prefix, suffix, and stem lexicons supplemented by three morphological compatibility tables

Free Software


XTrans Toolkit - tool to support transcription tasks in multiple languages on multiple platforms

Tools for converting SPHERE speech files to other formats. Nearly all LDC speech corpora are published with speech files in NIST SPHERE format. LDC provides two programs that will convert SPHERE files to other formats:
  • sph_convert_2.1: for converting many files at once. Suitable for Windows systems. Makes batch conversion at the corpus level simpler, but provides less flexibility and control.
  • sph2pipe_v2.5: for accessing one file at a time. Provides more flexibility and control, and is suitable for use on all operating systems.

For further information on these tools, please visit LDC's Using page and scroll down to the section entitled "Speech (Digital Audio) Files".
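
Because sph2pipe handles one file at a time, converting a whole corpus with it usually means wrapping it in a small script. The following is a minimal sketch of that idea in Python, not an LDC-provided tool. It assumes that an `sph2pipe` executable is on the PATH, that the speech files end in a lowercase `.sph` extension, and that `-f wav` selects WAV output as described in the sph2pipe usage text. The function names `sph2pipe_command` and `convert_tree` are hypothetical.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def sph2pipe_command(src: Path, dst: Path, exe: str = "sph2pipe") -> list[str]:
    """Build the argument list that converts one SPHERE file to WAV."""
    # Assumption: "-f wav" requests WAV output; confirm against the usage
    # text printed by your own sph2pipe build.
    return [exe, "-f", "wav", str(src), str(dst)]


def convert_tree(src_root: Path, dst_root: Path, dry_run: bool = True) -> list[list[str]]:
    """Convert every *.sph file under src_root, mirroring the directory layout.

    With dry_run=True (the default) nothing is executed or created; the
    commands that would be run are only collected and returned.
    """
    commands = []
    for src in sorted(src_root.rglob("*.sph")):
        # Keep the relative path and swap the extension for ".wav".
        dst = dst_root / src.relative_to(src_root).with_suffix(".wav")
        cmd = sph2pipe_command(src, dst)
        commands.append(cmd)
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            # check=True raises CalledProcessError if sph2pipe fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_tree(Path("corpus"), Path("wav"))` first, with the default dry run, lets you inspect the planned commands before passing `dry_run=False` to perform the conversion.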

ESPS Software - signal processing programs that can be used for the analysis, manipulation and labeling of speech.

Annotation Graph Toolkit (AGTK) - software infrastructure for linguistic annotation.

Transcriber - tool for segmenting, labeling and transcribing speech.

Champollion - sentence alignment tool for parallel text, intended to work with as many language pairs as possible.


Press Releases

15th Anniversary Monthly Spotlight Archive - as part of our 15th Anniversary celebration, we highlighted one aspect of the LDC in our monthly newsletters. These features provided our members and data users with a glimpse of the broad range of LDC’s research activities.

Use of LDC Corpora in University Summer Schools - ways LDC corpora have been used for teaching purposes at university summer school programs.

Conference Attendance by LDC - recent publisher displays by LDC.

Newly Updated LDC Papers Page - papers presented by LDC staff at LREC2008 and other conferences.

OLAC Search - search for language resources from dozens of language data centers and language archives.

Member Resources Page! The LDC has new and improved resources for members and membership info. Please check it out!




Contact ldc@ldc.upenn.edu
Last modified: Wednesday, 18-Nov-2009 16:34:56 EST
© 1992-2009 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.