April 2015 Newsletter

Monday, April 20, 2015

New Corpora

GALE Phase 3 and 4 Arabic Broadcast News Parallel Text

The Subglottal Resonances Database

Mandarin Chinese Phonetic Segmentation and Tone


2013 Data Pack available through September 15

Not-for-profit organizations can now create a custom data collection from among LDC’s 2013 releases. The 2013 Data Pack allows users to license eight corpora published in 2013 for a flat rate of US$3500. Selection options include Greybeard, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, Chinese Treebank 8.0, GALE Arabic and Chinese speech and text releases, 1993-2007 United Nations Parallel Text, MADCAT training data, CSC Deceptive Speech and more.  Organizations acquire perpetual rights to the corpora licensed through the pack. The Data Pack is not a membership, and organizations must request all eight data sets at the time of purchase. The 2013 Data Pack is available to not-for-profit organizations for a limited time only, through September 15.

To license the Data Pack and select eight corpora, log in or register for an LDC user account and add the 2013 Data Pack and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the data pack fee.

To pay via credit card, add the 2013 Data Pack to your bin and check out using the system prompts. At the completion of the transaction, send an email to ldc@ldc.upenn.edu indicating the eight data sets to include in your order.


LDC supports NSF data management plans

This month’s publication of The Subglottal Resonances Database is the latest in a series of releases of data developed with National Science Foundation (NSF) funding. Long before researchers were required to develop data management plans, they deposited their research data at LDC in accordance with NSF’s longstanding desire that data generated with program funds should be readily accessible at a reasonable cost. Well-known data sets in the series include The Santa Barbara Corpus of Spoken American English (multiple parts), Propbank and Grassfields Bantu Fieldwork.

NSF now requires researchers to deposit funded data in an accessible, trustworthy archive. LDC’s expertise in data curation, distribution and management and its commitment to the broad accessibility of linguistic data make it the repository of choice for NSF-funded data. Learn more about how LDC can assist in developing and implementing data management plans from the Data Management Plans section on our website or contact LDC Data Management Plans.

The Subglottal Resonances Database was developed with the support of NSF Grant No. 0905250. It is available to LDC members at no cost; non-members may license the data set for a fee of $30 plus shipping.