February 2009 Newsletter

Monday, February 16, 2009

New Corpora

Audiovisual Database of Spoken American English

GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 

Announcements

LDC's Corpus Catalog Receives Top OLAC Rating
LDC is pleased to announce that the LDC Corpus Catalog has been awarded a five-star quality rating, the highest rating available, by the Open Language Archives Community (OLAC). OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. LDC supports OLAC and is among the 37 participating archives that have contributed over 36,000 records to the combined catalog of language resources. OLAC seeks to refine the quality of the metadata in catalog records in order to improve the quality of the searches that users can run over that catalog. Describing resources according to the best practice guidelines established by OLAC increases the likelihood that all the resources returned by a query are relevant (precision) and that all relevant resources are returned (recall).
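For readers less familiar with the two search-quality measures mentioned above, the following is a minimal sketch of how precision and recall could be computed for a single catalog query. The function name and the corpus identifiers are hypothetical and chosen purely for illustration; this is not how OLAC scores archives.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: identifiers the search actually returned
    relevant: identifiers that truly match the user's need
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant items that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
returned = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]
relevant = ["corpus_a", "corpus_b", "corpus_e"]

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four returned items are relevant, and two of the three relevant items were returned. Cleaner metadata tends to raise both figures, since fewer mislabeled records match a query by accident and fewer correctly matching records are missed.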

Metadata in several fields of the LDC catalog was missing, inaccurate, or non-compliant with OLAC standards. Over a period of a few months, a team at LDC took several steps to make that metadata OLAC-compliant. Most significantly, the language name and the language ID for over 400 corpora were reviewed and, where required, changed to conform to the new standard for language identification, ISO 639-3. Additional efforts focused on providing author information for all corpora and fixing dead links. Finally, the team added a new metadata field to consistently document the "type" of each resource, using a standard vocabulary from the digital libraries community called DCMI-Type, which reliably distinguishes text and sound resources. These revisions improve LDC's management of resources in the catalog and help LDC users quickly identify all corpora that are relevant to their research.

Membership Year 2009 Discounts Still Available!
If you are considering joining for Membership Year 2009 (MY2009), take note that there is still time to save on membership fees. Any organization that joins or renews membership for 2009 prior to Monday, March 2, 2009, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2008 can receive a 10% discount on fees, provided they renew prior to March 2, 2009. For further information on pricing, please consult our Announcements page or contact LDC. Information on our planned releases for MY2009 is provided below.

2009 Publications Pipeline
For Membership Year 2009 (MY2009), we anticipate releasing a varied selection of publications. Many publications are still in development, but here is a glimpse of what is in the pipeline for MY2009.  Please note that this list is tentative and subject to modifications.  Our planned publications include:

  • Arabic Gigaword Fourth Edition ~ this edition includes our recent newswire collections as well as the contents of Arabic Gigaword Third Edition (LDC2007T40). In addition to sources found in previous releases, such as Xinhua, Agence France Presse, An Nahar, and Al Hayat, this release includes data from several new sources, such as Al Quds, Asharq Al Awsat, and Al Ahram.
  • Chinese Gigaword Fourth Edition ~ this edition includes our recent newswire collections as well as the contents of the Chinese Gigaword Third Edition (LDC2007T38). In addition to sources found in previous releases, such as Agence France Presse, Central News Agency (Taiwan), Xinhua, and Zaobao, this release includes data from several new sources, such as People's Liberation Army Daily, Guangming Daily, and China News Service.
  • Chinese Web 5-gram Corpus Version 1 ~ contains n-grams (unigrams to five-grams) and their observed counts in 880 billion tokens of Chinese web data collected in March 2008. All text was converted to UTF-8. A simple segmenter using the same algorithm used to generate the data is included. The set contains 3.9 billion n-grams total.
  • CoNLL 2008 Shared Task Corpus ~ includes syntactic and semantic dependencies for Treebank-3 (LDC99T42) data. This corpus was developed for the 2008 shared task of the Conference on Computational Natural Language Learning (CoNLL 2008). The syntactic information was created by converting constituent trees from Treebank-3 to dependencies using a set of head percolation rules and a series of other transformations; for example, named entity boundaries are included from the BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33). The semantic dependencies were created by converting semantic propositions to a dependency representation. The corpus includes propositions centered around both verbal predicates - from Proposition Bank I (LDC2004T14) - and nominal predicates - from NomBank 1.0 (LDC2008T24).
  • English Gigaword Fourth Edition ~ this edition includes our recent collections as well as the contents of the English Gigaword Third Edition (LDC2007T07). The sources of text data include Agence France Presse, Associated Press, Central News Agency (Taiwan), NY Times, Xinhua, and Salon.com.
  • GALE Phase 1 Arabic Newsgroup Parallel Text Part 2 ~ 145K words (263 files) of Arabic newsgroup text and its English translation selected from thirty sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 of the DARPA-funded GALE program.
  • GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2 ~ a total of 24 hours of Chinese broadcast conversation was selected from three sources: China Central TV (CCTV), Phoenix TV, and Voice of America. This release was used as training data in Phase 1 of the DARPA-funded GALE program.
  • GALE Phase 1 Chinese Newsgroup Parallel Text Part 1 ~ 240K characters (112 files) of Chinese newsgroup text and its English translation selected from twenty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 of the DARPA-funded GALE program.
  • Japanese Web N-gram Corpus Version 1 ~ contains n-grams (unigrams to seven-grams) and their observed counts in 250 billion tokens of Japanese web data collected in July 2007. All text was converted to UTF-8 and segmented using the publicly available segmenter MeCab. The set contains 3.2 billion n-grams total.
  • NIST MetricsMATR08 Development Data ~ contains sample data extracted from the NIST Open Machine Translation (MT) 2006 evaluation.  Data includes the English machine translations from 8 systems and the human reference translations for 25 Arabic source language newswire documents, along with corresponding human assessments of adequacy and preference.  This data set was originally provided to NIST MetricsMATR08 participants for the purpose of MT metric development.
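For readers new to n-gram data of the kind found in the Chinese and Japanese web corpora listed above, the following is a minimal sketch of how n-grams and their observed counts can be tallied from already segmented text. It is an illustration only and not the procedure used to build those corpora; the function name and the sample sentence are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical pre-segmented input, for illustration only.
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_n=2)

print(counts[("the",)])        # unigram count for "the"
print(counts[("the", "cat")])  # bigram count for "the cat"
```

For Chinese and Japanese, the segmentation step that produces `tokens` matters a great deal, which is why the corpus descriptions above name the segmenter used.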

2009 Subscription Members are automatically sent all MY2009 data as it is released.  2009 Standard Members are entitled to request 16 corpora for free from MY2009.   Non-members may license most data for research use.