July 2009 Newsletter

Friday, July 17, 2009

New Corpora

Czech Broadcast Conversation Speech

Czech Broadcast Conversation MDE Transcripts

Spanish Gigaword Second Edition

Announcements

2009 Publications Pipeline Update
Membership Year (MY) 2009 has featured a diverse selection of publications, including data created for the Unified Linguistic Annotation (ULA) project, new releases of English Gigaword and Tagged Chinese Gigaword, and syntactic structure annotations of English telephone conversations. Please consult our corpus catalog for a full list of publications distributed by LDC. Now that we have reached the halfway point of the year, we would like to let you know which publications to expect for the remainder of MY2009. Our pipeline includes the following:

  • Arabic Gigaword Fourth Edition ~ includes our recent newswire collections through 2008 as well as the contents of Arabic Gigaword Third Edition (LDC2007T40). In addition to sources found in previous releases, such as Xinhua, Agence France-Presse, An Nahar, and Al Hayat, this release includes data from several new sources, such as Al Quds, Asharq Al Awasat, and Al Ahram.
  • Arabic Treebank English Translation ~ consists of Arabic source text chosen from three Arabic Treebanks - Arabic Treebank Part 1, Part 3 and Part 4.  Source text totals about 551K Arabic words selected from three newswire sources - Agence France-Presse, An Nahar, and Assabah.  Each Arabic news story was translated once.
  • Chinese Gigaword Fourth Edition ~ includes our recent newswire collections through 2008 as well as the contents of Chinese Gigaword Third Edition (LDC2007T38). In addition to sources found in previous releases, such as Agence France-Presse, Central News Agency (Taiwan), Xinhua, and Zaobao, this release includes data from several new sources, such as People's Liberation Army Daily, Guangming Daily, and China News Service.
  • Chinese Web 5-gram Corpus Version 1 ~ contains n-grams (unigrams to five-grams) and their observed counts in 880 billion tokens of Chinese web data collected in March 2008. All text was converted to UTF-8. A simple segmenter using the same algorithm used to generate the data is included. The set contains 3.9 billion n-grams total.
  • FactBank 1.0 ~ consists of 208 news articles from TimeBank and AQUAINT TimeML in which event mentions are annotated with their degree of factuality, that is, whether they correspond to actual situations in the world, situations that have not happened, or situations of uncertain interpretation. In all, 9,500 events were identified by manual annotation.
  • French Gigaword Second Edition ~ includes our recent newswire collections through 2008 as well as the contents of French Gigaword First Edition (LDC2006T17). The sources of text data include Agence France-Presse and Associated Press French Service.

2009 Subscription Members are automatically sent all MY2009 data as it is released. 2009 Standard Members are entitled to request 16 corpora for free from MY2009. Non-members may license most data for research use.