February 2010 Newsletter

Monday, February 22, 2010

New Corpora

Fisher Spanish Speech

Fisher Spanish - Transcripts

Announcements

65,000th LDC Corpus Distributed
LDC has recently reached another milestone.  Two years after having distributed our 50,000th corpus, we have just distributed our 65,000th!  To help us celebrate, we took the names of all the organizations that had licensed data on the day we distributed our 65,000th corpus and tossed them into a Phillies baseball cap.

We then randomly drew a name, and the winner is ... Swarthmore College and Universidad Carlos III de Madrid!  That's not a typo: we have two lucky winners!  We are celebrating our 65,000th distribution by awarding a benefit of US$2000 to each of them. The benefit can be used towards membership or data licensing fees at any time this year.

Swarthmore College and Universidad Carlos III de Madrid join our other recipients of landmark corpora distributions:

        Helsinki University of Technology, Adaptive Informatics Research Centre (AIRC) - licensed our 50,000th distribution in January 2008.
        Instituto de Engenharia de Sistemas e Computadores (INESC) - licensed our 40,000th distribution in November 2006.
        University of Hawai'i, Manoa, Language Analysis and Experimentation Laboratories - licensed our 15,000th distribution in April 2002.

We would like to thank both members and non-members for helping the LDC reach this landmark distribution. The unceasing demand for LDC data from over 2800 organizations supports our mission to develop and share resources for research in human language technologies.

About our winners:

Swarthmore College ~ The Department of Computer Science offers courses that emphasize the fundamental concepts of computer science, treating today's languages and systems as current examples of the underlying concepts. By educating students to think conceptually, we are preparing them to adapt to developments in this dynamic field.

Universidad Carlos III de Madrid ~ The Multimedia Processing Group aims to make a significant research contribution to the field of multimedia processing, especially focusing on combining signal analysis tools with emerging machine learning methods. Projects include automatic multimedia indexing, automatic speech recognition, and last-generation video coding.

Membership Year 2010 Discounts Still Available
If you are considering joining for Membership Year 2010 (MY2010), take note that there is still time to save on membership fees.   Any organization which joins or renews membership for 2010 prior to Monday, March 1, 2010, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2009 can receive a 10% discount on fees, provided they renew prior to March 1, 2010.  For further information on pricing, please consult our Announcements page or contact LDC.  Information on our planned releases for MY2010 is provided below.

2010 Publications Pipeline
For Membership Year 2010 (MY2010), we anticipate releasing a varied selection of publications. Many publications are still in development, but here is a glimpse of what is in the pipeline for MY2010.  Please note that this list is tentative and subject to modifications.  Our planned publications for the coming months include:

  • Arabic Treebank: Part 3 v 3.2 ~ a revision of Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) (LDC2005T20). The full Arabic Treebank: Part 3 has been revised according to the new Arabic Treebank annotation guidelines. The Arabic Treebank project consists of two distinct phases: (a) Part-of-Speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token, such as lexical category, inflectional features, and a gloss, and (b) Arabic Treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc. Arabic Treebank: Part 3 v 3.2 consists of 599 newswire stories from An Nahar.
  • Chinese Treebank 7.0 ~ this release encompasses 2,400 text files, containing 45,000 sentences, 1.1 million words and 1.65 million hanzi (Chinese characters). The data is provided in two encodings, GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets.
  • Chinese Web 5-gram Version 1 ~ contains n-grams (unigrams to five-grams) and their observed counts in 880 billion tokens of Chinese web data collected in March 2008. All text was converted to UTF-8. A simple segmenter using the same algorithm used to generate the data is included. The set contains 3.9 billion n-grams total.
  • NPS Chat Corpus Version 1.0 ~ consists of 10,567 posts gathered from age-specific chat rooms. Each file is a recording transcript from one of these chat rooms for a short period on a particular day.   In order to comply with the chat services' terms of service, the posts have been privacy-masked.   Each post is annotated with a chat dialog-act tag, and individual tokens within each post are annotated with part-of-speech tags.
  • WTIMIT ~ is a mobile wideband (i.e., 50 Hz – 7 kHz) telephone adjunct to TIMIT (LDC93S1).  WTIMIT was derived as follows: the original TIMIT speech files, sampled at 16 kHz, were concatenated into 11 signal chunks, each preceded by a 4-second calibration tone. These speech chunks were transmitted via two prepared Nokia 6220 mobile phones over T-Mobile’s 3G wideband mobile network in The Hague, The Netherlands, employing the Adaptive Multirate Wideband (AMR-WB) speech codec. After data acquisition, the recordings were deconcatenated by maximizing the normalized cross-correlation with the original speech files, yielding a database that is time-aligned with the original TIMIT data with good precision. Accordingly, all TIMIT label files can still be used.  WTIMIT is suitable for research on speech quality and intelligibility, and for investigations of possible wideband upgrades of network-side IVR systems with retrained or bandwidth-extended acoustic models for automatic speech recognition.  WTIMIT will be presented at LREC 2010.
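
For readers curious about the alignment step mentioned in the WTIMIT description, the following is a minimal sketch of locating a known signal inside a longer recording by maximizing the normalized cross-correlation. It is an illustration only and not the procedure used to build WTIMIT: the function name best_offset, its parameters, and the brute-force search over lags are all assumptions made for this example.

```python
import math


def best_offset(reference, received, max_lag):
    """Find where `reference` best matches inside `received`.

    Hypothetical helper for illustration: tries every lag from 0 to
    `max_lag` and returns (lag, score), where score is the normalized
    cross-correlation between `reference` and the window of `received`
    starting at that lag.
    """
    n = len(reference)
    ref_norm = math.sqrt(sum(x * x for x in reference))
    best_lag, best_score = 0, float("-inf")
    last_lag = min(max_lag, len(received) - n)
    for lag in range(last_lag + 1):
        window = received[lag:lag + n]
        win_norm = math.sqrt(sum(x * x for x in window))
        if ref_norm == 0.0 or win_norm == 0.0:
            continue  # a silent window cannot be normalized; skip it
        dot = sum(a * b for a, b in zip(reference, window))
        score = dot / (ref_norm * win_norm)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Toy data: the reference reappears at half amplitude, 7 samples in.
reference = [1.0, -2.0, 3.0, -1.0]
received = [0.0] * 7 + [0.5 * x for x in reference] + [0.0] * 3
lag, score = best_offset(reference, received, max_lag=10)
```

Because the score is normalized by the energy of both signals, a change in overall level (as might occur over a telephone channel) does not shift the best lag. A practical implementation on full-length speech files would likely use an FFT-based correlation rather than this direct search.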

2010 Subscription Members are automatically sent all MY2010 data as it is released.  2010 Standard Members are entitled to request 16 corpora for free from MY2010.  Non-members may license most data for research use only.