September 2009 Newsletter

Tuesday, September 15, 2009

New Corpora

Chinese Gigaword Fourth Edition

CSLU: S4X Release 1.2

FactBank 1.0

Announcements

LDC’s Free Resources
LDC is pleased to distribute FactBank 1.0, which is available at no cost. To license a copy of this data, non-members should complete the LDC User Agreement for Non-members and either fax it to +1 215 573 2175 or scan and email it to this address. FactBank joins a host of LDC resources that are available for free. These resources include tools and corpora developed at LDC as well as corpora made available through LDC's strong network of data providers.

Since LDC's founding, we have distributed over 1300 copies of corpora at no cost including:

  • over 700 non-member downloads of Buckwalter Arabic Morphological Analyzer 1.0
  • 400 copies of TalkBank-sponsored data including popular releases such as the American National Corpus and the Santa Barbara Corpora of Spoken American English
  • nearly 200 copies of Web 1T 5-gram Version 1, sponsored by Google Inc.
  • over 30 copies of TimeBank 1.2
  • over a dozen copies of the corpora developed for the Unified Linguistic Annotation (ULA) project

Release of XTrans
At InterSpeech 2009, LDC introduced XTrans, a new tool for manual transcription and annotation of audio recordings. XTrans is a next-generation transcription tool designed to support transcription tasks in multiple languages on multiple platforms. XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks, including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics. Its versatile and powerful waveform display/playback component can load multiple audio files of different file formats and sampling rates at the same time. LDC and its partners have used XTrans to generate over 3500 hours of time-aligned verbatim transcripts in a variety of genres and languages.

With an intuitive interface, user configurability and embedded QC functions, XTrans is optimized for high-quality, high-volume transcription tasks involving real-world data. XTrans addresses the challenges of real-world data, including transcribing multiple speakers in a single channel, through Virtual Speaker Channel, which enables an unlimited number of distinct speakers to be associated with the same audio channel. Furthermore, XTrans allows transcribers to open an effectively unlimited number of audio files for simultaneous transcription. Transcribers can switch focus between one, two or multiple speakers as needed. XTrans also provides strong multilingual support, with bidirectional text input for languages like Arabic, Farsi, Urdu, and Hebrew.

Realtime transcription rates have improved dramatically in LDC projects using XTrans, with rates for some tasks cut by as much as half.   XTrans also brings key quality control functions directly into the interface, giving transcribers the power to improve the quality of their own work.  XTrans components are written in Python and C++, utilizing LDC's QWave waveform display module. Even with very large files or multiple recordings, XTrans provides users with fast display and playback capabilities.  A range of audio formats is supported, including .sph, .wav, .aiff, .flac, and .ogg. Transcripts are output in a Tab Delimited Format (TDF), which is easily converted to other common formats and is readily usable by downstream manual and automatic annotation tasks.
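As an illustration of how a tab-delimited transcript might be converted to another common format, the following Python sketch reads tab-separated segments and writes them out as JSON. The column layout used here (file, channel, start, end, speaker, transcript), the ";;" comment convention, and the sample rows are assumptions made for the example; they are not taken from the XTrans TDF specification, so consult the XTrans documentation for the actual fields.

```python
import csv
import io
import json

# Hypothetical column layout, assumed for illustration only;
# the actual XTrans TDF columns may differ.
ASSUMED_COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]


def tdf_to_records(text, columns=ASSUMED_COLUMNS):
    """Parse tab-delimited transcript text into a list of dicts."""
    records = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and lines starting with ";;" (an assumed comment marker).
        if not row or row[0].startswith(";;"):
            continue
        rec = dict(zip(columns, row))
        # Convert the segment boundaries from text to seconds as floats.
        rec["start"] = float(rec["start"])
        rec["end"] = float(rec["end"])
        records.append(rec)
    return records


# Two made-up segments standing in for real transcript output.
sample = (
    "interview.wav\t0\t0.00\t2.35\tspeaker1\thello there\n"
    "interview.wav\t0\t2.35\t4.10\tspeaker2\thi, how are you\n"
)

records = tdf_to_records(sample)
as_json = json.dumps(records, indent=2, ensure_ascii=False)
print(as_json)
```

Because each segment is a single line of tab-separated fields, the same loop could feed other targets (for example, subtitle files or a database) by changing only the final serialization step.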

Availability:

XTrans for Linux and Windows platforms is available from the LDC at no cost under GPLv3.