
Linguistic Resources  
15th Anniversary Monthly Spotlight Archive

As part of our 15th Anniversary celebration in 2007, we highlighted one aspect of the LDC in our monthly newsletters. These features provided our members and data users with a glimpse of the broad range of the LDC’s research activities.


LDC Programmers and Software Tools -- December 20, 2007

The last feature of the year focuses on the LDC's software programmers and the tools they create.

The LDC's programming group is led by Senior Research Programmer Kazuaki Maeda. Besides being a programmer, Maeda is a linguist specializing in phonetics, phonology, and computational linguistics. The group currently has ten full-time staff, augmented as necessary by part-time programmers. LDC's programmers are adept in all major programming languages and can work across platforms; their work supports virtually every aspect of the LDC's operation. More information about LDC's programmers can be found on our staff page.

One of the programming group's principal responsibilities is to develop workflow management software and annotation and transcription tools to support projects such as GALE and LCTL. Our goal is to make tools developed for general use broadly available. One such tool is XTrans, a next-generation transcription tool designed to support transcription tasks in multiple languages on multiple platforms. Its versatile and powerful waveform display/playback component can load multiple audio files of different formats and sampling rates at the same time. The virtual channels supported by XTrans provide the most natural method for transcribing overlapping speech. A virtual channel represents an audio source identified and transcribed in a given recording, rather than a physical channel; a single-channel audio file can contain many audio sources. For instance, a round-table talk show with five speakers contains five audio sources in a single-channel recording. With XTrans, that file is modeled as a five-virtual-channel audio file, and each virtual channel is transcribed independently. Additionally, if a recording consists of audio files with different sampling rates, XTrans will automatically resample them to a common rate. The LDC has used XTrans on many varied projects, and the tool has proven quick to learn and easy to master. We are currently working through licensing issues with the organizations that provided libraries for XTrans; once those issues are resolved, we will make XTrans generally available.
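The virtual-channel idea described above can be sketched in a few lines of code. This is a minimal illustration of the concept only, not XTrans's actual data model: all class and field names here are hypothetical.

```python
# Sketch of the virtual-channel idea: a single-channel recording with
# several speakers is modeled as one virtual channel per audio source,
# and each channel is transcribed independently. Names are illustrative;
# this is not XTrans's internal representation.
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float   # seconds into the recording
    end: float
    text: str

@dataclass
class VirtualChannel:
    speaker: str
    segments: list = field(default_factory=list)

@dataclass
class Transcript:
    audio_file: str
    channels: list = field(default_factory=list)

    def channel(self, speaker):
        """Return the virtual channel for a speaker, creating it if new."""
        for ch in self.channels:
            if ch.speaker == speaker:
                return ch
        ch = VirtualChannel(speaker)
        self.channels.append(ch)
        return ch

# A talk show with two overlapping speakers in one physical channel:
t = Transcript("talkshow.wav")
t.channel("host").segments.append(Segment(0.0, 3.2, "Welcome back."))
t.channel("guest").segments.append(Segment(2.8, 5.0, "Thanks for having me."))
# The overlap (2.8-3.2s) is unproblematic: each source has its own channel.
```

Because each source lives on its own channel, overlapping speech never forces the transcriber to interleave two speakers' words in one stream.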

Two other general-use tools developed by the LDC -- the Annotation Graph Toolkit (AGTK) and the Champollion Tool Kit (CTK) -- are available on SourceForge.net. Like XTrans, these tools represent creative solutions to difficult problems:

  • The Annotation Graph Toolkit (AGTK) is a primary resource for annotation tool development at LDC. AGTK is a suite of software components for building tools for annotating linguistic signals: time-series data that documents any kind of linguistic behavior (e.g., audio, video). Unlike the traditional approach of designing and implementing data structures and user interfaces for new tasks from scratch, AGTK allows developers to quickly prototype tools and define data formats. The flexible nature of the AG model means that data representations can be rapidly modified in response to evolving annotation task definitions. AGTK allows for rapid deployment of highly specialized, task-specific tools that maximize user interface ergonomics and improve the speed and accuracy of annotation.
  • The Champollion Tool Kit (CTK) was developed to align parallel text from distant language pairs containing a significant amount of noise. To achieve high precision and recall on manually aligned text, CTK assumes noisy input: a sizable percentage of alignments will not be one-to-one, and the number of deletions and insertions will be significant. Furthermore, CTK differs from other lexicon-based approaches in assigning greater weight to less frequent translation pairs. CTK was first evaluated on Chinese-English parallel text but is designed to be usable on as many language pairs as possible.
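The annotation-graph model underlying AGTK can be sketched simply: labeled arcs span time-anchored nodes, so many annotation layers (words, part-of-speech tags, utterances) can share one signal timeline. This is a simplified illustration of the idea, not AGTK's actual API; the class and method names are hypothetical.

```python
# Minimal sketch of an annotation graph: anchors carry time offsets into
# the signal, and arcs between anchors carry typed, labeled annotations.
# Multiple annotation layers coexist over the same timeline.

class AnnotationGraph:
    def __init__(self):
        self.anchors = {}      # anchor id -> time offset (seconds)
        self.arcs = []         # (start_anchor, end_anchor, type, label)

    def add_anchor(self, aid, offset):
        self.anchors[aid] = offset

    def add_arc(self, start, end, ann_type, label):
        self.arcs.append((start, end, ann_type, label))

    def annotations(self, ann_type):
        """All arcs of one type, ordered by start time."""
        hits = [a for a in self.arcs if a[2] == ann_type]
        return sorted(hits, key=lambda a: self.anchors[a[0]])

g = AnnotationGraph()
for aid, t in [(0, 0.0), (1, 0.4), (2, 0.9)]:
    g.add_anchor(aid, t)
g.add_arc(0, 1, "word", "hello")
g.add_arc(1, 2, "word", "world")
g.add_arc(0, 2, "utterance", "greeting")   # a second layer over the same anchors
```

Because new annotation types are just new arc labels over existing anchors, a tool built on this model can absorb a revised task definition without redesigning its data structures, which is the flexibility the bullet above describes.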

XTrans, AGTK and CTK are representative of the work by LDC’s programmers, making it possible for us to support projects of increasing complexity and to distribute a growing variety of linguistic resources. The LDC Catalog contains several publications which were created using software tools developed by LDC's programming group. These include ACE data, Arabic Treebank publications, and NIST Rich Transcription corpora.

[ top ]

Arabic Treebanking -- October 17, 2007

This month's feature focuses on Arabic Treebanking, an annotation project managed by Mohamed Maamouri, Senior Research Administrator at the LDC. Very briefly, the annotation process involves two phases:

  • Arabic Part-of-Speech (POS) tagging - which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, gloss, and vocalization.
  • Arabic Treebanking (ArabicTB) - which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.
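The two phases above can be illustrated with a toy example, given in English for readability. The tags follow Penn Treebank conventions; the actual Arabic guidelines carry much richer morphological information (inflectional features, gloss, vocalization).

```python
# Phase 1 (POS tagging): each lexical token gets a category tag.
pos_tagged = [("the", "DT"), ("committee", "NN"), ("approved", "VBD"),
              ("the", "DT"), ("report", "NN")]

# Phase 2 (treebanking): tokens are wrapped in labeled constituent
# structure, here modeled as nested (label, children...) tuples.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "committee")),
        ("VP", ("VBD", "approved"),
               ("NP", ("DT", "the"), ("NN", "report"))))

def terminals(node):
    """Recover the token sequence from a tree's leaves."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]           # preterminal: (tag, token)
    return [tok for child in children for tok in terminals(child)]
```

The treebank layer adds exactly what POS tagging cannot express: which tokens group into phrases, and what category each non-terminal node carries.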

Maamouri's group is concluding a major revision of the annotation guidelines, termed Arabic Treebank II. Revisions include significant enhancements of the morphological (POS) and syntactic (ArabicTB) annotation. These revisions are intended to make treebanked data more useful both to the Arabic-speaking world and to the natural language processing community. The POS changes introduce finer distinctions between classes of words, quantifiers, and numbers, while the ArabicTB changes include richer structure within noun phrases.

During the revision process, the Arabic Treebank group was joined by visiting scholar Sondos Krouna, an associate professor of Arabic linguistics at the Tunis Higher Institute of Languages (ISLT) at the University of Tunis-Carthage. Sondos spent a year at the LDC and was instrumental in authoring the Arabic Treebank II guidelines. Basma Bouziri, a graduate student and teaching assistant at ISLT, recently arrived at the LDC, where she'll serve as a junior visiting scholar. While here, Basma will focus on learning Arabic Treebank II annotation and methodology. Annotation using the Arabic Treebank II guidelines is expected to commence soon; look for an Arabic Treebank II release from the LDC at a later date.

[ top ]

Biomedical Information Extraction -- September 17, 2007

This month's feature focuses on Biomedical Information Extraction (BioIE), a five-year project which has recently come to a close under the management of Mark Mandel, Research Administrator. This project was hosted by the LDC and by Penn's Institute for Research in Cognitive Science and was funded by a grant from the National Science Foundation. The goal of BioIE was to develop qualitatively better methods for automatically extracting information from biomedical literature.

BioIE focused on two domains, the inhibition of the cytochrome P450 family of enzymes (CYP450) and the molecular genetics of cancer (oncology). All texts used were publicly available abstracts from PubMed®. Annotators applied all levels of annotation to the abstracts, from paragraph, sentence, token, and part of speech to biomedical entity. A subset of the abstracts was also syntactically annotated with Treebank-style tagging. All annotation, except for entity tagging, was done by trained automatic taggers and then manually corrected. Throughout the project, Mandel's group collaborated with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline for CYP450 annotation and with the eGenome group at the Children's Hospital of Philadelphia (CHOP) for oncology annotation.

Due to the nature of the texts used, BioIE presented unique challenges for the annotators. The project required that annotators be familiar with both treebanking and biomedical terminology. Annotators learned the terminology from domain experts who defined the annotation tasks and developed guidelines. Annotators and domain experts engaged in a continual interactive process to define, apply, and redefine the guidelines. Furthermore, since the texts were paper abstracts and not full texts, a number of similar entities or events were described in a single collapsed form, such as "pre- and post-operative complications". Annotators devised a method for annotating such discontinuous entity references, called "chains," making the connection explicit in the annotation by building a chain of strings and applying a single tag to them.
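The chain device can be sketched as follows. This is an illustration of the idea, not the project's actual annotation format: the span offsets, tag name, and helper function are all hypothetical.

```python
# Sketch of "chain" annotation for a discontinuous mention: the entity
# "pre-operative complications" is split across the collapsed phrase
# "pre- and post-operative complications", so it is annotated as a chain
# of character spans carrying a single tag.
text = "patients with pre- and post-operative complications"

def chain_text(text, spans):
    """Join the chained spans into the entity string they jointly denote."""
    return "".join(text[s:e] for s, e in spans)

# One entity, two discontinuous pieces: "pre-" + "operative complications"
chain = {"tag": "clinical-outcome",
         "spans": [(14, 18), (28, 51)]}
```

A second chain over the contiguous span "post-operative complications" would capture the other collapsed entity, so both readings of the phrase are recoverable from the annotation.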

BioIE can be used for the domain-specific task of building a database of compounds that inhibit the various CYP450 enzymes. Such a database may be useful in predicting and preventing drug interactions. The data has wider applications for researchers interested in the general task of information extraction. Further information about this project is available on the BioIE website, Mining the Bibliome. Look for the LDC release of BioIE in the coming months.

[ top ]

Translation -- August 17, 2007

This month's feature focuses on translation, one of the many projects handled by our annotation group. This group is led by Stephanie Strassel, Associate Director, Annotation Research and Program Coordination, whose role in translation involves coordinating the development of translation resources for sponsored programs. Strassel's team includes Lauren Friedman, External Data Coordinator, who handles outsourcing to translation agencies and supervises data flow and quality control.

The LDC works with translation agencies to produce high-quality translations in a dozen languages, including Chinese, Arabic, and the Less Commonly Taught Languages (LCTLs). While most translation work in the past has focused on newswire text, the LDC now handles the translation of a variety of genres, including web logs. For these genres, the LDC developed best practices for treating the unique characteristics of informal written language, such as slang, incomplete sentences, and misspellings. The LDC performs preliminary work on the data before it is sent to translation agencies: the data is manually selected to ensure that it is of high quality, and the format is then standardized so that translators are presented with one sentence at a time. After the translated data is received, the LDC relies on local expertise to perform quality assessment and additional annotation of the data, such as word-by-word translations of idiomatic expressions. To support translation activities, LDC's programmers have developed infrastructure to handle large volumes of data, including customized software to track where the data is in the translation process.

A host of projects have benefited from LDC's translation efforts, including GALE, REFLEX-Entity Translation, MT08, LCTL, and previously, TIDES. Most translation is between English and the source language, although the LDC has experimented with other language pairs. For instance, for REFLEX, the LDC managed the translation of text from Chinese to Arabic and from Arabic to Chinese; this proved challenging because it was difficult to locate translators proficient in both languages and to perform quality control.

The end product of the LDC's translation work can be found in such publications as the recently released GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (LDC2007T24). Please look for the regular addition of translated datasets to our catalog.

[ top ]

Less Commonly Taught Languages (LCTL) Project -- July 17, 2007

This month, the focus is on the Less Commonly Taught Languages (LCTL) project. LCTL is winding down to a successful close under the management of Christopher Walker, LDC's Project Manager, Information Extraction. The goal of this project is to create and share resources that support basic research and initial technology development in LCTLs: languages with over a million speakers which are not often taught to non-native speakers and which lack developed resources for natural language processing. The target languages for the first phase of the project are Bengali, Berber (Amazigh), Hungarian, Kurdish, Punjabi, Pashto, Tagalog, Tamil, Thai, Tigrinya, Urdu, Uzbek, and Yoruba.

Walker’s team developed a plan that operates on two levels: (1) gathering raw data, processing it, and developing resources for each language; and (2) implementing various parts of the plan simultaneously for multiple languages. For each language, the team follows the same basic steps. After preliminary research to ascertain that the target LCTL has an adequate web presence, resource gathering begins with a day-long 'web harvest festival'. During the festival, LDC employees, along with a native speaker of the target language, scout the web for monolingual and parallel texts, dictionaries, and word lists. Afterwards, any necessary distribution rights for large resources are secured, and further processing is done on the collected material. The end result is a 'language pack' containing the data and tools a computational linguist needs to conduct machine translation work in any of the LCTLs. Language packs for each LCTL contain resources such as monolingual and parallel texts, a translation dictionary, and a grammatical sketch of the language, plus tools including encoding converters, word and sentence segmenters, morphological analyzers, and POS and named-entity taggers.

During the course of the project, the LDC has collaborated with the Budapest Institute of Technology in Hungary for resource development in Hungarian, Kurdish, and Uzbek, and with the Institut Royal de la Culture Amazighe (IRCAM) in Rabat, Morocco for resource development in Amazigh. Although no LCTL language packs are scheduled for general publication at this time, we anticipate adding these exciting resources to our catalog in the coming years.

[ top ]

Data Collection and Storage -- June 20, 2007

The corpora in our catalog are drawn from a diverse range of source data collected by the LDC: from broadcast news, to telephone conversations, to newswire text. This month's feature focuses on the technical details of collecting and storing large volumes of data.

The LDC's current data collection infrastructure includes multiple collection systems, which partially overlap in function to provide broad, redundant coverage. The LDC maintains several satellite dishes which collect broadcast transmissions, including video, audio, and closed captioning where possible. Speech recognition software for English, Chinese, and Arabic provides automatic audio indexing. Broadcast data is routinely migrated from the collection system to network storage for permanent archiving. Broadcast collections are manually audited to confirm that the data intended to be captured has been collected.

Due to its substantial volume, broadcast data accounts for the bulk of our collection and storage efforts. Additionally, the LDC maintains several systems to collect both telephone speech and newswire text. Each of our telephone collection platforms is connected to a dedicated server via T1 circuits. All systems are capable of storing thousands of hours of recorded telephone conversations, and each can process up to 24 simultaneously active lines. After collection, telephone speech is archived both locally and on the network server. Newswire text is collected through a host of delivery methods, including dedicated internet clients, modem feeds, web harvests, email transmissions, and FTP deliveries. Upon delivery, all text is normalized with SGML markup; both the raw and processed texts are sent to the network server for storage.
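The normalization step can be sketched as follows. The element names echo common LDC newswire conventions (DOC, HEADLINE, TEXT, P), but this is an illustration only, not the exact markup specification of any particular corpus; the function name and document id are hypothetical.

```python
# Hedged sketch of newswire normalization: a raw story is wrapped in
# simple SGML-style markup so that every delivery method yields text in
# one uniform format for storage and later processing.
def to_sgml(doc_id, headline, paragraphs):
    parts = [f'<DOC id="{doc_id}">',
             f"<HEADLINE>{headline}</HEADLINE>",
             "<TEXT>"]
    for p in paragraphs:
        parts.append(f"<P>{p}</P>")   # one element per paragraph
    parts += ["</TEXT>", "</DOC>"]
    return "\n".join(parts)

doc = to_sgml("APW_ENG_19980101.0001",   # illustrative document id
              "Example headline",
              ["First paragraph.", "Second paragraph."])
```

Keeping both the raw feed and the normalized version, as the paragraph above notes, means markup conventions can be revised later without re-collecting the source data.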

The LDC currently supports 58 terabytes of research data, a set that is growing at the rate of approximately 1 terabyte per month. Historically, most LDC data has been hosted on commodity grade hardware running on free open source software. Last year, the LDC recognized that an upgrade was crucial and introduced branded hardware and proprietary software to the computing infrastructure to better protect our data.

Specifically, the LDC deployed a network attached storage (NAS) solution from Sun Microsystems. NAS is a dedicated, optimized file server, and NAS technology has found wide acceptance in corporate data centers in support of mission-critical applications. Behind the NAS "head" are dual-redundant storage controllers with dual-redundant fibre-channel paths to disk arrays protected by RAID Level 5 redundancy. Other features include dual-redundant power supplies and fans, continuous monitoring of all subsystems, and component fault prediction. This more robust storage solution ensures that all data the LDC has collected is securely stored. Data associated with current projects has already been migrated to NAS. As time and funding permit, data from finished projects will be migrated, and the commodity-grade file servers will eventually be a thing of the past.
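The RAID Level 5 redundancy mentioned above rests on a simple idea that can be demonstrated in a few lines: a parity block is the XOR of the data blocks in a stripe, so the contents of any single failed disk can be reconstructed from the surviving disks. This is a conceptual sketch only; real RAID 5 also rotates parity placement across disks, which is omitted here.

```python
# XOR parity, the core of RAID 5 redundancy: parity = b0 ^ b1 ^ ... ^ bn,
# so any one missing block equals the XOR of the remaining blocks + parity.
def parity(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# A stripe of three data blocks across three disks, plus a parity disk:
stripe = [b"\x01\x02", b"\x10\x20", b"\xa0\x0b"]
p = parity(stripe)

# Disk 1 fails; rebuild its block from the survivors and the parity block:
rebuilt = parity([stripe[0], stripe[2], p])
```

The reconstruction works because XOR is its own inverse: XORing the parity with the surviving blocks cancels their contributions, leaving exactly the lost block.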

For further information about the LDC's data collection and storage capabilities, please visit our updated LDC Facilities page.

[ top ]



Human Subjects Collection -- May 17, 2007

First up for examination is the Human Subjects Collections group, whose work involves compiling conversational speech. Over the past ten years, this team has been responsible for supplying the speech recognition and language identification communities with resources such as CallFriend, Switchboard, and Fisher.

Human Subjects Collections is currently managed by Linda Corson; its primary project, Mixer, involves speech collection in 30 target languages. Past speech collection efforts at the LDC have focused on telephone conversations collected via an automated call platform. The Mixer studies are unique in that some also include 'in-studio' interviews conducted either at the LDC or at the International Computer Science Institute at the University of California, Berkeley. These scripted sociolinguistic interviews are intended to elicit rich conversational speech and to complement speech data recorded over the telephone. After speech data is collected, native speakers audit the conversations for adequate speech content and ambient noise, and provide perceptual judgments on the nativeness of the speakers.

Mixer and other speech studies are ongoing and accepting new participants; registration information is provided below. Be sure to be on the lookout for the release of some of our latest human subject collections work later this year. Portions of Mixer Phases 1 and 2, which include cross-channel sessions, are currently in our publications pipeline for Membership Year (MY) 2007.

[ top ]


About LDC | Members | Catalog | Projects | Papers | LDC Online | Search / Help | Contact Us | UPenn | Home | Obtaining Data | Creating Data | Using Data | Providing Data



Contact ldc@ldc.upenn.edu
Last modified: Monday, 25-Jul-2011 15:23:22 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.