LDC Institute

The LDC Institute is a seminar series on issues broadly related to linguistics, computer science, natural language processing and human language technology development. Featured speakers include researchers from LDC, the Penn community and distinguished scholars from around the globe.

LDC Institute Archive

2023 

Repetition and Information Flow in Music and Language

David Temperley, Eastman School of Music

December 1, 2023; 11:00-12:30pm

The theory of Uniform Information Density makes two predictions regarding the use of repetition: (1) when a pattern is repeated, variable aspects of the pattern will be less probable in the second instance than in the first; (2) less probable patterns will have a higher tendency to be used repetitively. In this talk, Temperley presents recent research testing these two predictions with regard to language and music. Regarding prediction (1): in syntactically matched coordinate constructions (e.g. the big dog and the small cat), the second coordinate tends to have less frequent words than the first; in music, when a melodic pattern is immediately repeated in an altered form, the alterations tend to lower the schematic probability of the pattern (e.g. by increasing the interval size). Regarding prediction (2): in music, unusual melodic devices (such as escape tones and anticipations) tend to be used repetitively; in language, rare syntactic constructions show a higher tendency than common ones to be used repetitively (in coordinate constructions and elsewhere). Intriguingly, the syntactic constructions that are most often used repetitively tend to be associated with persuasive rather than informative discourse, implying an emotional commitment on the part of the speaker (such as the construction Det Adj, e.g., the rich); this suggests a further connection with music.
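
Both predictions are stated in terms of probability, i.e. surprisal. As a rough, hypothetical illustration (not Temperley's code), one can estimate word surprisal from unigram counts and compare the two coordinates of a construction; prediction (1) expects the second coordinate to average higher surprisal than the first:

    # Toy sketch: unigram surprisal of coordinate heads. Real studies use
    # large corpora and parsed coordinate constructions (assumed here).
    import math
    from collections import Counter

    tokens = "the big dog and the small cat saw the dog by the door".split()
    counts = Counter(tokens)
    total = sum(counts.values())

    def surprisal(word):
        # -log2 relative frequency: rarer words carry more information.
        return -math.log2(counts[word] / total)

    for first, second in [("dog", "cat")]:  # hypothetical extracted pair
        print(f"{first}: {surprisal(first):.2f} bits, "
              f"{second}: {surprisal(second):.2f} bits")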

A New Database, Family Tree and Origins Hypothesis for the Indo-European Language Family

Dr. Paul Heggarty, Pontificia Universidad Católica del Perú, Lima

September 6, 2023; 10:00-11:30am

A recent article in Science presents a new language database and family tree analysis of the Indo-European languages, and a new hypothesis on their origins and expansion. Indo-European is dated to some 8100 years ago, as a central estimate of when it began to spread and diverge. This date, and the family tree structure, fit with neither the Steppe nor the farming hypothesis for Indo-European origins. Instead, separate aspects of each combine into a new ‘hybrid’ hypothesis: Indo-European did not originate on the Steppe, but in the northern arc of the Fertile Crescent, and only some of its main branches in Europe came through the Steppe, as a secondary staging-post.

This talk sets out all aspects of this wide-ranging, cross-disciplinary research. It covers key issues in Indo-European studies; in cognacy databases; in methodology for Bayesian phylogenetic analysis of language families; and in how the ancient DNA record fits with a hybrid hypothesis of Indo-European origins.

A Conversation with Roberto Pieraccini and Mark Liberman 

March 3, 2023; 11:30-1:00pm

Both Roberto Pieraccini and Mark Liberman started their careers more than 30 years ago at AT&T Bell Laboratories, one of the most respected US research institutions, and have continued to pursue the technology and science of human language ever since. This is an opportunity for them to share ideas about their work and experiences and to provide a critical view of the evolution of their respective fields, and of how that evolution helped shape the current technological and scientific outlook. The session is structured as a fireside chat in which the speakers elaborate on a number of questions posed by a moderator and conclude with open questions from the audience.

Roberto’s career highlights include positions in research (CSELT, IBM Research) and industry (SpeechWorks International, SpeechCycle, JIBO, Google) focusing on statistical natural language understanding and reinforcement learning for automated dialogue systems. Mark left Bell Labs as the Head of the Linguistics Research Department to join the University of Pennsylvania with appointments as Professor in the Department of Linguistics and the Department of Computer and Information Science. He is also LDC’s founder and Director. His research interests include, among others, corpus-based phonetics; speech and language technology; clinical applications of linguistic analysis; and the phonology and phonetics of lexical tone and its relationship to intonation.  

2022

What is a linguistic variety? Linguistic coherence, variation and nominalisation processes in Cockney

Amanda Cole, University of Essex

November 30, 2022; 12:00-1:30pm

This talk approaches the deceptively complex question: what is a linguistic variety? In research on human language there is often much variability in how a given linguistic variety is defined or identified, in terms of speakers' demographic and social characteristics and/or its constituent linguistic features. This creates potential problems for research reproducibility and verifiability. In this talk, Amanda Cole considers a linguistic variety as being, firstly, spoken by a group of speakers with some shared characteristics and, secondly, linguistically coherent. This latter point means that a linguistic variety is not defined by any single linguistic feature; instead, it includes many linguistic features which co-vary. Linguistic coherence does not require all speakers of the same variety to speak identically. Instead, a linguistic variety occupies a region on a continuum: it shows internal variation around a centre of gravity but remains sufficiently distinct to demarcate it from other varieties. Cole takes the case study of Cockney, an urban variety of southern British English. She presents data and results on linguistic coherence, linguistic variation and change, and linguistic boundary-marking and nominalisation processes in Cockney and related varieties to probe the linguistic and social nature of linguistic varieties.

Available: Slides in PDF

French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English

Karën Fort, Associate Professor, Sorbonne Université

April 21, 2022; 12:00-1:30pm

Much work on biases in natural language processing has addressed biases linked to the social and cultural experience of English-speaking individuals in the United States. Karën Fort presents work aimed at widening the scope of bias studies by creating material to measure social bias in language models against specific demographic groups in France. Working collaboratively, she built on the US-centered CrowS-Pairs dataset to create a French equivalent, adapting the corpus to French and collecting additional French stereotypes using LDC's citizen science platform, LanguageARC. The result is a French corpus of 1,677 sentence pairs covering ten types of bias, such as gender and age, together with guidelines for further extending the dataset to other languages and cultural environments.
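
CrowS-Pairs-style evaluations typically score the two sentences of a minimal pair with a masked language model's pseudo-log-likelihood and check which one the model prefers. A minimal sketch of that metric, assuming the Hugging Face transformers library and a French masked language model such as camembert-base (the abstract does not specify an implementation):

    # Hypothetical sketch of pseudo-log-likelihood scoring for one sentence
    # pair; the model choice and details are assumptions, not the talk's code.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("camembert-base")
    model = AutoModelForMaskedLM.from_pretrained("camembert-base").eval()

    def pseudo_log_likelihood(sentence: str) -> float:
        ids = tok(sentence, return_tensors="pt")["input_ids"][0]
        score = 0.0
        for i in range(1, len(ids) - 1):      # skip the special tokens
            masked = ids.clone()
            masked[i] = tok.mask_token_id     # mask one token at a time
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            score += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        return score

    # The pair member with the higher score is the one the model "prefers".
    print(pseudo_log_likelihood("Les femmes sont mauvaises en maths."))
    print(pseudo_log_likelihood("Les hommes sont mauvais en maths."))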

Available: Slides in PDF

2020

Describing typical language development in early childhood in South Africa: Harnessing local knowledge through online technologies

Heather Brookes, Child Language Africa, University of Cape Town 

March 6, 2020; 12:00-1:30pm 

Heather Brookes reports on the development of online Communicative Development Inventories (CDIs) for South African languages, as well as the first attempts by the South African CDI team to harness citizens' grassroots knowledge of local varieties through LanguageARC, a citizen science portal for linguistics. CDIs are parent-completed questionnaires that assess children's gestural, lexical, and grammatical development from 18-30 months. Adapted for 6 of South Africa's 11 official languages, the CDIs use common gestural and lexical items and, where possible, equivalent grammatical measurements. Brookes discusses new possibilities for dealing with challenges related to the number of varieties of spoken languages in South Africa, as well as the highly mixed urban varieties resulting from language contact in urban areas.

Available: Slides in PDF  

2019

Construction and Analysis of the Chinese Abstract Meaning Representation Corpus

Bin Li, Nanjing Normal University 

November 22, 2019; 12:00-1:30pm

Abstract Meaning Representation (AMR) represents the meaning of a sentence as a single-rooted, directed, acyclic graph. In this talk, Bin Li introduces the ongoing project to build a Chinese AMR corpus (CAMR), of which Chinese Abstract Meaning Representation 1.0 (LDC2019T07) – consisting of AMR annotations on over 10,000 sentences from the newsgroup and weblog portions of Chinese Treebank 8.0 (LDC2013T21) – is the first release. Li discusses the annotation methodology adapted to accommodate Chinese linguistic features, presents some quantitative analysis of CAMR, and addresses questions about AMR in general, such as whether sentences should be represented by graphs or trees, whether concepts and relations can be related back to the original words, and whether AMR is applicable to other languages.
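
For readers unfamiliar with the formalism, AMR graphs are conventionally written in PENMAN notation. A small English illustration (not drawn from CAMR), decoded here with the open-source penman library:

    # "The boy wants to go": a single-rooted, directed graph. Reusing the
    # variable b as the agent of both events is what makes this a graph
    # rather than a tree. Assumes `pip install penman`.
    import penman

    graph = penman.decode("""
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (g / go-02
                :ARG0 b))
    """)
    print(graph.top)      # w -- the single root
    print(graph.triples)  # edges such as ('w', ':ARG0', 'b'), ('g', ':ARG0', 'b')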

Available: Slides in PDF, Video

A Tutorial on Finite-State Text Processing

Kyle Gorman, City University of New York; Google Research

November 5, 2019; 12:00-1:30pm

Finite-state machines are widely used in text and speech processing, particularly as probabilistic models of string-to-string transductions. In this talk, Kyle Gorman provides an introduction to finite-state text processing methods and software, beginning with finite acceptors, finite transducers, and weighted finite transducers and moving on to finite-state algorithms for constructing, optimizing, and searching transducers. He presents Pynini, a Python finite-state grammar library developed by Gorman and built on the OpenFst toolkit; compares its features with several other finite-state tools; and walks through several Pynini examples for spelling correction, morphological analysis, and pronunciation modeling.
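
As a taste of the library (a toy sketch assuming a standard pynini install, not an example from the talk), spelling correction can be expressed as composition of an input acceptor with a correction transducer:

    # Compose an input string with a union of correction transducers and
    # read off the best (here, only) output path.
    import pynini

    fixes = pynini.union(pynini.cross("teh", "the"),
                         pynini.cross("recieve", "receive"))

    for word in ("teh", "recieve"):
        print(pynini.shortestpath(pynini.accep(word) @ fixes).string())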

Available: Slides in PDF, Video

2018

Transcribing and Sorting Cairo Geniza Fragments in Partnership with Citizen Humanists: Scribes of the Cairo Geniza

Laura Newman Eckstein, University of Pennsylvania

June 1, 2018; 12:00-1:30pm

Laura Newman Eckstein introduces a project to sort and transcribe a corpus of 300,000 individual fragments written between the 10th and 13th centuries and discovered in the Cairo Geniza – a temporary holding chamber for unusable Hebrew texts prior to ceremonial burial. The texts include religious documents, wills, legal documents, and children’s handwriting practice. They are written in Hebrew and Arabic script and include the languages of Hebrew, Arabic, Judeo-Arabic, and Judeo-Persian, among others. Fewer than one-third of the texts have been catalogued, and even fewer have been transcribed. This project attempts to solve that problem, turning to web-based communities in partnership with citizen humanists all over the world, in order to engage non-language specialists in sorting and transcribing this sprawling archive. 

Boundary-Based MWE Segmentation and Applications

Jake Williams, Drexel University

January 26, 2018; 12:00-1:30pm

Multiword expressions (MWEs) are context-rich and often idiomatic phrases that could improve many machine learning and text analysis applications simply by replacing words as base features. In this talk, Williams describes the MWE segmentation task, a state-of-the-art method for performing it, and that method's data annotation requirements. Alongside this, Williams presents ongoing collaborative work applying MWE segmentation to the development of event detection tools for, and a political events database from, the Twitter media platform.

2017

Introduction to Beijing Advanced Innovation Center for Language Resources (ACLR): Objective, Mission, and Projects

Erhong Yang, Beijing Language and Culture University

November 20, 2017; 12:00-1:30pm

The Beijing Advanced Innovation Center for Language Resources (ACLR) is a scientific research institution funded by the Beijing Municipal Education Commission. In this talk, Professor Yang reviews the Center's objectives and mission, some of its language resources projects, and future work. The Center comprises three programs -- the Language Resources Bank, the Language and Culture Museum, and the Language Processing Service. Its language resources are represented by eight projects, grouped into three categories: resources to preserve world languages and display culture, resources for language teaching and technology research, and field-oriented or task-oriented language resources. The Center builds language resources from the perspectives of resource protection, cultural inheritance and information processing. In the future, ACLR will conduct research on language processing for low-resource languages, develop standards and specifications for language resource construction, and carry out evaluations of language resource applications.

Available: Slides in PDF

Corpus of Political Speeches in Greater China

Kathleen Ahrens, Hong Kong Polytechnic University

June 29, 2017; 12:00-1:00pm

Kathleen Ahrens highlights four on-line English and Chinese political corpora that she created. She explains how they are used for metaphor research in conjunction with information manually extracted from WordNet and SUMO.

Available: Slides in PDF

Language Resources from the LLT Group at the Hong Kong Polytechnic University: From phonological neighbourhood to semantic relata, from grammar to emotion

Chu-Ren Huang, Hong Kong Polytechnic University

June 29, 2017; 1:00-2:00pm

Chu-Ren Huang introduces new language resources developed by the Linguistics and Language Technology group, including a grammatically marked example database as a companion to A Reference Grammar of Chinese (Huang and Shi 2016), an emotion (and cause) annotated corpus of Chinese and English, a set of verified lexical semantic relation pairs in English and Chinese, and a phonological neighbourhood database of Mandarin Chinese.

Available: Slides in PDF, References in PDF

A Framework for Conducting Non-Expert Translations and Summarizations

Christopher Harris, SUNY Oswego

April 28, 2017; 12:00-2:00pm

For a decade, crowdsourcing platforms such as Amazon Mechanical Turk have demonstrated the ability to have non-experts perform translations and text summarizations at qualities approaching those of professionals. However, these crowdsourcing tasks need to be well designed to ensure quality outputs are produced. In this talk, Professor Harris discusses a framework for crowdsourcing both translations and text summarizations and presents some recent empirical experiments conducted using this framework. He also describes some design elements, including the number (depth) of crowdworkers needed for different tasks in the framework and how this depth affects output quality and task completion time. 

Available: Slides in PDF

2016

The Growth in Grammar Corpus: On Working with Children (But not Animals)

Mark Brenchley, University of Exeter

September 15, 2016; 12:00-2:00pm

Dr. Brenchley explains how Exeter’s Growth in Grammar project aims to build and analyse an extensive corpus of student writing: one that encompasses a range of genres and curricula, and which captures the full age and ability range of the English education system. He discusses specific questions of annotation and transcription raised by the work so far, such as how to handle material of widely varying kind and quality and to what extent the specific structures that best capture the grammatical dimensions of writing development can be identified.

Available: Slides in PDF

The Language Grid: Multi-Language Service Platform for Intercultural Collaboration

Toru Ishida, Kyoto University

July 14, 2016; 12:30-2:00pm

Professor Ishida explains how the Language Grid allows users to freely create language services from existing language resources and combine them to develop new services for their own communication environment. He describes the Grid's design concept and service architecture together with YMC-Viet, a youth-mediated communication project in Vietnam in which Japanese agricultural experts transfer knowledge to Vietnamese farmers in regions with low literacy. By integrating various services registered to the Language Grid, a communication channel between experts and farmers, mediated by children, was realized to bridge significant gaps in language, knowledge, culture and distance.

Available: Slides in PDF 

Social Data Research at a National Laboratory

Eric Bell, Pacific Northwest National Laboratory, Richland, Washington

July 7, 2016; 12:00-1:30pm

Eric Bell describes the linguistic, computer science and statistical research conducted with social data by the Pacific Northwest National Laboratory. He also discusses the importance of relationships in government and industry for being successful in the field.

Available: Slides in PDF   

2015

Multimodal Interaction Standards at the World Wide Web Consortium

Deborah A. Dahl, Conversational Technologies; Chair, W3C Multimodal Interaction Working Group

December 4, 2015; 12:00-1:30pm

Deborah A. Dahl describes three of the standards developed by the W3C Multimodal Interaction Working Group that can be used to represent multimodal user inputs and system outputs. Extensible Multimodal Annotation (EMMA) represents cross-modality metadata for user inputs and system outputs. Ink Markup Language (InkML) represents traces in electronic ink, for example, for applications such as handwriting or gesture recognition. Emotion Markup Language (EmotionML) can be used to represent emotions.

She then discusses how these tools interoperate with each other and offers some thoughts about future directions.

Available: Slides in PDF  

2014

Comparing Dialect and Accented Pronunciations on the Basis of Transcriptions and Articulography

Martijn Wieling, Department of Humanities Computing, University of Groningen

September 12, 2014; 3:00-4:30pm

Dr. Wieling introduces the Pointwise-Mutual-Information-based distance, which is able to determine sensitive pronunciation distances and sound segment distances on the basis of transcribed speech (Wieling et al., 2012). He then applies the PMI-based Levenshtein distance to transcriptions of accented English speech obtained from the Speech Accent Archive and validates the method by comparing the computational distances to perceptual judgments of hundreds of native speakers of American English (Wieling et al., in press).

Dr. Wieling also illustrates how articulography (i.e. measuring the movement of tongue and lips during speech) is useful for comparing pronunciations. To demonstrate the method, he shows significant movement differences between native and non-native speakers of English, as well as Dutch dialect speakers.
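
For orientation, the plain Levenshtein distance that the PMI-based measure refines can be written in a few lines; Wieling's method replaces the unit insertion, deletion and substitution costs with segment distances induced from pointwise mutual information (the sketch below is illustrative, not the authors' code):

    # Classical Levenshtein distance over two transcriptions; the PMI
    # variant swaps the constant costs of 1 for learned segment distances.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("melk", "melek"))  # 1: one vowel insertion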

Martijn Wieling, Jelke Bloem, Kaitlin Mignella, Mona Timmermeister and John Nerbonne (forthcoming). Automatically measuring the strength of foreign accents in English. Language Dynamics and Change.

Martijn Wieling, Eliza Margaretha and John Nerbonne (2012). Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2), 307-314.

Available: Slides in PDF

Aikuma: A Mobile App for Collaborative Language Documentation

Steven Bird, University of Melbourne; LDC, University of Pennsylvania

July 2, 2014; 12:00-1:00pm

Dr. Bird describes Aikuma, a mobile app that is designed to put the key language documentation tasks of recording, respeaking, and translating in the hands of a speech community. He discusses the motivation for this approach, describes the system, reports on its use in fieldwork and presents ongoing research and development activities.

Available: Slides in PDF

2013

The Corpus of Interactional Data: A Large Multimodal Annotated Resource

Philippe Blache, CNRS; Aix-Marseille University

November 20, 2013; 12:00-1:30pm

In this talk, Professor Blache presents a methodology for creating annotated multimodal corpora. Such resources must take into consideration all types of linguistic information, from phonetics to discourse and gesture, so a homogeneous framework is needed that can represent these different domains in a coherent way. To this end, he proposes the elaboration of an abstract annotation scheme relying on typed feature structures. He shows how such a scheme can be instantiated at large scale with the example of a large multimodal corpus, the CID (Corpus of Interactional Data). After presenting the abstract scheme, he details the annotation process for the different domains and concludes by showing how such a scheme serves data reusability as well as interoperability of annotation tools.
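
A typed feature structure is essentially a typed, nested attribute-value matrix; one per annotated unit lets phonetic, prosodic and gestural annotations share a single representation. A schematic Python miniature (types and attribute names invented for illustration; the CID scheme itself is far richer):

    # Hypothetical typed feature structure for one token, aligning
    # annotations from several domains on a shared time interval.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Interval:
        start: float  # seconds
        end: float

    @dataclass
    class Token:
        span: Interval
        orthography: str
        phonetic: str            # phonetic transcription
        prosody: str             # e.g. a pitch-accent label
        gesture: Optional[str]   # co-occurring gesture label, if any

    print(Token(Interval(12.35, 12.61), "oui", "wi", "H*", None))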

Available: Slides in PDF

2012

The Sociolinguistic Archive and Analysis Project: Data, Tools and Applications

Tyler Kendall, University of Oregon

May 18, 2012; 12:00-2:00pm

Dr. Kendall describes his work on the Sociolinguistic Archive and Analysis Project (SLAAP), a web-based data preservation and exploration initiative centered at North Carolina State University. SLAAP houses audio recordings and associated materials from over 2,500 sociolinguistic interviews. In addition to its basic preservational, organizational and access-related features, a centerpiece of the archive is a time-aligned, databased transcription model which allows for dynamic, corpus-like analysis of the transcript data as well as real-time phonetic processing of the audio from the transcripts. The presentation describes the background of the project and its architecture, provides a demonstration of several of the web-based features, and discusses some of the ways that the archive and tools are being used in current sociolinguistic research.

Available: Slides in PDF

2011

Building a universal corpus of the world's languages

Steven Bird, University of Melbourne; LDC, University of Pennsylvania

Jul 20, 2011; 12:30-2:30pm 

This talk reports ongoing work on developing a corpus that includes all of the world's languages, represented in a consistent structure that permits large-scale cross-linguistic processing (Abney & Bird 2010). The focal data types, bilingual texts and lexicons, relate each language to a reference language. The ability to train systems to translate into and out of a given language is proposed as the yardstick for determining when that language is adequately documented. Bird reports on recent efforts to incorporate datasets built by the language resources community, via the "Language Commons". He describes a new project that is recording and transcribing material from a large number of unwritten languages in Papua New Guinea.

Steven Abney and Steven Bird, The Human Language Project: Building a universal corpus of the World's languages, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010).

Available: Slides in PDF

Coding Conventions for Archival Sharing

Malcah Yaeger-Dror, University of Arizona; LDC consultant

Jun 29, 2011; 12:30-2:30pm

Dr. Yaeger-Dror discusses her recent work at LDC formulating coding conventions for speech archives in three areas: (1) coding for the social situation; (2) demographic coding that could be relevant to future studies; and (3) the influence of interpersonal attitudes on speech variation. Most of the talk centers on this last point, focusing on speakers' attitudes toward their interlocutors and how one might go about determining this information without recourse to Gilesian psychological studies.

Available: Slides in PDF

Free recall of word lists: empirical and theoretical issues

Michael J. Kahana, University of Pennsylvania

Jun 15, 2011; 1:30-3:00pm 

Professor Kahana discusses the major empirical phenomena concerning recall of common words and the theoretical issues raised by these phenomena. He shows how memory researchers have devised theories to explain these data and presents some critical tests of those theories.

Contact, Restructuring, and Decreolization: The Case of Tunisian Arabic

Thomas A. Leddy-Cecere, Dartmouth College

Jan 14, 2011; 12pm-2pm

The modern Arabic colloquial dialects stand out in the world of dialectology and historical linguistics. Though all languages display dialectal variation, Arabic represents a special case -- attempts to classify and trace its varieties by standard linguistic techniques do not produce satisfactory results. It has been suggested (Versteegh, 1984) that this failure could be explained by positing that modern Arabic is a product of creolization. Leddy-Cecere's study represents a focused, cross-dialectal examination of a single Arabic dialect area (Tunisia) in search of evidence of creolization and its subsequent effects.

Available: Slides in PDF, Leddy-Cecere's Honors Thesis in PDF

2010

Language Technology Resources for Sanskrit and other Indian Languages at Jawaharlal Nehru University, India

Girish Nath Jha, Jawaharlal Nehru University; University of Massachusetts Dartmouth

Jul 22, 2010; 12pm-2pm

The introductory section of the talk presents the complex linguistic scenario of India and the role Sanskrit has to play in it. The next section discusses the language technology resources being developed at Jawaharlal Nehru University, New Delhi, the premier university of India for Sanskrit and other major Indian languages under Technology Development for Indian Languages -- an initiative of the Indian government’s Department of Information Technology. The concluding section highlights the issues and challenges being faced and scope for future collaboration.

Available: Slides in PDF

Bibliotheca Alexandrina: The oldest library in the digital age

Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina

Magdy Nagi and Ahmed Bhargout, Bibliotheca Alexandrina

Apr 28, 2010; 3:00-5:00pm    

"Bibliotheca Alexandrina: The oldest library in the digital age"

Since its inauguration in 2002, Bibliotheca Alexandrina (BA) has devoted itself to being a center of excellence and the right platform to address the ever-advancing technology of our time. This presentation sheds light on some of the initiatives of the BA, carried out by the International School of Information Science (ISIS), to document and preserve heritage through digital archives and to offer the right platform for advancing scientific research, in addition to its digital initiatives to boost children's skills and to catalyze reform in the Arab world.

Available: Slides in PDF

"Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina"

The Universal Networking Language (UNL) project is regarded as one of the most promising scientific research initiatives at the BA carried out by the ISIS. This presentation gives an overview of the history of the UNDL Foundation, which directs the program, and its partnership with the BA. The presentation also gives a comprehensive overview of the UNL specifications; the software and system development at the BA, including the UNL +3 engines and Knowledge Extraction System; and the tools deployed for the center. The Book Catalogue is also demonstrated as one of the applications built on top of the UNL.

U.S. Supreme Court Corpus (SCOTUS)

Daniel Katz, J.D., Michigan Law School and Michael Bommarito, University of Michigan

Jan 26, 2010; 10:00-12:00pm   

The corpus of Supreme Court written opinions is a rich linguistic resource. Not only does this corpus provide a longitudinal sample of formal American English, but it is also a source of text with identified authors and vote-coded sentiment. Despite this value, and despite years of qualitative and quantitative study of the United States Supreme Court, no compiled corpus of these opinions is currently available to researchers. The purpose of this talk is (1) to describe efforts to compile the complete corpus of Supreme Court opinions and associated metadata, (2) to outline a number of current research projects utilizing this data, and (3) to discuss criticism, potential projects, and possible collaboration.

Available: Slides in PDF

2009

Variations Across Languages, Divisions Within Communities: Languages, Schools and the Internet in Tunisia

Simon Hawkins, Franklin and Marshall College

Nov 12, 2009; 12:00-2:00pm  

Despite the convenient categorization of languages, particularly English, into national varieties, some specific discourses cross national and linguistic borders. For some of these varieties, such as academic writing and Internet conventions, what is linguistically important is not the language used but the global discourses and language ideologies of which they are a part. Multilingual practices in Tunisia illustrate these dynamics.

The LDC Standard Arabic Morphological Tagger

Rushin Shah, LDC

Jun 10, 2009; 12:00-2:00pm    

The current process of Arabic corpus annotation at LDC relies on using the Standard Arabic Morphological Analyzer (SAMA) to generate various morphology and lemma choices, and supplying these to manual annotators who then pick the correct choice. However, a major constraint of this process is that SAMA can generate dozens of choices for each word, each of which must be examined by the annotator. Moreover, SAMA does not provide any information about the likelihood of a particular choice being correct. A system that ranks these choices in order of their probabilities and manages to assign the highest or second-highest rank to the correct choice with a high degree of accuracy would hence be very useful in accelerating the rate of annotation of corpora. Such a system would also be able to aid intermediate Arabic language learners by creating annotated versions of news articles or other web pages submitted by them.

Shah describes such a model that simultaneously performs morphological analysis and lemmatization for Arabic, using choices supplied by SAMA. Morphological labels are converted into vectors of morphosyntactic features (MSFs), such as basic part-of-speech, gender, number, mood, prefixes, suffixes, case, etc. These attributes of the supplied Arabic data are then used to create models for lemmas and MSFs. Individual models are combined into one aggregate model that simultaneously predicts lemmas and complete morphological analyses. This model achieves accuracy in the high nineties.
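
Schematically (an invented miniature, not Shah's implementation), each SAMA analysis is decomposed into MSF values, per-feature models assign probabilities, and the aggregate log-probability ranks the candidate analyses shown to annotators:

    import math

    # Toy per-feature probability tables; in the real system these are
    # statistical models trained on annotated Arabic data.
    feature_models = {
        "pos":    {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2},
        "gender": {"FEM": 0.4, "MASC": 0.6},
        "number": {"SG": 0.7, "PL": 0.3},
    }

    def score(analysis):
        # Sum of per-feature log-probabilities (a naive product model).
        return sum(math.log(feature_models[f][analysis[f]])
                   for f in feature_models)

    candidates = [  # hypothetical SAMA choices for one word
        {"pos": "NOUN", "gender": "FEM", "number": "SG"},
        {"pos": "VERB", "gender": "MASC", "number": "PL"},
    ]
    # Rank so the correct choice is (ideally) first or second.
    for a in sorted(candidates, key=score, reverse=True):
        print(round(score(a), 2), a)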

Available: Slides in PDF

Building an ASL Corpus Project

Gaurav Mathur, Gene Mirus, and Paul Dudis, Gallaudet University

Jan 15, 2009; 3:00-5:00pm    

The present American Deaf communities need to preserve a sample of their language so that future generations can appreciate its richness. There is also a practical need for materials that can be used in a wide variety of settings, ranging from American Sign Language (ASL) instruction for deaf children to training people who work with deaf communities. This talk describes a long-term project that meets those needs, namely, the establishment of an ASL corpus by collecting a comprehensive and representative sample of ASL from around the nation. The talk opens with a description of successful sign language corpora outside the United States and then offers an outline of the ASL corpus project that is currently underway, including the kinds of data to be collected and the methodology to be used for the data collection.

2008

Development of Resources and Techniques for Processing of Some Indian Languages

Shyam S. Agrawal, KIIT College of Engineering

Jul 17, 2008; 11:30-1:30pm  

In the past few decades there has been a pressing demand to develop speech and language corpora for training, testing and benchmarking of speech technology systems for various applications. Richly annotated corpora and labeled databases are needed to develop models of spoken languages and also to understand the structure of speech and the variability that occurs in speech signals.

This talk presents some of the phonetic differences between Hindi and English and gives an overview of the efforts made by CDAC, Noida and other institutions to develop text and speech databases, tools and techniques for the processing of some Indian languages. For the collection of speech databases, issues and procedures related to text selection are covered, e.g. multiform units and variables such as demographic, dialectal, environmental, emotional and linguistic background. Special tools developed for the analysis of text are described; the objective has been that the tools should be adaptable to other Indian languages and beyond. Some application-oriented, task-specific databases, such as the ELDA-sponsored database for Hindi and the CFSL Speaker Identification database for forensic applications, are described in detail.

Available: Slides in PDF

2007

HTML Templates for LDC Sponsored Projects

Shawn Medero, LDC

Jul 19, 2007; 2:30-4:30pm  

This is an introduction to existing resources for communicating LDC projects over the web. The templates presented provide a professional, consistent and attractive design while allowing creativity and variation in each project's web site approach. They help define a predictable and organized navigation structure for LDC employees, project sponsors, researchers and general visitors. Questions concerning the use of web standards, such as CSS and HTML, are encouraged throughout the presentation.

Speaking Arabic in Iraq and the Middle East: Reflections on Three Tours of Duty

Kenneth Gardner, USMC, ret.

Jul 13, 2007; 2:30-4:30pm    

Kenneth Gardner served in the U.S. military for 22 years. He learned Arabic at the Defense Language Institute Foreign Language Center in 1995. Gardner shares his experiences as a non-native Arabic speaker communicating with native speakers in a variety of settings as well as the issues faced by monolingual U.S. troops in the field.

Programming Specifications: Procedures and Practices

Andrew Cole, LDC

Jun 20, 2007; 3:00-5:00pm    

Virtually all tasks at LDC depend on programmer input. LDC policy requires that most requests for programming assistance be accompanied by a specification that describes the desired output and includes some estimate of the programming time needed to complete the tasks in the specification. Cole outlines a simple set of guidelines for LDC staff to follow when making programming requests and illustrates how those guidelines work by using them to develop a business systems specification.

2006

Comparing Linguistic Annotations -- Issues in Harmonization and Quality Control

Christopher R. Walker, LDC

Oct 26, 2006; 3:00-5:00pm   

Consistency analysis was an important aspect of quality assessment in 2005 ACE data creation. Within this framework, Walker became quite interested in the various assumptions and applications of annotation scoring infrastructure. As he attempted to better understand the bounds of the problem and solution spaces, it quickly became clear that there was no existing discussion of these issues -- and very little documentation of general best practices.

In this talk Walker seeks to reduce this gap by outlining the apparent problem and solution spaces and by opening for discussion the utility of annotation scoring metrics in the various domains of empirical, computational and corpus linguistics -- and more cogently in the domain of quality control for linguistic data creation.

Recording and Annotation of Speech Data via the WWW - A Case Study

Dr. Christoph Draxler, Ludwig-Maximilians University

Sep 22, 2006; 10:30-11:30am    

The German Ph@ttSessionz project will create a database of 1000 adolescent speakers balanced by gender and covering all major German dialect areas. The project employs a novel approach to collecting speech data: all recordings are performed via the WWW -- using a web application in a standard web browser -- in more than thirty-five German public schools. Speech is recorded using a standardized audio setup on the school's PC, and the signal and administrative data are immediately transferred to the BAS server in Munich. Using this approach, geographically distributed recordings in high bandwidth quality can be made efficiently and reliably.

Draxler describes the Ph@ttSessionz web application and its major components, SpeechRecorder and WebTranscribe, and outlines the infrastructure developed at BAS for WWW-based speech recordings. He also discusses the strategies employed to enlist schools in the project and presents preliminary analyses of the Ph@ttSessionz speech database.

LDC Online

David Graff, LDC

Jul 27, 2006; 3:00-5:00pm    

Graff presents an overview of LDC Online's corpora coverage, search methodology and future plans for growth.

Pros and Cons of Different Annotation Workflow Systems 

Seth Kulick, IRCS; Julie Medero, Hubert Jin, David Graff and Kevin Walker, LDC

May 4, 2006; 3:00-5:00pm 

This LDC Institute is part of an effort to determine the desirable properties for a single workflow system that can be used (or extended as appropriate) in the various annotation projects at LDC and IRCS (Penn's Institute for Research in Cognitive Science). Because the several different workflow systems currently in use were designed for projects with different needs, they handle many issues differently, among them support for local and remote annotation, sophistication of reporting capability, use of automated tagging, and flexibility in the specification of workflow stages.

The speakers are LDC/IRCS programmers who designed some of the current systems. They discuss the following topics: (1) the properties of the different systems; (2) why some characteristics of a particular workflow system might make it unsuitable for a particular project; (3) properties that should be added to the workflow systems; and (4) alternative ways of setting up a workflow system.

Recent Trends in Annotation Tool Development at LDC

Kazuaki Maeda, Julie Medero and Haejoong Lee, LDC

Apr 20, 2006; 3:00-5:00pm    

LDC has created large volumes of annotated linguistic data for a variety of evaluation programs and projects using highly customized annotation tools developed on site. LDC programmers Maeda, Medero and Lee discuss the history of annotation tool development at LDC and share some current approaches. Two tools in particular are highlighted: (1) LDC's model for decision-point-based annotation and adjudication, which was used effectively in the ACE 2005 annotation effort; and (2) XTrans, a new speech transcription and annotation tool particularly suited to transcribing meeting speech, which was used by LDC in the NIST meeting recognition evaluation and the Mixer Spanish and Russian telephone conversation studies.

2005

Building a Lexicon Database for Arabic Dialects

David Graff, LDC

Dec 8, 2005; 3:00-5:00pm   

One of the major problems in creating a lexicon database for Arabic dialects is the fact that standardized orthographic (spelling) conventions do not generally exist. The word forms generated by transcribers from recorded conversations are based on relatively loose conventions and show significant variability within any given dialect. Graff describes how that problem is being resolved by creating a relational database design that makes the transcripts a key part of the database so that repairs to word forms in the lexicon table are propagated automatically to the transcripts. He also reviews some earlier approaches to lexicon building, describes the annotation tools developed specifically for the current lexicon project, and briefly considers some possible extensions to the database structure and annotation methods LDC currently uses to cover tasks such as treebank annotation.
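
The design's key move, pointing transcript tokens at lexicon rows instead of storing raw word forms, can be sketched with two tables (an invented miniature, not LDC's actual schema):

    # Tokens reference lexicon entries by id, so one repair to a word
    # form in `lexicon` propagates to every transcript automatically.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE lexicon (id INTEGER PRIMARY KEY, form TEXT);
        CREATE TABLE token (id INTEGER PRIMARY KEY, transcript TEXT,
                            pos INTEGER, lex_id INTEGER REFERENCES lexicon(id));
        INSERT INTO lexicon VALUES (1, 'mneeH');
        INSERT INTO token VALUES (1, 'call_042', 7, 1), (2, 'call_108', 3, 1);
    """)
    db.execute("UPDATE lexicon SET form = 'mniiH' WHERE id = 1")  # one repair
    for row in db.execute("""SELECT t.transcript, t.pos, l.form
                             FROM token t JOIN lexicon l ON t.lex_id = l.id"""):
        print(row)  # both transcript tokens now show the repaired form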

Available: Slides in PDF

Less Commonly Taught Languages (LCTLs)

Mark Mandel, LDC

Nov 17, 2005; 3:00-5:00pm    

One of LDC’s principal tasks in the Less Commonly Taught Languages (LCTLs) portion of the REFLEX (Research on English and Foreign Language Processing) Project is to discover, produce, and maintain language resources for the target languages. Those resources include, among other things, linguistic information, writing systems, converters, word segmenters, electronic lexicons, monolingual texts, bilingual texts, morphological parsers, and tools for producing and using annotated resources.  Mandel describes the challenges associated with assembling those language resources, as well as the current progress of LDC’s work focusing on Thai, Urdu, Bengali, Panjabi, Hungarian, Tamil and Yoruba.

The Teaching of Berber in Morocco: Reality and Perspectives

Fatima Agnaou, IRCAM

Jul 7, 2005; 2:00-4:00pm

Agnaou discusses IRCAM (The Royal Institute of Amazigh Culture) and its achievements regarding the integration of Berber in Moroccan schools. She addresses the aims and objectives of teaching in Berber and the standardization of the Berber language. A presentation of textbooks and other teaching materials is also included, along with discussion of teacher training and methodology.

Functional Morphology

Otakar Smrz, Charles University

Feb 3, 2005; 12:30-2:30pm   

Computational morphological models are usually implemented as finite-state transducers. The morphologies of natural languages are, however, better described in terms of inflectional paradigms, lexicons, and categories (inflectional and inherent parameters). Markus Forsberg and Aarne Ranta have recently introduced a framework called Functional Morphology (FM) that smoothly reconciles both of these viewpoints. Linguists can model their systems without any 'finite-state restrictions', using the full power of the functional language Haskell and delegating actual computational issues to FM. The morphological models become clearer, reusable, (ex)portable, and even more efficient. Smrz highlights the notable features of FM/Haskell and outlines plans to use it for Arabic. He also refers to other languages, including Latin, Swedish, Sanskrit, Spanish and Russian.
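
FM's central idea, a paradigm as a function from a dictionary form to its full inflection table, carries over directly; the following Python miniature is only an analogy (FM itself is a Haskell library), using a fragment of a Latin paradigm:

    # Paradigm = function from a stem to an inflection table keyed by
    # (case, number); the lexicon pairs stems with paradigm functions.
    def first_declension(stem):
        return {("nom", "sg"): stem + "a",
                ("acc", "sg"): stem + "am",
                ("nom", "pl"): stem + "ae",
                ("acc", "pl"): stem + "as"}

    lexicon = [("puell", first_declension), ("ros", first_declension)]
    for stem, paradigm in lexicon:
        print(paradigm(stem)[("acc", "pl")])  # puellas, rosas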

2004

Arabic Propbank

Mona Diab, Stanford University

May 13, 2004; 12:30-2:30pm

The holy grail of computational linguistics, from the time of the field's inception, has been (automatic) natural language understanding. Semantic parsing represents a significant stride in that direction, giving researchers a glimpse into the world of concepts in a functional and operational manner. Thanks to the English PropBank, the task of semantic role assignment received a jumpstart in several community-wide standardised evaluations such as CoNLL and Senseval-3. In fact, the porting of PropBank-style annotation to ten Chinese verbs has achieved very interesting results in a relatively short period of time (Sun & Jurafsky, 2004) and, more importantly, has shown some relevant quantifiable linguistic variation between English and Chinese.

Diab focuses on three of the verbs which are fully annotated, discussing the different frames and roles associated with their arguments and adjuncts. The main issue that arises frequently is that of consistency and variation, especially with respect to assigning ARGM and ARG3 roles to constituents. She advocates more generalised, consistent annotation both across verbs and within a single verb (where role assignment sometimes shifts, potentially reflecting sense variation). Diab discusses some of the rudimentary guidelines she has set for herself, inspired by the annotation guidelines of the English and Chinese PropBanks.

Project Santiago

Colonel Stephen A. LaRocca, Center for Technology Enhanced Language Learning

Mar 23, 2004; 1:30-2:30pm   

The Center for Technology Enhanced Language Learning (CTELL), organized within the Department of Foreign Languages at the U.S. Military Academy, collects speech data and turns it into speech recognition software primarily for the benefit of cadets learning languages at West Point. Since 1997 CTELL has collected broadband speech corpora for Arabic, Russian, Portuguese, American English, Spanish, Croatian, Korean, German and French.

Tongue-Tied in Singapore: A Language Policy for Tamil?

Harold F. Schiffman, University of Pennsylvania, Department of South Asia Studies

Feb 26, 2004; 12:30-2:30pm

The Tamil situation in Singapore lends itself ideally to the study of minority language maintenance. The Tamil community is small and its history and demographics are well known. The Singapore educational system supports a well-developed and comprehensive bilingual education program for its three major linguistic communities on an egalitarian basis, so Tamil is a sort of test case for how well a small language community can survive in a multilingual society where larger groups are doing well. But Tamil is acknowledged by many to be facing a number of crises: Tamil as a home language is not being maintained by the better-educated, and Indian education in Singapore is not living up to the expectations many people have for it. Educated people who love Tamil regret that it is coming to be thought of as a 'coolie' language. Since Tamil is characterized by extreme diglossia, there is the additional pedagogical problem of trying to maintain a language with two variants, with a strong cultural bias on the part of the educational establishment for maintaining the literary dialect to the detriment of the spoken one.

Schiffman examines these attempts to maintain a highly-diglossic language in emigration and concludes that the well-meaning bilingual educational system actually produces a situation of subtractive bilingualism.

The Contextualization of Linguistic Forms across Timescales

Stanton Wortham, University of Pennsylvania, Graduate School of Education

Feb 10, 2004; 1:00-3:00pm   

When people speak, they socially identify themselves and others. The use of linguistic forms is one important means through which social identities get established. But the implications of any utterance for social identity depend on relevant context. Decontextualized sociolinguistic regularities cannot fully explain how a given utterance establishes a given social identity, although such regularities certainly play an important role. The centrality of context seems to imply, methodologically, that a full linguistic analysis of social identification must rely on case studies of particular utterances in context. Social identification, however, is not a phenomenon of isolated cases. An individual gets socially identified across a series of interrelated events, not a series of unique, unrelated contexts.

Wortham describes an empirical research project which traces the identity development of a ninth grade student across the academic year in one classroom. The student’s identity develops in substantial part through speech events that position her in socially recognizable ways. Wortham presents a methodological strategy for analyzing the interrelated events across which this individual gets socially identified, focusing on specific kinds of speech events that play a central role in the emergence of her social identity across time. This project provides an opportunity to reflect on the kinds of data necessary for studying social identification.

Interfaces for Parser and Dictionary Access

Malcolm D. Hyman, Harvard University

Jan 26, 2004; 1:30-3:30pm   

The subject of this presentation is "linguistic middleware" — software designed to mediate between backend linguistic tools and data sources (for instance, tokenizers, morphological analyzers, and parsers) and frontend user agents (browsers and editors). Making linguistic data available within graphical user agents will allow for rich, next-generation working environments that can offer substantial benefits to language researchers and students. Simple but powerful interfaces will allow for interoperability between diverse technologies, including legacy systems. Current web-services and XML standards provide the basis for the development of such interfaces. The goal is distributed networks that connect arbitrary tools, databases, reference works, and corpora; ultimately, this architecture will help to break down barriers between scholarly communities and to enrich the work of linguists, philologists, technologists, historians, and literary scholars.

In order to realize the vision of generalized linguistic middleware, we need to address a range of challenges encountered in typologically diverse languages and writing systems. Hyman focuses on:

  • multiple approaches to tokenization required by different writing systems
  • orthographic normalization
  • handling different "window sizes" needed for context-sensitive analysis
  • strategies for identifying lexical items that are realized discontinuously
  • metalanguages for morphosemantic and syntactic category labels

The discussion is accompanied by demonstrations of some prototype implementations and solutions.

2003

Finite State Morphology using Xerox Software

Kenneth Beesley, XRCE

Dec 9, 2003; 12:00-2:00pm   

Morphological analysis (word analysis) and generation are foundation technologies used in many kinds of natural-language processing, including dictionary lookup, language teaching, part-of-speech disambiguation (tagging), syntactic parsing, etc. Successful and publicly available software implementations based on "finite-state" theory include Koskenniemi's Two-Level Morphology, AT&T Lextools, Groningen University's FSA Utilities, and Xerox's lexc and xfst languages. The Xerox tools are now available on CD-ROM in a book entitled "Finite State Morphology", Beesley & Karttunen, 2003, CSLI Publications. Beesley briefly and gently covers the history and underlying theory of finite-state morphology, and introduces lexc and xfst syntax.

Finite-state morphological analyzers are excellent projects for a Master's Thesis, and for field linguists they are a practical way to encode, computerize and test morphotactic grammars, alternation rules and lexicons that would otherwise remain inert on paper. Finite-state morphology has been successfully applied to languages around the world, including the obviously "commercial" European languages, Finnish, Hungarian, Basque, Turkish, Korean, Arabic, Syriac, Hebrew, several Bantu languages, a variety of American Indian languages, etc.
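
To give a flavor of the alternation rules such toolkits encode, here is English e-insertion before the plural suffix as a context-dependent rewrite, sketched with the open-source Pynini library rather than the Xerox tools themselves:

    # fox+s -> foxes, dog+s -> dogs: rewrite the morpheme boundary "+" as
    # "e" between a sibilant and "s", then delete any remaining boundary.
    import pynini

    sigma = pynini.union(*"abcdefghijklmnopqrstuvwxyz+").closure()
    e_insertion = pynini.cdrewrite(pynini.cross("+", "e"),
                                   pynini.union("s", "x", "z"), "s", sigma)
    boundary_deletion = pynini.cdrewrite(pynini.cross("+", ""), "", "", sigma)
    rules = e_insertion @ boundary_deletion

    for word in ("fox+s", "dog+s"):
        print(pynini.shortestpath(pynini.accep(word) @ rules).string())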

Searching through Prague Dependency Treebank

Jiri Mirovsky and Roman Ondruska, Charles University

Oct 15, 2003; 1:00-3:00pm   

Netgraph is a search tool for annotated treebanks. Originally developed for the Prague Dependency Treebank, Netgraph is a multiuser system with a net architecture: more than one user can access it at the same time, and its components may be located in different nodes of the Internet.

Mirovsky and Ondruska show Netgraph in use. Although working with Netgraph is easy, the client-server architecture requires actions from the user which a stand-alone application does not. They describe four parts of working with Netgraph -- connecting to the server, selecting files, creating a query, and viewing the results. Netgraph provides a simple-to-use yet powerful query language. The basic query is a tree structure with a few evaluated attributes. Searching a corpus given a query means searching for all trees which contain the query tree as a subtree. This basic functionality is extended by so-called meta attributes -- an easy way to add more restrictions to found trees, e.g. size, orientation, position of the query tree, forbidden nodes, etc. Mirovsky and Ondruska show several examples of queries, from the simplest to more complex ones.
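
The heart of such a query language, testing whether the query tree occurs as a subtree of a corpus tree, is easy to state recursively (a simplified sketch; Netgraph's actual query language adds attribute operators and the meta attributes described above):

    # A node matches if its attributes include the query node's and each
    # query child matches a distinct child of the corpus node.
    def matches(query, node):
        if any(node["attrs"].get(k) != v for k, v in query["attrs"].items()):
            return False
        pool = list(node["children"])
        for qc in query["children"]:
            hit = next((c for c in pool if matches(qc, c)), None)
            if hit is None:
                return False
            pool.remove(hit)
        return True

    def search(query, tree):
        # Yield every node of the corpus tree where the query matches.
        stack = [tree]
        while stack:
            node = stack.pop()
            if matches(query, node):
                yield node
            stack.extend(node["children"])

    tree = {"attrs": {"lemma": "číst"}, "children": [
        {"attrs": {"afun": "Sb"}, "children": []},
        {"attrs": {"afun": "Obj"}, "children": []}]}
    query = {"attrs": {"afun": "Obj"}, "children": []}
    print(sum(1 for _ in search(query, tree)))  # 1 match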

The Pennsylvania Sumerian Dictionary Project

Stephen Tinney, University of Pennsylvania Museum of Archaeology & Anthropology

Nov 12, 2003; 1:30-3:30pm   

The Pennsylvania Sumerian Dictionary project is an online dictionary which combines lexicon and text-corpora within an interface offering multiple entry points to the Dictionary, multiple views of the lexicon and individual items, and reverse navigation back from the text-corpora. The framework within which this system is implemented is a generic XML data structure for corpus-based dictionaries.

The data structure binds control lists with the references drawn from the text-corpora. These references are tagged with morphological and semantic information to enable programmatic generation of lexical articles containing exhaustive information on orthography, chronological and geographical distribution and usage. Tinney describes the particular problems associated with writing a dictionary of Sumerian and the corpus-based dictionary model and demonstrates the state of the current implementation.

Arabic Language: Issues and Perspectives

Mohamed Maamouri and Tim Buckwalter, LDC

Apr 10, 2003; 12:00-2:00pm   

The presentation starts from the early standardization of Arabic and leads to the emergence of 'diglossia' and its linguistic and sociolinguistic consequences. The dominant attitudes toward linguistic reforms are also presented. The second focal point is the Arabic reading process, its challenges and its consequences for reading performance and for education in general.

In the second part of the presentation, Buckwalter focuses on Arabic NLP issues. He presents his morphological analyzer and lexicon, followed by a brief overview of the LDC Arabic Treebank project.

Collections

David Miller, LDC

Mar 20, 2003; 12:00-2:00pm   

Miller discusses the various speech collection projects undertaken at LDC from 1995-2003, including both telephone speech data collection projects and on-site speech collection projects. Projects covered include CallHome, CallFriend, Switchboard, ROAR and FISHER.

Data and Annotations for Sociolinguistics (DASL): Using digital data to address issues in sociolinguistic theory

Stephanie Strassel, LDC

Jan 16, 2003; 12:00-2:00pm 

A longstanding focus of sociolinguistic research has been the quantitative analysis of language variation and change, an endeavor that necessarily begins with the empirical observation and statistical description of linguistic behavior. Current technology encourages the collection and analysis of such data, and even the presentation and publication of research findings, wholly within the digital domain. Within the field of human language technology, such benchmark data has proven to be an essential ingredient for progress, as it reduces the cost of analytical infrastructure within research communities, frees researchers to focus on their interests, encourages collaboration and reduces impediments to new participants. However, most empirical data within sociolinguistics continues to be collected and analyzed by individuals or individual research groups, and is never made available to the wider research community. This proprietary approach to data hampers collaboration, the replication of studies and the comparison of models, methods and results, all necessary components of rigorous science. The prospect of digital data sharing in sociolinguistics also raises theoretical and methodological questions: Can sociolinguists make effective use of existing corpora? Do the insights gained from corpus data differ qualitatively from data most commonly used in quantitative sociolinguistics, namely recordings of sociolinguistic interviews? What are the best practices for the creation of new digital resources for sociolinguistic research?

The Data and Annotations for Sociolinguistics (DASL) project, based at LDC with support from NSF via the Talkbank project, begins to address these issues. DASL investigates the use of digital data in sociolinguistics through a series of case studies involving both the analysis of variation in existing corpora and the creation of new data sets. Strassel introduces DASL's goals, assumptions, data and tools and reviews annotation and corpus creation efforts and results to date.

Towards a Comprehensive, Empirical Analysis of Linguistic Data: the case of Regional Italian vowel systems

Christopher Cieri, LDC

Jan 16, 2003; 12:00-2:00pm

Any empirical study of language relies necessarily upon a body of observations of linguistic behavior even if the study fails to formally acknowledge its corpus. The decisions one makes in approaching data affect research profoundly by opening some avenues of inquiry while blocking others. Looking across research communities as diverse as sociolinguistics and speech technologies, one finds methods that may be integrated in order to both broaden research possibilities and to perform research more efficiently.

Cieri explores the relationship among data, tools, annotation (or coding) processes and the research they support, focusing specifically on the quantitative analysis of linguistic variation. The data come from a series of sociolinguistic interviews undertaken to investigate the modeling of variation in the regional speech of central Italy.

After describing the motivation for this study, Cieri demonstrates a series of tools, processes and data formats that permit a comprehensive yet rapid analysis of vowel systems. Specifically, he demonstrates tools for transcription and segmentation, lexicons and search tools that automatically select and categorize tokens of interest from the transcripts, batch processes that perform acoustic analyses of the selected tokens and an interface for managing and adding human judgments to these analyses. In the process he offers a particular perspective on tool development, favoring information retention and annotator efficiency over computational efficiency and portability.
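To make the batch step concrete, the sketch below shows how lexicon-driven token selection from a time-aligned transcript might look in Python. The file formats, column names and helper functions are illustrative assumptions, not the actual tools demonstrated in the talk.

```python
import csv

def load_lexicon(path):
    """Load a lexicon mapping orthographic words to the vowel of
    interest, from a CSV with 'word' and 'vowel' columns (an assumed
    format, for illustration only)."""
    with open(path, encoding="utf-8") as f:
        return {row["word"]: row["vowel"] for row in csv.DictReader(f)}

def select_tokens(transcript_path, lexicon):
    """Yield (word, vowel, start, end) for each lexicon word found in
    a time-aligned transcript with one 'start end word' record per
    line (again an assumed format)."""
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            start, end, word = line.split(maxsplit=2)
            word = word.strip().lower()
            if word in lexicon:
                yield word, lexicon[word], float(start), float(end)

# Each selected token would then go to a batch acoustic-analysis step
# (e.g. formant measurement at the vowel midpoint) and finally to an
# interface for checking and adding human judgments, as in the
# workflow described above.
```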

2002

New Methods for Constructing Annotated Speech Corpora

Steven Bird, LDC

Jun 14, 2002; 12:00-2:00pm   

Over the past decade of creating and managing speech corpora, LDC staff have developed hundreds of utilities, user interfaces and file formats. The corpora themselves are becoming increasingly complex in their structure, with rich standoff annotations organized across multiple layers. At the same time, the range of contributing specialties has become more diverse, as illustrated by LDC's publication plans in such areas as field linguistics, sociolinguistics, gesture, and animal communication.

Bird outlines the traditional corpus production process and catalogs the problems LDC has experienced. This provides the backdrop for LDC's R&D effort over the last four years, which has created new software infrastructure and a suite of annotation tools. He introduces the principles and key concepts of the annotation graph toolkit (AGTK), describes the current tools, and gives a brief overview of the tool development process. Finally, Bird introduces OLAC, the Open Language Archives Community, and demonstrates how it is being used for describing and discovering language resources of the kind created at LDC. The talk is followed by an informal demonstration session.
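For readers unfamiliar with the model, the following is a minimal sketch of an annotation graph: nodes optionally anchored to offsets into the signal, with typed, labeled arcs carrying independent standoff annotation layers over the same timeline. The class and field names are illustrative assumptions; AGTK's actual API differs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    id: str
    offset: Optional[float] = None  # anchor into the signal; None if unanchored

@dataclass
class Arc:
    src: str    # id of start node
    dst: str    # id of end node
    type: str   # annotation layer, e.g. "word", "phone", "speaker"
    label: str  # the annotation content itself

@dataclass
class AnnotationGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_arc(self, src, dst, type, label):
        self.arcs.append(Arc(src, dst, type, label))

# Two independent layers over the same timeline:
g = AnnotationGraph()
for nid, t in [("n0", 0.00), ("n1", 0.32), ("n2", 0.71)]:
    g.nodes[nid] = Node(nid, t)
g.add_arc("n0", "n1", "word", "hello")
g.add_arc("n1", "n2", "word", "world")
g.add_arc("n0", "n2", "speaker", "A")
```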

Available: Slides in PDF

More information about AGTK and OLAC is available at agtk.sf.net and www.language-archives.org.

Corpus Development for the ACE (Automatic Content Extraction) Program

Alexis Mitchell and Stephanie Strassel, LDC

Jun 26, 2002; 1:00-3:00pm   

The objective of the ACE Program is to develop core automatic content extraction technology to enable text processing through the detection and characterization of entities, relations, and events. As part of the DARPA TIDES program, ACE supports technology research and development for various classification, filtering, and selection applications by extracting and representing language content (i.e., the meaning conveyed by the data). The ultimate goal of ACE is the development of technologies that will automatically detect and characterize this meaning.

For the past three years, LDC has been developing annotated corpora for the ACE program. Data for ACE consists of newspaper, newswire and broadcast news transcripts. To support Entity Detection and Characterization, ACE annotators label selected types of entities (Persons, Organizations, etc.) mentioned in text data. The textual references to these entities are then characterized and multiple entity mentions are co-referenced. The Relation Detection and Characterization task requires annotators to identify and characterize relations between the labeled set of entities. LDC's role in ACE has recently expanded to encompass annotation of all data for the ACE program as well as development and maintenance of annotation guidelines and annotation tools.
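A rough picture of what such annotations encode is sketched below; the classes, type labels and character offsets are hypothetical, not LDC's actual annotation format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mention:
    start: int   # character offset into the source text
    end: int
    text: str

@dataclass
class Entity:
    id: str
    type: str    # e.g. "Person", "Organization", "GPE"
    mentions: List[Mention] = field(default_factory=list)  # co-referenced

@dataclass
class Relation:
    type: str    # e.g. "Located-In", "Employed-By"
    arg1: str    # entity ids
    arg2: str

# Two textual mentions co-referenced to a single Person entity:
e1 = Entity("E1", "Person", [Mention(0, 14, "George W. Bush"),
                             Mention(40, 53, "the president")])
```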

Mitchell and Strassel describe corpus development for the ACE program, focusing on annotation procedures and guidelines as well as quality assurance measures. In addition, they touch on particular annotation challenges, including the classification of generic entities, the treatment of metonymic entity mentions (including the concept of GeoPolitical Entities) and the identification of the temporal attributes of relations.

Available: Slides in PDF

Dictionary Creation

Mike Maxwell, LDC

Jul 25, 2002; 1:00-3:00pm  

What is a bilingual dictionary? Most of us have used bilingual dictionaries, so the answer seems obvious. But when it comes to defining the structure of a dictionary as a database on a computer, the obvious becomes non-obvious.

Maxwell talks about the structure of a bilingual lexicon, and in particular that of a lexical entry, from a computational and linguistic viewpoint. There are (at least) three levels at which one might define such structure. Proceeding from the most concrete to the most abstract, these are: the file format level (e.g. in terms of an XML structure); a model (using a modeling language such as UML); and an ontology of concepts. 
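As a hedged illustration of the middle, "model" level, a lexical entry might be rendered as follows. The talk itself used UML rather than code, and the class and field names here are assumptions; the same structure could then be serialized as XML at the file-format level.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    gloss: str                 # translation equivalent in the target language
    pos: str                   # part of speech
    examples: List[str] = field(default_factory=list)

@dataclass
class Entry:
    headword: str
    pronunciation: str = ""
    senses: List[Sense] = field(default_factory=list)

# A minimal Spanish-English entry with two senses:
casa = Entry("casa", "ˈkasa",
             [Sense("house", "n"), Sense("household, family", "n")])
```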

Available: Slides in PDF

Scripts/Programs for Large Data Sets

David Graff, LDC

Oct 10, 2002; 12:00-2:00pm   

In any sort of corpus-based language research, the efficiency and usefulness of the research will be limited by the consistency and usefulness of the corpus. Graff focuses on establishing consistency in terms of how language corpora are presented to researchers as input for their work: the directory structure, file structure, document structure, character encoding, the amount and nature of meta-data (information about the corpus content) and how this information is incorporated.

Virtually all text corpora are drawn from "found" data -- material that already exists in electronic form to serve some purpose other than corpus-based language research, such as: publication of books, periodicals or daily news; archival preservation of public, commercial or government transactions; online discussions on various topics among diverse interest groups; and so on. The problem is that each data source has its own unique set of needs and conventions that dictate the data formats used to store and transport its particular content -- as well as its own rate of failure in making sure the data satisfy its needs and conventions.

The task for LDC, working on behalf of corpus researchers, is to design and apply the tools needed to distill each source into a common, standardized form that will (1) maximize the usability of the data on any researcher's chosen computer system, (2) preserve as much information as possible from the source, and (3) discard as much interference and noise as possible -- and do all this with a minimum of manual effort. Graff discusses strategies and tools that have been developed and used at LDC over the years for this purpose.
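One such tool might look like the minimal sketch below: transcode a "found" file to UTF-8, clean up control characters, and wrap the result in uniform document markup. The tag names and command-line conventions are assumptions for illustration, not LDC's actual formats.

```python
import sys

def normalize(in_path, out_path, src_encoding, doc_id):
    """Transcode one source file to UTF-8, normalize line endings,
    drop stray control characters, and wrap it in document markup."""
    with open(in_path, encoding=src_encoding, errors="replace") as f:
        text = f.read()
    # Collapse source-specific line endings; keep tabs and newlines,
    # discard other control characters.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "".join(ch for ch in text if ch in "\n\t" or ch >= " ")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f'<DOC id="{doc_id}">\n{text}\n</DOC>\n')

if __name__ == "__main__":
    normalize(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
```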

(1) BITS and other Machine Translation Collection Projects
(2) Overview of Machine Translations

(1) Xiaoyi Ma and Mark Y. Liberman, (2) Shudong Huang, LDC

Oct 31, 2002; 1:00-3:00pm   

Parallel corpora are valuable resources for machine translation, multilingual text retrieval, language education and other applications, but, for various reasons, their availability is limited. The World Wide Web is a potential source of parallel text, and researchers are exploring this resource for large collections of bitext.

Ma and Liberman present BITS (Bilingual Internet Text Search), a system which harvests multilingual texts over the World Wide Web with virtually no human intervention. The technique is simple, easy to port to any language pair, and highly accurate.
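The paper describes BITS's actual method; the sketch below merely illustrates one common cue such web-mining systems exploit, pairing URLs that differ only in a language marker. The language codes and matching rules are assumptions for illustration.

```python
import re

LANG_MARKERS = ("en", "zh")  # an assumed language pair

def candidate_pairs(urls):
    """Pair URLs like .../en/page.html with .../zh/page.html by
    replacing the language marker with a wildcard and grouping."""
    index = {}
    for url in urls:
        for code in LANG_MARKERS:
            pattern = rf"(^|[./_-]){code}([./_-])"
            if re.search(pattern, url):
                key = re.sub(pattern, r"\1*\2", url)
                index.setdefault(key, {})[code] = url
    for by_lang in index.values():
        if len(by_lang) == len(LANG_MARKERS):
            yield by_lang  # one candidate bitext document pair

urls = ["http://site.org/en/news1.html", "http://site.org/zh/news1.html"]
print(list(candidate_pairs(urls)))
```

In practice, candidate pairs found this way would still need content-based validation (e.g. comparable lengths and alignable structure) before being accepted as bitext.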

Available: Paper in PDF

Mining the Bibliome: Information Extraction from Biomedical Text

Mark Liberman, LDC

Dec 19, 2002; 12:00-2:00pm 

The goal of this work is qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent progress and new research in three areas: high-accuracy parsing, shallow semantic analysis, and integration of large volumes of diverse data. Liberman describes two applications: drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline, and pediatric oncology, in collaboration with researchers in the eGenome group at Children's Hospital of Philadelphia. These applications, worthwhile in their own right, provide excellent test beds for broader research efforts in natural language processing and data integration.