The LDC Institute is a seminar series on issues broadly related to linguistics, computer science, natural language processing and human language technology development. Featured speakers include researchers from LDC, the Penn community and distinguished scholars from around the globe.
LDC Institute Archive
Comparing Dialect and Accented Pronunciations on the Basis of Transcriptions and Articulography
Martijn Wieling, Department of Humanities Computing, University of Groningen
September 12, 2014; 3:00 to 4:30 pm
Dr. Wieling introduces the Pointwise-Mutual-Information-based distance which is able to determine sensitive pronunciation distances and sound segment distances on the basis of transcribed speech (Wieling et al., 2012). He then applies the PMI-based Levenshtein distance to transcriptions of accented English speech obtained from the Speech Accent Archive and validates the method by comparing the computational distances to perceptual judgments of hundreds of native American English speakers (Wieling et al., in press).
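The idea of a PMI-weighted edit distance can be sketched roughly as follows. This is a simplified illustration, not the procedure from Wieling et al. (2012): the function names, the rescaling of PMI values into costs between 0 and 1, and the fixed insertion/deletion cost are all assumptions made for the example. Segment pairs that align with each other more often than chance predicts receive a high PMI and therefore a low substitution cost.

```python
import math
from collections import Counter


def pmi_substitution_costs(aligned_pairs):
    """Turn aligned segment pairs into substitution costs in [0, 1].

    Pairs with a high pointwise mutual information get a low cost.
    The linear rescaling is an assumption made for this sketch.
    """
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    pmi = {
        (x, y): math.log2((n / total) / ((left[x] / total) * (right[y] / total)))
        for (x, y), n in pair_counts.items()
    }
    hi, lo = max(pmi.values()), min(pmi.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all PMIs are equal
    return {pair: (hi - value) / span for pair, value in pmi.items()}


def weighted_levenshtein(a, b, sub_costs, indel=1.0, default_sub=1.0):
    """Levenshtein distance whose substitution cost depends on the segment pair."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            sub = 0.0 if x == y else sub_costs.get((x, y), default_sub)
            cur.append(min(prev[j] + indel,       # delete x
                           cur[j - 1] + indel,    # insert y
                           prev[j - 1] + sub))    # substitute x with y
        prev = cur
    return prev[-1]


# Toy alignment counts (invented for illustration only).
toy_pairs = [("t", "d")] * 3 + [("t", "k")] + [("p", "k")] * 3
toy_costs = pmi_substitution_costs(toy_pairs)
```

In a fuller treatment the costs would be re-estimated from the new alignments and the process repeated until the alignments stop changing; the sketch performs a single pass only.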
Dr. Wieling also illustrates how articulography (i.e. measuring the movement of tongue and lips during speech) is useful for comparing pronunciations. To demonstrate the method, he shows significant movement differences between native and non-native speakers of English, as well as Dutch dialect speakers.
Martijn Wieling, Jelke Bloem, Kaitlin Mignella, Mona Timmermeister and John Nerbonne (forthcoming). Automatically measuring the strength of foreign accents in English. Language Dynamics and Change.
Martijn Wieling, Eliza Margaretha and John Nerbonne (2012). Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40(2), 307-314.
Available: Slides in PDF
Aikuma: A Mobile App for Collaborative Language Documentation
Steven Bird, University of Melbourne; LDC, University of Pennsylvania
July 2, 2014; 12:00-1:00 pm
Dr. Bird describes Aikuma, a mobile app that is designed to put the key language documentation tasks of recording, respeaking, and translating in the hands of a speech community. He discusses the motivation for this approach, describes the system, reports on its use in fieldwork and presents ongoing research and development activities.
Available: Slides in PDF
The Corpus of Interactional Data: A Large Multimodal Annotated Resource
Philippe Blache, CNRS; Aix-Marseille University
November 20, 2013; 12:00-1:30 pm
Professor Blache presents a methodology for creating annotated multimodal corpora. Such resources have to take into account all types of linguistic information, from phonetics to discourse and gesture, so a homogeneous framework is needed to represent these different domains coherently. To that end, he proposes an abstract annotation scheme based on typed feature structures. He shows how such a scheme can be instantiated at a large scale, using the example of a large multimodal corpus, the CID (Corpus of Interactional Data). After presenting the abstract scheme, he details the annotation process for the different domains and concludes by showing how such a scheme supports data reusability and the interoperability of annotation tools.
Available: Slides in PDF
The Sociolinguistic Archive and Analysis Project: Data, Tools and Applications
Tyler Kendall, University of Oregon
May 18, 2012; 12:00-2:00pm
Dr. Kendall describes his work on the Sociolinguistic Archive and Analysis Project (SLAAP), a web-based data preservation and exploration initiative centered at North Carolina State University. SLAAP houses audio recordings and associated materials from over 2,500 sociolinguistic interviews. In addition to its basic, preservational, organizational and access related features, a centerpiece of the archive is a time-aligned, databased transcription model which allows for dynamic, corpus-like analysis of the transcript data as well as real-time phonetic processing of the audio from the transcripts. The presentation describes the background of the project and its architecture, provides a demonstration of several of the web-based features, and discusses some of the ways that the archive and tools are being used in current sociolinguistic research.
Available: Slides in PDF
Building a universal corpus of the world's languages
Steven Bird, University of Melbourne; LDC, University of Pennsylvania
Jul 20, 2011; 12:30-2:30pm
This talk reports ongoing work on developing a corpus that includes all of the world's languages, represented in a consistent structure that permits large-scale cross-linguistic processing (Abney & Bird 2010). The focal data types, bilingual texts and lexicons, relate each language to a reference language. The ability to train systems to translate into and out of a given language is proposed as the yardstick for determining when that language is adequately documented. Bird reports on recent efforts to incorporate datasets built by the language resources community, via the "Language Commons". He describes a new project that is recording and transcribing material from a large number of unwritten languages in Papua New Guinea.
Steven Abney and Steven Bird, The Human Language Project: Building a universal corpus of the World's languages, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010).
Available: Slides in PDF
Coding Conventions for Archival Sharing
Malcah Yaeger-Dror, University of Arizona; LDC consultant
Jun 29, 2011; 12:30-2:30pm
Dr. Yaeger-Dror discusses her recent work at LDC formulating coding conventions for speech archives in three areas: (1) coding for the social situation; (2) demographic coding that could be used as relevant research in future studies; and (3) the influence of interpersonal attitudes on speech variation. Most of the talk centers on this last point, focusing on the speakers' attitudes toward their interlocutors and how one might be able to go about determining this information without recourse to Gilesian psychological studies.
Available: Slides in PDF
Free recall of word lists; empirical and theoretical issues
Michael J. Kahana, University of Pennsylvania
Jun 15, 2011; 1:30-3:00pm
Professor Kahana discusses the major empirical phenomena concerning recall of common words and the theoretical issues raised by these phenomena. He shows how memory researchers have devised theories to explain these data and presents some critical tests of those theories.
Contact, Restructuring, and Decreolization: The Case of Tunisian Arabic
Thomas A. Leddy-Cecere, Dartmouth College
Jan 14, 2011; 12pm-2pm
The modern Arabic colloquial dialects stand out in the world of dialectology and historical linguistics. Though all languages display dialectal variation, Arabic represents a special case -- attempts to classify and trace its varieties by standard linguistic techniques do not produce satisfactory results. It has been suggested (Versteegh, 1984) that this failure could be explained by positing that modern Arabic is a product of creolization. Leddy-Cecere's study represents a focused, cross-dialectal examination of a single Arabic dialect area (Tunisia) in search of evidence of creolization and subsequent effects.
Language Technology Resources for Sanskrit and other Indian Languages at Jawaharlal Nehru University, India
Girish Nath Jha, Jawaharlal Nehru University; University of Massachusetts Dartmouth
Jul 22, 2010; 12pm-2pm
The introductory section of the talk presents the complex linguistic scenario of India and the role Sanskrit has to play in it. The next section discusses the language technology resources being developed at Jawaharlal Nehru University, New Delhi, the premier university of India, for Sanskrit and other major Indian languages under Technology Development for Indian Languages -- an initiative of the Indian government’s Department of Information Technology. The concluding section highlights the issues and challenges being faced and the scope for future collaboration.
Available: Slides in PDF
Bibliotheca Alexandrina: The oldest library in the digital age
Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina
Magdy Nagi and Ahmed Bhargout, Bibliotheca Alexandrina
Apr 28, 2010; 3pm-5pm
"Bibliotheca Alexandrina: The oldest library in the digital age"
Since its inauguration in 2002, Bibliotheca Alexandrina (BA) has devoted itself to being a center of excellence and the right platform to address the ever-advancing technology of our time. This presentation sheds light on some of the initiatives of the BA, carried out by the International School of Information Science (ISIS), to document and preserve heritage through digital archives and to offer the right platform for advancing scientific research, in addition to its digital initiatives to boost children's skills and to catalyze reform in the Arab world.
Available: Slides in PDF
"Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina"
The Universal Networking Language (UNL) project is regarded as one of the most promising scientific research initiatives at the BA carried out by the ISIS. This presentation gives an overview of the history of the UNDL Foundation, which devises the program, and its partnership with the BA. The presentation also gives a comprehensive overview of the UNL specifications, the software and system development at the BA including the UNL +3 engines and Knowledge Extraction System, and the tools deployed for the center. Moreover, the Book Catalogue is demonstrated as one of the applications on top of the UNL.
U.S. Supreme Court Corpus (SCOTUS)
Daniel Katz, J.D., Michigan Law School and Michael Bommarito, University of Michigan
Jan 26, 2010; 10am-12pm
The corpus of Supreme Court written opinions is a rich linguistic resource. Not only does this corpus provide a longitudinal sample of formal American English, but it is also a source of text with identified authors and vote-coded sentiment. Despite this value and years of qualitative and quantitative material of the United States Supreme Court, no compiled corpus of these opinions is currently available to researchers. The purpose of this talk is (1) to describe efforts to compile both the complete corpus of Supreme Court Opinions and associated metadata, (2) to outline a number of current research projects utilizing this data, and (3) to discuss any criticism, potential projects, or possible collaboration.
Available: Slides in PDF
Variations Across Languages, Divisions Within Communities: Languages, Schools and the Internet in Tunisia
Simon Hawkins, Franklin and Marshall College
Nov 12, 2009; 12-2pm
Despite the convenient categorization of languages, particularly English, into national varieties, some specific discourses cross national and linguistic borders. For some of these varieties, such as academic writing and Internet conventions, what is linguistically important is not the language used, but the global discourses and language ideologies of which they are a part. Multilingual practices in Tunisia illustrate these examples.
The LDC Standard Arabic Morphological Tagger
Rushin Shah, LDC
Jun 10, 2009; 12-2pm
The current process of Arabic corpus annotation at LDC relies on using the Standard Arabic Morphological Analyzer (SAMA) to generate various morphology and lemma choices, and supplying these to manual annotators who then pick the correct choice. However, a major constraint of this process is that SAMA can generate dozens of choices for each word, each of which must be examined by the annotator. Moreover, SAMA does not provide any information about the likelihood of a particular choice being correct. A system that ranks these choices in order of their probabilities and manages to assign the highest or second-highest rank to the correct choice with a high degree of accuracy would hence be very useful in accelerating the rate of annotation of corpora. Such a system would also be able to aid intermediate Arabic language learners by creating annotated versions of news articles or other web pages submitted by them.
Shah describes such a model that simultaneously performs morphological analysis and lemmatization for Arabic, using choices supplied by SAMA. Morphological labels are converted into vectors of various morphosyntactic features, such as basic part-of-speech, gender, number, mood, prefixes, suffixes, case, etc. These various attributes of the supplied Arabic data are then used to create models for lemmas and MSFs. Individual models are combined into one aggregate model that simultaneously predicts lemmas and complete morphological analyses. This model achieves accuracy in the high nineties.
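A ranking step of this kind might look roughly like the sketch below. It is not Shah's model: the probability tables, the feature names, and the way the individual scores are combined (summing log-probabilities as if the features were independent) are all assumptions made for illustration, and the candidate analyses are placeholders rather than real SAMA output.

```python
import math


def rank_candidates(candidates, feature_models, floor=1e-6):
    """Sort candidate analyses from most to least probable.

    candidates: list of dicts mapping a feature name (e.g. "lemma", "pos")
        to the value proposed by the analyzer.
    feature_models: dict mapping a feature name to a {value: probability}
        table. Values missing from a table get the small `floor` probability.
    """
    def score(candidate):
        # Assumption: features are scored independently and combined by
        # adding log-probabilities.
        return sum(
            math.log(feature_models.get(name, {}).get(value, floor))
            for name, value in candidate.items()
        )

    return sorted(candidates, key=score, reverse=True)


# Placeholder models and candidates, invented for this example.
toy_models = {
    "lemma": {"lemma_a": 0.7, "lemma_b": 0.3},
    "pos": {"noun": 0.6, "verb": 0.4},
}
toy_candidates = [
    {"lemma": "lemma_b", "pos": "verb"},
    {"lemma": "lemma_a", "pos": "noun"},
    {"lemma": "lemma_a", "pos": "verb"},
]
ranked = rank_candidates(toy_candidates, toy_models)
```

An annotator would then be shown the candidates in ranked order, so that the correct choice tends to appear in the first or second position.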
Available: Slides in PDF
Building an ASL Corpus Project
Gaurav Mathur, Gene Mirus, and Paul Didus, Gallaudet University
Jan 15, 2009; 3-5pm
There has been a need for the present American Deaf communities to preserve a sample of their language so that they can appreciate the richness of their language for generations. There is also a practical need for materials that can be used in a wide variety of settings, ranging from American Sign Language (ASL) instruction for deaf children to training people who work with deaf communities. This talk describes a long-term project that meets those needs, namely, the establishment of an ASL corpus by collecting a comprehensive and representative sample of ASL from around the nation. The talk opens with a description of other sign language corpora outside the United States that have been successful and then offers an outline of the ASL corpus project that is currently underway, including the kinds of data to be collected and the methodology to be used for the data collection.
Development of Resources and Techniques for Processing of Some Indian Languages
Shyam S. Agrawal, KIIT College of Engineering
Jul 17, 2008; 11:30-1:30pm
In the past few decades there has been a pressing demand and need to develop speech and language corpora for training, testing and benchmarking of speech technology systems for various applications. Richly annotated corpora and labeled databases are needed to develop models of spoken languages and also to understand the structure of speech and the variability that occurs in speech signals.
This talk presents some of the phonetic differences between Hindi and English and gives an overview of the efforts made by CDAC, Noida and some other institutions to develop text and speech databases, tools and techniques for processing some Indian languages. For the collection of speech databases, the talk covers issues and procedures related to text selection, e.g. multiform units, and variables such as demographic, dialectal, environmental, emotional and linguistic background. Special tools developed for the analysis of text are described; the objective has been that the tools should be adaptable to other Indian languages and to other languages as well. Some of the application-oriented, task-specific databases, such as the ELDA-sponsored database for Hindi and the CFSL Speaker Identification database for forensic applications, are described in detail.
Available: Slides in PDF
HTML Templates for LDC Sponsored Projects
Shawn Medero, LDC
Jul 19, 2007; 2:30-4:30pm
This is an introduction to existing resources for communicating LDC projects over the web. The templates presented provide a professional, consistent and attractive design while allowing creativity and variation in each project's web site approach. They help define a predictable and organized navigation structure for LDC employees, project sponsors, researchers and general visitors. Questions concerning use of web standards, such as CSS and HTML, are encouraged throughout the presentation.
Speaking Arabic in Iraq and the Middle East: Reflections on Three Tours of Duty
Kenneth Gardner, USMC, ret.
Jul 13, 2007; 2:30-4:30pm
Kenneth Gardner served in the U.S. military for 22 years. He learned Arabic at the Defense Language Institute Foreign Language Center in 1995. Gardner shares his experiences as a non-native Arabic speaker communicating with native speakers in a variety of settings as well as the issues faced by monolingual U.S. troops in the field.
Programming Specifications: Procedures and Practices
Andrew Cole, LDC
Jun 20, 2007; 3-5pm
Virtually all tasks at LDC depend on programmer input. LDC policy requires that most requests for programming assistance be accompanied by a specification that describes the desired output and includes some estimate of the programming time needed to complete the tasks in the specification. Cole outlines a simple set of guidelines for LDC staff to follow when making programming requests and illustrates how those guidelines work by using them to develop a business systems specification.
Comparing Linguistic Annotations -- Issues in Harmonization and Quality Control
Christopher R. Walker, LDC
Oct 26, 2006; 3-5pm
Consistency analysis was an important aspect of quality assessment in 2005 ACE data creation. Within this framework, Walker became quite interested in the various assumptions and applications of annotation scoring infrastructure. As he attempted to better understand the bounds of the problem and solution spaces, it quickly became clear that there was no existing discussion of these issues -- and very little documentation of general best practices.
In this talk Walker seeks to reduce this gap by outlining the apparent problem and solution spaces and by opening for discussion the utility of annotation scoring metrics in the various domains of empirical, computational and corpus linguistics -- and more cogently in the domain of quality control for linguistic data creation.
Recording and Annotation of Speech Data via the WWW - A Case Study
Dr. Christoph Draxler, Ludwig-Maximilians University
Sep 22, 2006; 10:30-11:30am
The German Ph@ttSessionz project will create a database of 1000 adolescent speakers balanced by gender and covering all major German dialect areas. The project employs a novel approach to collecting speech data: all recordings are performed via the WWW -- using a web application in a standard web browser -- in more than thirty-five German public schools. Speech is recorded using a standardized audio setup on the school's PC, and the signal and administrative data are immediately transferred to the BAS server in Munich. Using this approach, geographically distributed recordings in high bandwidth quality can be made efficiently and reliably.
Draxler describes the Ph@ttSessionz web application and its major components, SpeechRecorder and WebTranscribe, and outlines the infrastructure developed at BAS for WWW-based speech recordings. He also discusses the strategies employed to enlist schools in the project and presents preliminary analyses of the Ph@ttSessionz speech database.
David Graff, LDC
Jul 27, 2006; 3-5pm
Graff presents an overview of LDC Online's corpora coverage, search methodology and future plans for growth.
Pros and Cons of Different Annotation Workflow Systems
Seth Kulick, IRCS; Julie Medero, Hubert Jin, David Graff and Kevin Walker, LDC
May 4, 2006; 3-5pm
This LDC Institute is part of an effort to determine the desirable properties for a single workflow system that can be used (or extended as appropriate) in the various annotation projects at LDC and IRCS (Penn's Institute for Research in Cognitive Science). Because the several different workflow systems that are currently used were designed for projects with different needs, they handle many issues differently, among them support for local and remote annotation, sophistication of report capability, use of automated tagging, and flexibility in the specification of workflow stages.
The speakers are LDC/IRCS programmers who designed some of the current systems. They discuss the following topics: (1) the properties of the different systems; (2) why some characteristics of a particular workflow system might make it unsuitable for a particular project; (3) properties that should be added to the workflow systems; and (4) alternative ways of setting up a workflow system.
Recent Trends in Annotation Tool Development at LDC
Kazuaki Maeda, Julie Medero and Haejoong Lee, LDC
Apr 20, 2006; 3-5pm
LDC has created large volumes of annotated linguistic data for a variety of evaluation programs and projects using highly customized annotation tools developed on site. LDC programmers Maeda, Medero and Lee discuss the history of annotation tool development at LDC and share some current approaches. Two tools in particular are highlighted: (1) LDC's model for decision point-based annotation and adjudication, which was used effectively in the ACE 2005 annotation effort; and (2) XTrans, a new speech transcription and annotation tool particularly suited for transcribing meeting speech that was used by LDC in the NIST meeting recognition evaluation and Mixer Spanish and Russian telephone conversation studies.
Building a Lexicon Database for Arabic Dialects
David Graff, LDC
Dec 8, 2005; 3:00-5:00pm
One of the major problems in creating a lexicon database for Arabic dialects is the fact that standardized orthographic (spelling) conventions do not generally exist. The word forms generated by transcribers from recorded conversations are based on relatively loose conventions and show significant variability within any given dialect. Graff describes how that problem is being resolved by creating a relational database design that makes the transcripts a key part of the database so that repairs to word forms in the lexicon table are propagated automatically to the transcripts. He also reviews some earlier approaches to lexicon building, describes the annotation tools developed specifically for the current lexicon project, and briefly considers some possible extensions to the database structure and annotation methods LDC currently uses to cover tasks such as treebank annotation.
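The design principle described here, in which transcripts refer to lexicon entries rather than storing their own copies of each word form, can be sketched with a small in-memory database. The table names, column names and word forms below are hypothetical and are not taken from the LDC database; the point is only that a repair made once in the lexicon table shows up in every transcript that uses the entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL
);
-- Each transcript token points at a lexicon entry instead of storing text.
CREATE TABLE token (
    utterance_id INTEGER NOT NULL,
    position     INTEGER NOT NULL,
    lexicon_id   INTEGER NOT NULL REFERENCES lexicon(id),
    PRIMARY KEY (utterance_id, position)
);
""")
conn.executemany("INSERT INTO lexicon (id, form) VALUES (?, ?)",
                 [(1, "form_a"), (2, "form_bb")])
conn.executemany("INSERT INTO token VALUES (?, ?, ?)",
                 [(1, 0, 1), (1, 1, 2), (2, 0, 2)])


def render(utterance_id):
    """Rebuild the text of an utterance from the current lexicon forms."""
    rows = conn.execute(
        "SELECT l.form FROM token t JOIN lexicon l ON l.id = t.lexicon_id "
        "WHERE t.utterance_id = ? ORDER BY t.position",
        (utterance_id,),
    )
    return " ".join(form for (form,) in rows)


before = render(1)
# Repair one spelling in the lexicon; no transcript rows are touched.
conn.execute("UPDATE lexicon SET form = ? WHERE id = ?", ("form_b", 2))
after = render(1)
```

Because the transcript text is always rebuilt through the join, both utterances that use entry 2 reflect the repaired form without any separate transcript edit.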
Available: Slides in PDF
Less Commonly Taught Languages (LCTLs)
Mark Mandel, LDC
Nov 17, 2005; 3:00-5:00pm
One of LDC’s principal tasks in the Less Commonly Taught Languages (LCTLs) portion of the REFLEX (Research on English and Foreign Language Processing) Project is to discover, produce, and maintain language resources for the target languages. Those resources include, among other things, linguistic information, writing systems, converters, word segmenters, electronic lexicons, monolingual texts, bilingual texts, morphological parsers, and tools for producing and using annotated resources. Mandel describes the challenges associated with assembling those language resources, as well as the current progress of LDC’s work focusing on Thai, Urdu, Bengali, Panjabi, Hungarian, Tamil and Yoruba.
The Teaching of Berber in Morocco: Reality and Perspectives
Fatima Agnaou, IRCAM
Jul 7, 2005; 2:00-4:00pm
Agnaou discusses IRCAM (The Royal Institute of Amazigh Culture) and its realization in regard to the integration of Berber in Moroccan schools. She addresses the aims and objectives of teaching in Berber and the standardization of the Berber language. A presentation of textbooks and other teaching materials is also included along with teacher training and methodology.
Otakar Smrz, Charles University
Feb 3, 2005; 12:30-2:30pm
Computational morphological models are usually implemented as finite-state transducers. The morphologies of natural languages are, however, better described in terms of inflectional paradigms, lexicons, and categories (inflectional and inherent parameters). Markus Forsberg and Aarne Ranta have recently introduced a framework called Functional Morphology (FM) smoothly reconciling both of these viewpoints. Linguists can model their systems without any 'finite-state restrictions', using the full power of the functional language Haskell and delegating actual computational issues to FM. The morphological models become clearer, reusable, (ex)portable, and even more efficient. Smrz highlights the noticeable features of FM/Haskell and outlines plans to use it for Arabic. He also refers to other languages, including Latin, Swedish, Sanskrit, Spanish and Russian.
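The paradigm-based view can be illustrated with a small sketch. It is written in Python for illustration and does not use Functional Morphology's Haskell API; the function names and the one-word lexicon are assumptions. A paradigm is modelled as a function from a citation form to a full inflection table, here for Latin first-declension nouns with vowel length left unmarked, and analysis is obtained by searching the generated tables.

```python
def first_declension(lemma):
    """Hypothetical paradigm function for Latin first-declension nouns."""
    if not lemma.endswith("a"):
        raise ValueError("expected a citation form ending in -a")
    stem = lemma[:-1]
    endings = {
        ("nom", "sg"): "a",  ("gen", "sg"): "ae",   ("dat", "sg"): "ae",
        ("acc", "sg"): "am", ("abl", "sg"): "a",
        ("nom", "pl"): "ae", ("gen", "pl"): "arum", ("dat", "pl"): "is",
        ("acc", "pl"): "as", ("abl", "pl"): "is",
    }
    return {slot: stem + ending for slot, ending in endings.items()}


def analyses(form, lexicon):
    """List every (lemma, case, number) whose generated form equals `form`.

    lexicon maps each lemma to the paradigm function that inflects it.
    """
    return [
        (lemma, case, number)
        for lemma, paradigm in lexicon.items()
        for (case, number), generated in paradigm(lemma).items()
        if generated == form
    ]


toy_lexicon = {"rosa": first_declension}
table = first_declension("rosa")
readings = analyses("rosae", toy_lexicon)
```

The linguist writes only the paradigm function and the lexicon; generation and analysis both fall out of the same description, which is the separation of concerns the abstract attributes to FM.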
Mona Diab, Stanford University
May 13, 2004; 12:30-2:30pm
The holy grail of computational linguistics, from the time of the field's inception, has been (automatic) natural language understanding. Semantic parsing appears as a significant stride in that direction giving researchers a glimpse into the world of concepts in a functional and operational manner. Thanks to the English PropBank, a jumpstart for the task of semantic role assignment is noted in several community wide standardised evaluations such as CoNLL and Senseval3. In fact, the porting of PropBank style annotation to ten Chinese verbs has achieved very interesting results in a relatively short period of time (Sun & Jurafsky, 2004), but more importantly, has shown some relevant quantifiable linguistic variation between English and Chinese.
Diab focuses on three of the verbs which are fully annotated, discussing the different frames and roles associated with their arguments and adjuncts. The main issue that arises frequently is that of consistency and variation, especially with respect to assigning ARGM and ARG3 roles to constituents. She advocates a more generalised consistent annotation across verbs (sometimes it shifts even within a single verb [potentially sense variation]) as well as within a verb. Diab discusses some of the rudimentary guidelines she has set for herself inspired by the annotation guidelines from the English and Chinese PropBanks.
Colonel Stephen A. LaRocca, Center for Technology Enhanced Language Learning
Mar 23, 2004; 1:30-2:30pm
The Center for Technology Enhanced Language Learning (CTELL), organized within the Department of Foreign Languages at the U.S. Military Academy, collects speech data and turns it into speech recognition software primarily for the benefit of cadets learning languages at West Point. Since 1997 CTELL has collected broadband speech corpora for Arabic, Russian, Portuguese, American English, Spanish, Croatian, Korean, German and French.
Tongue-Tied in Singapore: A Language Policy for Tamil?
Harold F. Schiffman, University of Pennsylvania, Department of South Asia Studies
Feb 26, 2004; 12:30-2:30pm
The Tamil situation in Singapore is one that lends itself ideally to the study of minority language maintenance. The Tamil community is small and its history and demographics are well known. The Singapore educational system supports a well-developed and comprehensive bilingual education program for its three major linguistic communities on an egalitarian basis, so Tamil is a sort of test-case for how well a small language community can survive in a multilingual society where larger groups are doing well. But Tamil is acknowledged by many to be facing a number of crises; Tamil as a home language is not being maintained by the better-educated, and Indian education in Singapore is also not living up to the expectations many people have for it. Educated people who love Tamil are upset that Tamil is becoming thought of as a 'coolie' language and regret this very much. Since Tamil is a language characterized by extreme diglossia, there is the additional pedagogical problem of trying to maintain a language with two variants, but with a strong cultural bias on the part of the educational establishment for maintaining the literary dialect to the detriment of the spoken one.
Schiffman examines these attempts to maintain a highly-diglossic language in emigration and concludes that the well-meaning bilingual educational system actually produces a situation of subtractive bilingualism.
The Contextualization of Linguistic Forms across Timescales
Stanton Wortham, University of Pennsylvania, Graduate School of Education
Feb 10, 2004; 1:00-3:00pm
When people speak, they socially identify themselves and others. The use of linguistic forms is one important means through which social identities get established. But the implications of any utterance for social identity depend on relevant context. Decontextualized sociolinguistic regularities cannot fully explain how a given utterance establishes a given social identity, although such regularities certainly play an important role. The centrality of context seems to imply, methodologically, that a full linguistic analysis of social identification must rely on case studies of particular utterances in context. Social identification, however, is not a phenomenon of isolated cases. An individual gets socially identified across a series of interrelated events, not a series of unique, unrelated contexts.
Wortham describes an empirical research project which traces the identity development of a ninth grade student across the academic year in one classroom. The student’s identity develops in substantial part through speech events that position her in socially recognizable ways. Wortham presents a methodological strategy for analyzing the interrelated events across which this individual gets socially identified, focusing on specific kinds of speech events that play a central role in the emergence of her social identity across time. This project provides an opportunity to reflect on the kinds of data necessary for studying social identification.
Interfaces for Parser and Dictionary Access
Malcolm D. Hyman, Harvard University
Jan 26, 2004; 1:30-3:30pm
The subject of this presentation is "linguistic middleware" — software designed to mediate between backend linguistic tools and data sources (for instance, tokenizers, morphological analyzers, and parsers) and frontend user agents (browsers and editors). Making linguistic data available within graphical user agents will allow for rich, next-generation working environments that can offer substantial benefits to language researchers and students. Simple but powerful interfaces will allow for interoperability between diverse technologies, including legacy systems. Current web-services and XML standards provide the basis for the development of such interfaces. The goal is distributed networks that connect arbitrary tools, databases, reference works, and corpora; ultimately, this architecture will help to break down barriers between scholarly communities and to enrich the work of linguists, philologists, technologists, historians, and literary scholars.
In order to realize the vision of generalized linguistic middleware, we need to address a range of challenges encountered in typologically diverse languages and writing systems. Hyman focuses on:
- multiple approaches to tokenization required by different writing systems
- orthographic normalization
- handling different "window sizes" needed for context-sensitive analysis
- strategies for identifying lexical items that are realized discontinuously
- metalanguages for morphosemantic and syntactic category labels
The discussion is accompanied by demonstrations of some prototype implementations and solutions.
Finite State Morphology using Xerox Software
Kenneth Beesley, XRCE
Dec 9, 2003; 12:00-2:00pm
Morphological analysis (word analysis) and generation are foundation technologies used in many kinds of natural-language processing, including dictionary lookup, language teaching, part-of-speech disambiguation (tagging), syntactic parsing, etc. Successful and publicly available software implementations based on "finite-state" theory include Koskenniemi's Two-Level Morphology, AT&T Lextools, Groningen University's Fsm Utils, and Xerox's lexc and xfst languages. The Xerox tools are now available on CD-ROM in a book entitled "Finite State Morphology", Beesley & Karttunen, 2003, CSLI Publications. Beesley briefly and gently covers the history and underlying theory of finite-state morphology, and introduces lexc and xfst syntax.
Finite-state morphological analyzers are excellent projects for a Master's Thesis, and for field linguists they are a practical way to encode, computerize and test morphotactic grammars, alternation rules and lexicons that would otherwise remain inert on paper. Finite-state morphology has been successfully applied to languages around the world, including the obviously "commercial" European languages, Finnish, Hungarian, Basque, Turkish, Korean, Arabic, Syriac, Hebrew, several Bantu languages, a variety of American Indian languages, etc.
Searching through Prague Dependency Treebank
Jiri Mirovsky and Roman Ondruska, Charles University
Oct 15, 2003; 1:00-3:00pm
Netgraph is a search tool for annotated treebanks. Originally developed for the Prague Dependency Treebank, Netgraph is a multiuser system with a net architecture. This means that more than one user can access it at the same time and its components may be located in different nodes of the Internet.
Mirovsky and Ondruska show NetGraph in use. Although work with NetGraph is easy, the client-server architecture requires actions from the user which a stand-alone application does not. They describe four parts of working with NetGraph -- connecting to the server, selecting files, creating a query, and viewing the results. NetGraph provides a simple-to-use and powerful query language. The basic query is a tree structure with a few evaluated attributes. Searching a corpus given a query means searching for all trees which contain the query tree as a subtree. This basic functionality is improved by so-called meta attributes -- an easy way to add more restrictions to found trees, e.g. size, orientation, position of query tree, forbidden nodes, etc. Mirovsky and Ondruska show several examples of queries, from the simplest to more complex.
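The basic search described above, finding all trees that contain a query tree as a subtree, can be sketched in Python as follows. This is a minimal illustration, not Netgraph's query language or implementation: the `Node` class, the attribute names (`pos`, `func`, `lemma`) and their values are assumptions, word order is ignored, and meta attributes are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node with attribute-value pairs and child nodes."""
    attrs: dict
    children: list = field(default_factory=list)


def node_matches(query, target):
    """True if every attribute set on the query node has the same value on the target."""
    return all(target.attrs.get(k) == v for k, v in query.attrs.items())


def assign_children(q_children, t_children):
    """Backtracking search mapping each query child to a distinct target child."""
    if not q_children:
        return True
    first, rest = q_children[0], q_children[1:]
    for i, candidate in enumerate(t_children):
        if matches_at(first, candidate):
            remaining = t_children[:i] + t_children[i + 1:]
            if assign_children(rest, remaining):
                return True
    return False


def matches_at(query, target):
    """True if the query tree can be embedded with its root placed on `target`."""
    return node_matches(query, target) and assign_children(query.children, target.children)


def contains(query, tree):
    """True if the query matches at the root of the tree or anywhere below it."""
    return matches_at(query, tree) or any(contains(query, c) for c in tree.children)


def search(query, corpus):
    """Return the trees of the corpus that contain the query tree as a subtree."""
    return [tree for tree in corpus if contains(query, tree)]


# Two toy dependency trees and a query: a verb governing an object.
tree1 = Node({"lemma": "read", "pos": "V"}, [
    Node({"lemma": "student", "pos": "N", "func": "Sb"}),
    Node({"lemma": "book", "pos": "N", "func": "Obj"}),
])
tree2 = Node({"lemma": "sleep", "pos": "V"}, [
    Node({"lemma": "dog", "pos": "N", "func": "Sb"}),
])
query = Node({"pos": "V"}, [Node({"func": "Obj"})])
hits = search(query, [tree1, tree2])
```

Only the attributes given in the query are compared, which corresponds to the idea of a query tree with a few evaluated attributes.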
The Pennsylvania Sumerian Dictionary Project
Stephen Tinney, University of Pennsylvania Museum of Archaeology & Anthropology
Nov 12, 2003; 1:30-3:30pm
The Pennsylvania Sumerian Dictionary project is an online dictionary which combines lexicon and text-corpora within an interface with multiple entry points to the Dictionary, multiple views of the lexicon or individual items and reverse navigation back from the text-corpora. The framework within which this system is implemented is a generic XML data structure for corpus-based dictionaries.
The data structure binds control lists with the references drawn from the text-corpora. These references are tagged with morphological and semantic information to enable programmatic generation of lexical articles containing exhaustive information on orthography, chronological and geographical distribution and usage. Tinney describes the particular problems associated with writing a dictionary of Sumerian and the corpus-based dictionary model and demonstrates the state of the current implementation.
Arabic Language: Issues and Perspectives
Mohamed Maamouri and Tim Buckwalter, LDC
Apr 10, 2003; 12:00-2:00pm
The presentation starts from the early standardization of Arabic and leads to the emergence of 'diglossia' and its linguistic and sociolinguistic consequences. The dominant attitudes toward linguistic reforms are also presented. The second focal point is the Arabic reading process, its challenges and its consequences for reading performance and for education in general.
In a connected second part of the same presentation, Buckwalter focuses on Arabic NLP Issues. He presents his morphological analyzer and lexicon. A brief overview of the LDC Arabic Treebank project follows.
David Miller, LDC
Mar 20, 2003; 12:00-2:00pm
Miller discusses the various speech collection projects undertaken at LDC from 1995-2003. Included is a discussion of telephone speech data collection projects and on-site speech collection projects. Data collection projects covered include CallHome, CallFriend, Switchboard, ROAR and FISHER.
Data and Annotations for Sociolinguistics (DASL): Using digital data to address issues in sociolinguistic theory
Stephanie Strassel, LDC
Jan 16, 2003; 12:00-2:00pm
A longstanding focus of sociolinguistic research has been the quantitative analysis of language variation and change, an endeavor that necessarily begins with the empirical observation and statistical description of linguistic behavior. Current technology encourages the collection and analysis of such data, and even the presentation and publication of research findings, wholly within the digital domain. Within the field of human language technology, such benchmark data has proven to be an essential ingredient for progress, as it reduces the cost of analytical infrastructure within research communities, frees researchers to focus on their interests, encourages collaboration and reduces impediments to new participants. However, most empirical data within sociolinguistics continues to be collected and analyzed by individuals or individual research groups, and is never made available to the wider research community. This proprietary approach to data hampers collaboration, the replication of studies and the comparison of models, methods and results, all necessary components of rigorous science. The prospect of digital data sharing in sociolinguistics also raises theoretical and methodological questions: Can sociolinguists make effective use of existing corpora? Do the insights gained from corpus data differ qualitatively from data most commonly used in quantitative sociolinguistics, namely recordings of sociolinguistic interviews? What are the best practices for the creation of new digital resources for sociolinguistic research?
The project on Data and Annotations for Sociolinguistics (DASL), at LDC with support from NSF via the Talkbank project, begins to address these issues. DASL investigates the use of digital data in sociolinguistics through a series of case studies involving both analysis of variation in existing corpora and the creation of new data sets. Strassel introduces DASL's goals, assumptions, data and tools and reviews annotation and corpus creation efforts and results to date.
Towards a Comprehensive, Empirical Analysis of Linguistic Data: the case of Regional Italian vowel systems
Christopher Cieri, LDC
Jan 16, 2003; 12:00-2:00pm
Any empirical study of language relies necessarily upon a body of observations of linguistic behavior even if the study fails to formally acknowledge its corpus. The decisions one makes in approaching data affect research profoundly by opening some avenues of inquiry while blocking others. Looking across research communities as diverse as sociolinguistics and speech technologies, one finds methods that may be integrated in order to both broaden research possibilities and to perform research more efficiently.
Cieri explores the relationship among data, tools, annotation (or coding) processes and the research they support, focusing specifically on the quantitative analysis of linguistic variation. The data come from a series of sociolinguistic interviews undertaken to investigate the modeling of variation in the regional speech of central Italy.
After describing the motivation for this study, Cieri demonstrates a series of tools, processes and data formats that permit a comprehensive yet rapid analysis of vowel systems. Specifically, he demonstrates tools for transcription and segmentation, lexicons and search tools that automatically select and categorize tokens of interest from the transcripts, batch processes that perform acoustic analyses of the selected tokens and an interface for managing and adding human judgments to these analyses. In the process he offers a particular perspective on tool development, favoring information retention and annotator efficiency over computational efficiency and portability.
New Methods for Constructing Annotated Speech Corpora
Steven Bird, LDC
Jun 14, 2002; 12:00-2:00pm
Over the past decade of creating and managing speech corpora, LDC staff have developed literally hundreds of utilities, user interfaces and file formats. These databases are becoming increasingly complex in their structure, with rich standoff annotations organized across multiple layers. At the same time, the range of contributing specialties has become more diverse, as illustrated by LDC's publication plans in such areas as field linguistics, sociolinguistics, gesture, and animal communication.
Bird outlines the traditional corpus production process and catalogs the problems LDC has experienced. This provides the backdrop for LDC's R&D effort over the last four years, which has created new software infrastructure and a suite of annotation tools. He introduces the principles and key concepts of the annotation graph toolkit (AGTK), describes the current tools, and gives a brief overview of the tool development process. Finally, Bird introduces OLAC, the Open Language Archives Community, and demonstrates how it is being used for describing and discovering language resources of the kind created at LDC. The talk is followed by an informal demonstration session.
Available: Slides in PDF
Corpus Development for the ACE (Automatic Content Extraction) Program
Alexis Mitchell and Stephanie Strassel, LDC
Jun 26, 2002; 1:00-3:00pm
The objective of the ACE Program is to develop core automatic content extraction technology to enable text processing through the detection and characterization of entities, relations, and events. As part of the DARPA TIDES program, ACE supports technology research and development for various classification, filtering, and selection applications by extracting and representing language content (i.e., the meaning conveyed by the data). The ultimate goal of ACE is the development of technologies that will automatically detect and characterize this meaning.
For the past three years, LDC has been developing annotated corpora for the ACE program. Data for ACE consists of newspaper, newswire and broadcast news transcripts. To support Entity Detection and Characterization, ACE annotators label selected types of entities (Persons, Organizations, etc.) mentioned in text data. The textual references to these entities are then characterized and multiple entity mentions are co-referenced. The Relation Detection and Characterization task requires annotators to identify and characterize relations between the labeled set of entities. LDC's role in ACE has recently expanded to encompass annotation of all data for the ACE program as well as development and maintenance of annotation guidelines and annotation tools.
Mitchell and Strassel describe corpus development for the ACE program, focusing on annotation procedures and guidelines as well as quality assurance measures. In addition, they touch on particular annotation challenges including classifying generic entities, metonymic entity mentions (including the concept of GeoPolitical Entities) and identifying the temporal attributes of relations.
Available: Slides in PDF
Mike Maxwell, LDC
Jul 25, 2002; 1:00-3:00pm
What is a bilingual dictionary? Most of us have used bilingual dictionaries, so the answer seems obvious. But when it comes to defining the structure of a dictionary as a database on a computer, the obvious becomes non-obvious.
Maxwell talks about the structure of a bilingual lexicon, and in particular that of a lexical entry, from a computational and linguistic viewpoint. There are (at least) three levels at which one might define such structure. Proceeding from the most concrete to the most abstract, these are: the file format level (e.g. in terms of an XML structure); a model (using a modeling language such as UML); and an ontology of concepts.
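To make the distinction between levels concrete, here is a small Python sketch in which dataclasses stand in for a model of a lexical entry and a serializer produces one possible XML file format for it. The class names, field names, element names and the sample entry are all assumptions chosen for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Sense:
    """One sense of a headword, with its target-language gloss."""
    gloss: str
    examples: list = field(default_factory=list)


@dataclass
class LexicalEntry:
    """Model level: what an entry contains, independent of any file format."""
    headword: str
    pos: str
    senses: list = field(default_factory=list)


def entry_to_xml(entry):
    """File format level: one possible XML rendering of the model above."""
    elem = ET.Element("entry", {"pos": entry.pos})
    ET.SubElement(elem, "headword").text = entry.headword
    for sense in entry.senses:
        sense_elem = ET.SubElement(elem, "sense")
        ET.SubElement(sense_elem, "gloss").text = sense.gloss
        for example in sense.examples:
            ET.SubElement(sense_elem, "example").text = example
    return elem


# A hypothetical Spanish-English entry with two senses.
entry = LexicalEntry("casa", "noun", [Sense("house", ["la casa blanca"]), Sense("home")])
xml_text = ET.tostring(entry_to_xml(entry), encoding="unicode")
```

The same model could be serialized to a different file format without changing the dataclasses, which is the practical reason for keeping the levels apart.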
Available: Slides in PDF
Scripts/Programs for Large Data Sets
David Graff, LDC
Oct 10, 2002; 12:00-2:00pm
In any sort of corpus-based language research, the efficiency and usefulness of the research will be limited by the consistency and usefulness of the corpus. Graff focuses on establishing consistency in terms of how language corpora are presented to researchers as input for their work: the directory structure, file structure, document structure, character encoding, the amount and nature of meta-data (information about the corpus content) and how this information is incorporated.
Virtually all text corpora are drawn from "found" data -- material that already exists in electronic form to serve some purpose other than corpus-based language research, such as: publication of books, periodicals or daily news; archival preservation of public, commercial or government transactions; online discussions on various topics among diverse interest groups; and so on. The problem is that each data source has its own unique set of needs and conventions that dictate the data formats used to store and transport its particular content -- as well as its own rate of failure in making sure the data satisfy its needs and conventions.
The task for LDC, working on behalf of corpus researchers, is to design and apply the tools needed to distill each source into a common, standardized form that will (1) maximize the usability of the data on any researcher's chosen computer system, (2) preserve as much information as possible from the source, and (3) discard as much interference and noise as possible -- and do all this with a minimum of manual effort. Graff discusses strategies and tools that have been developed and used at LDC over the years for this purpose.
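One small piece of such a distillation step, bringing text from differently encoded sources into a single consistent form, might look like the Python sketch below. It is an illustration only and does not describe LDC's actual tools: the list of candidate encodings, the choice of NFC normalization, and the decision to treat control characters as noise are all assumptions.

```python
import re
import unicodedata

# Control characters other than tab and newline are treated as noise (an assumption).
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; the candidate list is an assumption."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this fallback cannot fail.
    return raw.decode("latin-1")


def normalize_text(raw):
    """Convert raw bytes from a source into one standardized text form."""
    text = decode_bytes(raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = unicodedata.normalize("NFC", text)              # one form per character
    text = _CONTROL.sub("", text)                          # drop stray control bytes
    lines = [line.rstrip() for line in text.split("\n")]   # drop trailing whitespace
    return "\n".join(lines).strip() + "\n"


sample = normalize_text(b"caf\xe9 \r\nline\x00 two\r\n")
```

Each step discards something, so in practice the choice of what counts as noise has to be weighed against the goal of preserving information from the source.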
(1) BITS and other Machine Translation Collection Projects
(2) Overview of Machine Translations
(3) BITS and other Machine Translation Collection Projects
(1) Xiaoyi Ma, (2) Shudong Huang, (3) Xiaoyi Ma and Mark Y. Liberman, LDC
Oct 31, 2002; 1:00-3:00pm
Parallel corpora are valuable resources for machine translation, multi-lingual text retrieval, language education and other applications, but, for various reasons, their availability is limited. The World Wide Web is a potential source of parallel text, and researchers are exploring this resource for large collections of bitext.
Ma and Liberman present BITS (Bilingual Internet Text Search), a system which harvests multilingual texts over the World Wide Web with virtually no human intervention. The technique is simple, easy to port to any language pair, and highly accurate.
Available: Paper in PDF
Mining the Bibliome: Information Extraction from Biomedical Text
Mark Liberman, LDC
Dec 19, 2002; 12:00-2:00pm
The goal is qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent progress and new research in three areas: high-accuracy parsing, shallow semantic analysis, and integration of large volumes of diverse data. Liberman describes two applications: drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline, and pediatric oncology, in collaboration with researchers in the eGenome group at Children's Hospital of Philadelphia. These applications, worthwhile in their own right, provide excellent test beds for broader research efforts in natural language processing and data integration.