Linguistic Exploration


Linguistic Exploration is a mode of investigation in (computational) linguistics involving empirical research on complex, dynamic, multimodal datasets through the combination of traditional field methods with new technologies for storing and analyzing linguistic data. The languages under study may range from the undescribed to the well-studied, and the investigator may operate in a village or a laboratory. The focus is the documentary and exploratory mode of research, generating reusable language resources, and developing new techniques for working with continually evolving datasets.

Activities

Resources

This page describes resources for language documentation and linguistic exploration. Recently added links and updated descriptions are marked with a *. Please send updates to Steven Bird (sb@ldc.upenn.edu).


Index
Apache Arizona BNC-Online CBOLD CDEL CHILDES CRG East Cree
ELF EGA EMILLE Encyclopedias* Ethnologue Expedition FEL Fieldwork
Field Software HRAF HyperLex Gamilaraay ICHEL IDS Ingush IPA
ISO 639-2 Jiwarli Kamusi Kura LACITO LangWorld LDC-Online LINCOM
Linguasphere LSA Maasai Maliseet Mambilex Nahuatl NATO NICE
NSEP NSF* Numic Oykangand PacLing PDLMA Pirahã Rosetta*
TalkBank TELL SALTMIL SIL SOWL STEDT StressTyp TDS
Terralingua TIDES UNESCO VW WALS Warlpiri Wiyuta Yinka Dene
YourDictionary

Key
C: an online corpus of textual data
S: an online corpus including speech recordings
W: an online corpus with a Web interface
L: there is a lexicon
T: an available tool for creation, display or search
D: a tool is downloadable
P: there is a citeable paper which documents the system
R: any other kind of resource such as a web page, a professional association, a new project, a classification system, etc
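
The key letters are combined into strings such as CLT or CDSPTW in front of each entry below. As a minimal sketch of how such a string could be expanded programmatically (the mapping is copied from the key above; the function name decode_key is hypothetical and not part of any tool listed on this page):

```python
# Mapping taken from the Key above; decode_key is a hypothetical helper.
KEY = {
    "C": "online corpus of textual data",
    "S": "online corpus including speech recordings",
    "W": "online corpus with a Web interface",
    "L": "lexicon",
    "T": "tool for creation, display or search",
    "D": "downloadable tool",
    "P": "citeable paper documenting the system",
    "R": "other kind of resource",
}


def decode_key(codes):
    """Expand a code string such as 'CLT' into a list of descriptions."""
    unknown = [c for c in codes if c not in KEY]
    if unknown:
        raise ValueError("unknown key letters: " + "".join(unknown))
    return [KEY[c] for c in codes]


if __name__ == "__main__":
    for description in decode_key("CLT"):
        print(description)
```

Rejecting unknown letters, rather than skipping them, makes a mistyped code string visible immediately.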


Note: the title of each section gives the primary hyperlink

Linguistic Resources
CW Apache: Chiricahua and Mescalero Apache Texts (Eleanor Culley)
This project, based at the Electronic Text Center of the University of Virginia, is creating an electronic version of Harry Hoijer's Chiricahua and Mescalero Apache Texts (1938), containing stories, songs, translation, analysis and a grammar. The presentation makes extensive use of HTML frames and an interesting technique for aligning the parallel texts.
  Arizona Native American Online Dictionary Project (Mike Hammond)
This project is developing online dictionaries of Native American languages spoken in southwest states of the US. At present there is a Tohono O'odham - English dictionary with a search interface.
CWT British National Corpus - Online (natcorp@oucs.ox.ac.uk)
The British National Corpus (BNC) is a 100 million word corpus of British English, both spoken and written. The BNC Online service allows you to search the corpus and download citations. BNC Online requires users to install a special-purpose browser and concordance generator, which is only available under MS Windows. Users must agree to licensing conditions and pay a subscription fee (the first 20 days are free).
CLWDT CBOLD: Comparative Bantu Online Dictionary (Larry Hyman, Ron Sprouse, Jeff Good)
CBOLD is a collection of lexical databases covering over 260 Bantu languages. The data are presently stored in FileMaker, FoxBase Pro, MS Word, and Text formats. Most of the databases can be downloaded from a file server and some of the dictionaries and word lists can be searched online here. Several research and computational tools have been developed, including Bantu MapMaker, a Macintosh HyperCard program, and Perl scripts supporting web-search of the dictionaries. There is also an interface to the Tanzanian Language Survey. [NSF award]
CSL CDEL: Center for the Documentation of Endangered Languages (Douglas Parks, Wally Hooper)
CDEL is developing multimedia dictionaries and language lessons for several Caddoan and Siouan languages. Two dictionaries with web-accessible samples are the Skiri Pawnee Multimedia Dictionary and the Yanktonai Lexicon. The lexical databases use Visual FoxPro. There is also work on developing pedagogical materials using Macromedia Authorware and Director. For an example of the results, see the Arikara (Sáhnis) Language Program. An interlinear text processor is in development. [ archives:CDEL | NSF award | Wally Hooper's Chicago Talk ]
CLT CHILDES: The Child Language Data Exchange System (Brian MacWhinney)
The CHILDES project provides a large database of first and second language acquisition data from over 30 languages. Lexicons containing dynamic word-frequency data are being constructed for each of the major languages, in support of crosslinguistic studies of lexical development. [NSF award]
R CRG: Cross-linguistic Reference Grammar (Dietmar Zaefferer)
The Cross-linguistic Reference Grammar project aims to develop a grammar format that allows the systematic and detailed description of any natural language. Descriptive grammars written in this format will be stored in a database, and this will be used for systematic and automatic comparison between languages. CRG is funded by DFG (Germany).
R East Cree Interactive Grammar (Marie-Odile Junker)
This online interactive grammar is under construction. There is a text (in Cree syllabics), with translation and audio.
R ELF: Endangered Language Fund (Doug Whalen)
The Endangered Language Fund provides small grants in support of the scientific study of endangered languages, and of native efforts in maintaining endangered languages. The ELF maintains a list of endangered language resources. Doug Whalen and David Harrison have written an encyclopedia entry on endangered languages.
  EGA: A Documentation Model for an Endangered Ivorian Language (Dafydd Gibbon, Bruce Connell, Firmin Ahoua)
This is a collaborative project involving researchers at the universities of Bielefeld, Oxford and Cocody, funded by the pilot phase of the VW Foundation's project on Documentation of Endangered Languages. Ega is an endangered Kwa language spoken by fewer than 300 speakers in Ivory Coast. This project will `process existing documentation and extend it into a structured data collection based on selected text sorts, a typologically oriented questionnaire, high quality audio and video data, and structured multi-tier annotation procedures.'
  EMILLE: Enabling Minority Language Engineering (Tony McEnery, Robert Gaizauskas)
This project, based at the Linguistics Department at Lancaster and the CS Department at Sheffield, will extend and apply the GATE architecture to non-indigenous minority languages in Britain. EMILLE will create text corpora of over nine million words each for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu, trying to match the range of genres covered by the British National Corpus. EMILLE will also gather half-million word spoken corpora for Bengali, Gujarati, Hindi, Panjabi and Urdu. EMILLE is the outgrowth of the earlier MILLE Project.
R Encyclopedias about the World's Languages
Jane Garry and Carl Rubino have edited the Encyclopedia of the World's Languages: Past and Present. This 1200 page volume covers 214 languages, providing information about location, classification, origin, orthography, phonology, morphology, syntax, example sentences, and (where appropriate) language preservation efforts. Example chapters are available on Ancient Greek, Modern Greek, and Bemba. [Amazon.com listing].
R Ethnologue: Languages of the World (Barbara Grimes)
The Ethnologue catalogs and describes 6,809 living and recently extinct languages. It includes language maps, country overviews, and a large bibliography, and it assigns a unique three-letter code to each language (cf ISO 639-2, Linguasphere). The Ethnologue comes as a pair of books (1600pp), and also as searchable CD-ROM and web versions. There is a page listing nearly extinct languages. [Constable & Simons: Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale]
P Expedition (Sergei Nirenburg, Rémi Zajac)
This project is devoted to developing a computational environment for the rapid ramp-up of machine translation systems for so-called `low density' languages. Expedition includes a knowledge acquisition module called Boas, which is meant to permit a speaker of one of about 60 designated languages (who is also bilingual in English), together with a programmer, to provide the system with knowledge about the morphological and syntactic properties of the language, and to create a bilingual dictionary between that language and English. A document describing Boas is available online here. A related TIDES project at NMSU is CREST: Crosslingual, Retrieval, Extraction, Summarization and Translation. [ Ron Zacharski's Chicago talk ]
R FEL: Foundation for Endangered Languages (Nicholas Ostler)
The Foundation for Endangered Languages exists to promote awareness and use of endangered languages. It publishes a newsletter and organizes an annual conference. The next conference is FEL IV: Endangered Languages and Literacy (September 2000).
R Field Software (Evan Antworth and Randy Valentine)
This is an online appendix to a chapter called `Software for Doing Field Linguistics' in the book: Using Computers in Linguistics: A Practical Guide. This site contains information and pointers for electronic resources related to data management, speech analysis and phonetics, phonology and morphology, syntax and grammar description, lexicon, text analysis, and language survey and comparison.
R Fieldwork: The Anthropologist in the Field (Laura Tamakoshi)
This graphics-rich site documents many aspects of anthropological field research, with a focus on Papua New Guinea.
C Gamilaraay Dictionary and Knowledge Base (Peter Austin)
The Gamilaraay dictionary is a collection of linked HTML documents. It contains glossed examples where each word is a hyperlink to the corresponding full entry. To search the dictionary it is necessary to load the whole database as a single page and then use the search facility provided by the Web browser. Another set of links into the dictionary is provided by a thesaurus.
R HRAF: Human Relations Area Files (Melvin Ember)
HRAF is a consortium of research institutions based at Yale which archives archaeological and ethnographic data and publishes it on the Internet and on CD-ROM. HRAF members can access their ethnography collection, which contains one million pages of information covering over 350 cultures worldwide, classified according to the Outline of Cultural Materials. Although the focus is cultural anthropology, the HRAF archive includes many dictionaries, grammars and text collections.
CSLTP HyperLex (Steven Bird)
Steven Bird's HyperLex system, developed in support of a field project, provides HTML-mediated access to a lexicon, speech recordings and paradigmatic catalogues for several languages. A short paper describing the system is available here. Steven plans to produce a portable version that can easily be adapted to new languages and new projects. HyperLex queries can be embedded inside online descriptive documents, as exemplified in the paper: Multidimensional exploration of online linguistic field data. [ NSF award | Steven Bird's Chicago talk ]
CLTW Ingush Grammar, Dictionary, and Texts (Johanna Nichols)
The Ingush project aims to produce a descriptive grammar, a bilingual dictionary, and a collection of texts for Ingush. Data are stored in a FileMaker database, and there is a Web interface. A system for collaborative interlinearization over the Web is under development. [ NSF award | Ron Sprouse's Chicago talk ]
R IPA: International Phonetic Alphabet - Sound Recordings (John Esling)
This site contains 93 MB of WAV files organized into folders according to language.
R ICHEL: The International Clearing House for Endangered Languages (Kazuto Matsumura)
This group, based in Tokyo, has organized workshops on endangered languages. Its activities largely ceased in early 1998. ICHEL continues to host part of the UNESCO Red Book of Endangered Languages.
R IDS: Intercontinental Dictionary Series (Bernard Comrie)
Bernard Comrie is editing the Intercontinental Dictionary Series. More information will be available soon. Currently, the only web presence of IDS is this very out-of-date page.
R ISO 639-2: Codes for the Representation of Names of Languages (iso639-2@loc.gov)
The International Organization for Standardization (ISO) has a standard which assigns three-letter codes to languages and language families. It contains 464 codes, and languages which lack their own code are assigned the code of a language family; e.g. mkh = "other Mon-Khmer languages". There are two versions of the standard, ISO 639-2/B for bibliographic applications and ISO 639-2/T for terminology applications, and they differ on 5% of the codes. A previous version, ISO 639-1, assigned two-letter codes and was superseded in 1989. (Note that Ethnologue three-letter codes cover 6,809 languages.)
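
The fallback from a language to its family code can be sketched as a two-step lookup. This is a toy illustration only: the tables hold a handful of entries chosen for the example rather than an extract from the standard, the assignment of Mon to the family code is an assumption made for the example, and the function name lookup_code is hypothetical.

```python
# Toy tables for illustration only; not an extract from ISO 639-2.
LANGUAGE_CODES = {"Khmer": "khm", "Vietnamese": "vie"}

# Assumed for the example: Mon has no code of its own and falls back
# to the code of its language family.
FAMILY_OF = {"Mon": "Mon-Khmer"}
FAMILY_CODES = {"Mon-Khmer": "mkh"}  # "other Mon-Khmer languages"


def lookup_code(language):
    """Return the language's own code, else its family code, else None."""
    if language in LANGUAGE_CODES:
        return LANGUAGE_CODES[language]
    family = FAMILY_OF.get(language)
    if family in FAMILY_CODES:
        return FAMILY_CODES[family]
    return None


if __name__ == "__main__":
    for name in ("Khmer", "Mon", "Unlisted"):
        print(name, lookup_code(name))
```

One consequence of the fallback is that a family code does not identify a single language, so it cannot be mapped back to one.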
  Jiwarli - A Language of Western Australia (Peter Austin)
This is an excellent example of online description and analysis associated with fieldwork on Jiwarli. The site contains narrative texts with interlinear glosses and sound files, a phonological sketch with sound clips, a grammatical sketch, and a description of the kinship system.
R Kamusi: The Internet Living Swahili Dictionary (Martin Benjamin)
This web-based Swahili/English dictionary permits remote users to create and revise entries, as part of an "open experiment in cooperative scholarship".
T Kura: a database application for language description (Boudewijn Rempt)
Kura is a system for descriptive and analytic work (in a tradition known as Basic Linguistic Theory), instantiating Rempt's dream for accountable, large coverage grammars. The Kura client runs under the Unix KDE desktop (requires Python, MySQL, Qt), and currently supports lexicons and interlinear texts.
CDSPTW LACITO Linguistic Data Archiving Project (Boyd Michailovsky, John B. Lowe, Michel Jacobson)
Projet Archivage, based at LACITO in Paris, aims to provide tools and formats for linguistic and anthropological field data. An interesting feature is the use of XML markup, with a DTD that supports transcriptions, phrasal and word-by-word interlinear translations, and audio references. Some XSL style sheets are provided that illustrate the potential power of XML markup to support web browsing for material of this type, giving access to text and sound. The LACITO Archive contains a corpus of annotated speech in Limbu (Tibeto-Burman, Nepal) and examples in other languages. Archive documents can be browsed on-line or downloaded along with the tools required for local use. [ archives:LACITO | Michel Jacobson's Chicago talk ]
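
To illustrate the kind of markup described here, the following sketch reads a small XML document holding a transcription, a free translation, word-by-word glosses and an audio reference. The element and attribute names (text, sentence, audio, transcription, translation, word, form, gloss) are hypothetical placeholders chosen for the example and should not be taken as the LACITO DTD; the sample content is likewise a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element names are assumptions, not the LACITO DTD.
SAMPLE = """
<text>
  <sentence id="s1">
    <audio start="0.00" end="2.35"/>
    <transcription>word1 word2</transcription>
    <translation>free translation of the sentence</translation>
    <word><form>word1</form><gloss>gloss1</gloss></word>
    <word><form>word2</form><gloss>gloss2</gloss></word>
  </sentence>
</text>
"""


def read_sentences(xml_string):
    """Collect each sentence's audio span, transcription and glosses."""
    root = ET.fromstring(xml_string.strip())
    sentences = []
    for s in root.iter("sentence"):
        audio = s.find("audio")
        sentences.append({
            "id": s.get("id"),
            "span": (float(audio.get("start")), float(audio.get("end"))),
            "transcription": s.findtext("transcription"),
            "translation": s.findtext("translation"),
            "glosses": [(w.findtext("form"), w.findtext("gloss"))
                        for w in s.findall("word")],
        })
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(sentence)
```

Keeping the audio span on each sentence is what lets a browser jump from a line of text to the matching stretch of the recording.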
R Languages of the World (langworld@hotmail.com)
This is a series of Russian books, each documenting the languages of a given family, produced by the Institute of Linguistics at the Russian Academy of Sciences. Most of the books are still in the planning stage. There is an accompanying series of language maps (in English), using Linguasphere terminology.
CLSW LDC Online (Zhibiao Wu, Mark Liberman, Steven Bird)
The Linguistic Data Consortium has developed a general data model for online search of annotated text and speech corpora (dictionaries, broadcast news, telephone speech, newswire, ...). LDC has also developed a general framework for multilevel annotation, described in a recent technical report and conference paper. This work is being pursued in the context of the TalkBank project.
R LINCOM EUROPA (LINCOM.EUROPA@t-online.de)
LINCOM has published some 400 descriptive grammars, as well as dictionaries and research monographs.
R Linguasphere Register of the World's Languages and Speech Communities (David Dalby)
This is a comprehensive classification of the world's languages, designed to serve as a framework for referencing and accessing all forms of data and documentation on the world's languages. The register is available in print (1044 pages in two volumes) and online from the Linguasphere Observatory. Free extracts are available. A single pair of digits identifies any language or language name in terms of 100 zones of linguistic or geographical reference, and a flexible scale of up to six alphabetic elements records the detailed classification - and any subsequent reclassification - of the languages and dialects within each zone. Linguasphere and SOAS are preparing a GIS version covering Africa.
R LSA: Linguistic Society of America (lsa@lsadc.org)
The LSA has a committee on Endangered Languages and their Preservation, which meets annually in May. The LSA has a FAQ on endangered languages, written by Betty Birner. The LSA has resolutions on Research with Human Subjects and Language Rights.
CL Maasai Lexicographic and Text Data Base Project (Doris Payne)
This is a cross-dialectal study of Maasai, incorporating a detailed lexicon, a verb database, and texts. [NSF award]
CLW Maliseet-Passamaquoddy Dictionary (Robert Leavitt)
This project has created a web-searchable dictionary having 675 entries, with the data stored using FileMaker. There are plans to develop a CD-ROM or Website with sound clips and cross-linked texts, paradigms, and illustrations. [NSF award]
CS Mambilex: A Comparative Survey of Mambila Dialects (Bruce Connell)
This project is developing a comparative audio linguistic atlas of the Mambila region of the Nigeria-Cameroon borderland, using FileMaker. Each language has its own database with 883 records, one for each of the English/French glosses. A second component of the database is a linguistic atlas of the Mambiloid region, containing maps that are cross-referenced with the lexical database. The database and atlas together are intended as a basis for further research, and as a teaching tool of interest in general linguistics and historical linguistics courses. See also the Virtual Institute of Mambila Studies.
CWL Analytic Dictionary of Ameyaltepec Nahuatl (Jonathan Amith)
Over the past 15 years, Jonathan Amith has been working on a comprehensive dictionary of Ameyaltepec Nahuatl. His database is in Shoebox format, and includes extensive grammatical information and glossed example sentences. The dictionary is used in the Nahuatl Summer Language Institute at Yale. The web interface uses HyperLex. SIL Mexico has a website with sections on the linguistic structure of Nahuatl. [ Jonathan Amith's Chicago talk ]
R NATO Advanced Study Institute on Language Engineering for Lesser-Studied Languages (Turkey, July 2000) (Kemal Oflazer)
This is a program of courses and workshops in language engineering, with a likely geographic focus on languages of eastern Europe, the Middle East, and the former Soviet Union.
  NICE: Native-Languages Interpretation and Communication Environment (Lori Levin)
This project will apply machine translation to indigenous languages, by developing new low-cost methods for creating MT systems. The work focuses on Spanish and the native languages of Latin America. The project is funded by DARPA and hosted at CMU's Language Technologies Institute.
R NSEP: National Security Education Program (nsep@aed.org)
The Academy for Educational Development is a US-based non-profit international development organization. AED's National Security Education Program (NSEP) is designed `to increase the ability of U.S. citizens to communicate and compete globally by knowing the languages and cultures of other countries.' Towards this end, NSEP offers international graduate fellowships to U.S. citizens to study abroad, after which they must seek employment with a U.S. government agency involved in national security. Funding is prioritized to cover languages which are critical to U.S. national security. This year, the NSEP Areas of Emphasis lists the following languages (new languages emphasised): Albanian, Amharic, Arabic (and dialects), Armenian, Azeri, Belarusian, Bulgarian, Burmese, Cantonese, Czech, Farsi, Georgian, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lingala, Macedonian, Malay, Mandarin, Mongolian, Polish, Portuguese, Romanian, Russian, Serbo-Croatian, Sinhala, Slovak, Slovenian, Swahili, Tagalog, Tajik, Tamil, Thai, Turkish, Turkmen, Uighur, Ukrainian, Urdu, Uzbek, and Vietnamese.
R NSF Linguistics Program (Cecile McKee)
The NSF Linguistics Program funds many language documentation projects, including: Apache (Axelrod), Cherokee (Haag), Tepehua (Woodbury), Yurok (Garrett), Fulbe (Ochs), Holikachuk (Oliverio), Yuma (Miller), Bole (Schuh), Coeur d'Alene (Doak), Albanian (Newmark), Phonetics (Maddieson), Tamashek (Heath), Creek (Martin), Montana Salish (Thomason), Khoisan (Collins), ASL (Neidle).
CL Numic Comparative Lexicon (John McLaughlin)
The Numic Comparative Lexicon is a database covering the seven languages of the Numic branch of the Uto-Aztecan language family, designed to support the reconstruction of Proto-Numic. The database uses commercial software having Mac and PC versions. [NSF award].
CLW Oykangand and Olkola Dictionary (Philip Hamilton)
The Oykangand and Olkola Dictionary is a website containing wordlists for several Australian languages, with extensive coverage of flora, fauna and material culture, including many scientific names and photographic images.
CL Pirahã transcriptions (Dan Everett)
Some excellent examples of linguists' traditional interlinear transcription style, and a discussion of its lexicographic use, can be found on Daniel and Keren Everett's Pirahã web site.
R Pacific Linguistics (Julie Manley)
Pacific Linguistics has published over five hundred volumes of descriptive work (including grammars and dictionaries) on languages of the Pacific, southeast and south Asia and Australia. Pacific Linguistics is part of the Research School of Pacific and Asian Studies at the Australian National University in Canberra.
CL PDLMA: Project for the Documentation of the Languages of Mesoamerica (Terrence Kaufman, John Justeson)
This project has elicited lexical data for several Mixe-Zoquean and Zapotecan languages, for the purposes of comparative research. Two of the dictionaries have a web interface although they do not appear to be operational at present. [NSF award]
R Rosetta Disk Project (Jim Mason)
The Long Now Foundation is working to create a modern "Rosetta Stone" using a new extreme longevity, high-density analog storage technology [picture]. The two-inch micro-etched nickel disk will store thousands of pages, written at a microscopic scale, including all the world's translations of the book of Genesis, an assortment of creation works with interlinear glosses, vocabulary, a guide to the orthography, plus metadata about the language. Like the Rosetta Stone, the disk will record a single text in many languages, and this will allow for the recovery of "lost" languages in the "deep future."
R SALTMIL: Speech And Language Technology for Minority Languages (Bojan Petek)
SALTMIL is a special interest group of ISCA, founded to promote R&D in speech and language technology for lesser-used languages, particularly those of Europe (spoken by some 40 million people). SALTMIL organized a one-day workshop in connection with LREC-2000 entitled: Developing language resources for minority languages: re-useability and strategic priorities. See also: EBLUL: European Bureau for Lesser Used Languages.
TDR Summer Institute of Linguistics
SIL has developed a variety of Windows and Macintosh tools to support field-based linguistic research, such as: Shoebox, a data management program for text-based linguistics and lexicography; the Speech Analysis Tools; and WORDSURV for creating and comparing wordlists. A full list of software is available via the SIL Software Catalog. LinguaLinks Linguistics Tools is a suite of programs including an interlinear text editor, a wordform inventory editor, an analysis editor, a morphology explorer, and a lexical database editor. Evan Antworth has written a chapter describing SIL's Software for Doing Field Linguistics. SIL has developed an extensible rendering engine for complex writing systems, called Graphite. SIL's Language and Culture Archive will have an electronic section to house field data. SIL publishes the Ethnologue. [ archives:SIL | Larry Hayashi's Chicago talk ]
T SOWL: Sounds of the World's Languages
The UCLA Phonetics Lab has produced a Macintosh HyperCard stack which documents rare sounds from 80 languages. They also have DOS software providing access to the UCLA Phonological Segment Inventory Database (UPSID), which documents segment inventories for 451 languages.
LP STEDT: Sino-Tibetan Etymological Dictionary and Thesaurus (James Matisoff, Richard Cook)
The STEDT project began in 1987 at UC Berkeley, with the goal of creating an etymological dictionary of Proto-Sino-Tibetan, through the systematic collection, preservation and publication of lexical data from the daughter languages, many of which are undocumented. The Primary STEDT Etymological Database environment comprises six related databases, including Main Lexicon, Etyma, Language Names, Language Groups, and Font Reference. In addition, a number of Ancillary Databases capture source documents and provide a working environment for the culling of data for the Main Lexicon. A recent paper describing STEDT's electronic resources is available here.
  StressTyp (Rob Goedemans)
StressTyp is a typological database containing information on the metrical systems of 500 languages. The data is stored in a proprietary format, for which it is necessary to use a commercial database package (4th Dimension). A runtime version of the database package is available for free, upon signing a license agreement. StressTyp is associated with a project to study Prosody in the Languages spoken in Indonesia, with the World Atlas of Language Structures, and with the Typology Database System.
R TalkBank: A Multimodal Database of Communicative Interactions (Brian MacWhinney, Steven Bird)
TalkBank is an interdisciplinary research project funded by a five-year NSF grant, hosted by CMU and U Penn. The goal of TalkBank is to foster fundamental research in the study of human and animal communication. TalkBank will provide standards and tools for creating, searching, and publishing primary materials via networked computers. The project will develop methods for transcribing, annotating, accessing, and analyzing communicative interactions, involving the direct linkage of transcripts to digitized audio and video. One of the TalkBank domains is corpus-based field linguistics, and TalkBank sponsored what was probably the first meeting on computational support for field linguistics, at the Chicago LSA: The Linguistic Exploration Workshop. [ NSF award | Brian MacWhinney's Chicago talk ]
CLTW TELL: Turkish Electronic Living Lexicon (Sharon Inkelas)
TELL is a lexical database containing 30,000 entries, including phonemic transcription of headwords elicited in various morphological contexts, and etymological and morphological information for many words. TELL has a query interface which permits regular expression search over the fields. [ NSF award 1 | NSF award 2 | Ron Sprouse's Chicago talk ]
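
Regular expression search over the fields of a lexical database can be sketched as follows. The entries are toy Turkish examples written for this illustration and are not taken from TELL; the field names and the function search are likewise assumptions, not TELL's actual schema or query interface.

```python
import re

# Toy entries for illustration only; not drawn from TELL.
ENTRIES = [
    {"headword": "kitap", "accusative": "kitabı", "gloss": "book"},
    {"headword": "kanat", "accusative": "kanadı", "gloss": "wing"},
    {"headword": "sepet", "accusative": "sepeti", "gloss": "basket"},
]


def search(entries, field, pattern):
    """Return the entries whose given field matches the regular expression."""
    rx = re.compile(pattern)
    return [e for e in entries if rx.search(e.get(field, ""))]


if __name__ == "__main__":
    # Headwords whose accusative form shows a voiced stop before the suffix.
    for entry in search(ENTRIES, "accusative", r"[bd]ı$"):
        print(entry["headword"], entry["accusative"])
```

Searching a derived form rather than the headword is what makes a query of this kind useful for finding alternating stems.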
R TDS: Typology Database System (Paola Monachesi)
The Typology Database System will integrate multiple typology databases, permitting end-users to query them in parallel. The system is part of a planned web-based digital archive for typological description, including grammars, typological databases and typological expert systems. The project is coordinated by the Utrecht Institute of Linguistics and currently involves researchers in the Netherlands, Germany and Britain.
R Terralingua (Luisa Maffi, David Harmon)
Terralingua is an international nonprofit organization which supports global linguistic diversity. It produces a quarterly electronic newsletter and maintains a page of resources on language endangerment, survival, and revitalization.
R TIDES: Translingual Information Detection, Extraction and Summarization (Gary Strong)
The TIDES Vision describes a scenario where the US military finds itself deployed in a region where documents in the indigenous local language are strategically important, but no MT or IR technology is yet available. DARPA is sponsoring 20 projects to develop technology to permit English-language queries on documents in other languages, and for the analysis to be reported back in English. The technology will be used by "English-speaking US military users to access, correlate, and interpret multilingual sources of information relevant to real-time tactical requirements." Related projects: FALCon: Forward Area Language Converter: Translingual Help for U.S. Troops; ARDA Exploitation of Human Language Data.
R UNESCO Projects in Language Documentation
UNESCO is associated with two initiatives to document languages. The UNESCO Red Book on Endangered Languages was initiated by UNESCO and continues as a volunteer activity; ICHEL has links to Red Book sections on Asia/Pacific, Africa, South America, Europe, and Northeast Asia. More recently, UNESCO has initiated a Report on the Languages of the World, and an International Mother Language Day. A new edition of the booklet: Atlas of the World's Languages in Danger of Disappearing was published in July 2001.
R Volkswagen Foundation: Documentation of Endangered Languages (Vera Szöllösi-Brenig)
The Volkswagen Foundation is sponsoring research on the documentation of endangered languages. The foundation encourages the development of new methods of researching, processing and archiving linguistic and cultural data, and intends to create opportunities for multidisciplinary and interdisciplinary utilisation of the data. In its pilot phase, the program is funding projects on Tofa (Siberia), Salar & Monguor (China), Ega (Ivory Coast), Teop (PNG), Wichita (USA), and some indigenous languages of Brazil. The program is also funding a project on the necessary computational infrastructure at MPI, called TIDEL. An index page for the group of projects is available here. Information about the main phase will be available from mid April 2001.
R WALS: World Atlas of Language Structures (Bernard Comrie, Matthew Dryer, David Gil, Martin Haspelmath)
Researchers at the Linguistics Department of the Max Planck Institute for Evolutionary Anthropology in Leipzig and the Linguistics Department at Buffalo are building a typological database which classifies 200 languages for some 100 phonological, morphological and syntactic features. Each feature will be displayed on a two-page global map (sample map) and accompanied by a two-page discussion. The result will be published as a book and a CD-ROM with a search interface. Descriptions are available at the Leipzig and Buffalo sites. Dryer has a page describing Basic Linguistic Theory, a cover term for the common theoretical assumptions underpinning most descriptive linguistics.
CLTP Warlpiri Dictionary (Chris Manning, Kevin Jansz and Nitin Indurkhya)
The Warlpiri dictionary is stored in XML and accessed via a Java program. The only online material seems to be the above page, which includes pointers to several screen shots. There is a paper which describes the system: Kirrkirr: Interactive Visualisation and Multimedia from a Structured Warlpiri Dictionary. [ Chris Manning's Chicago talk ]
R Wiyuta: Assiniboine Storytelling with Signs (Brenda Farnell)
This is an early, ground-breaking example of an interactive multimedia presentation of ethnographic materials (Macintosh CD-ROM, 1995). It documents the performances of traditional Assiniboine narrators, with various transcriptions (Labanotation, phonemic, English translation) along with commentary.
R Yinka Dene Linguistic Information (Bill Poser)
This site contains a collection of annotated texts, a comparative vocabulary, grammatical information, and various other resources. A page on Dene syllabics contains photographed inscriptions annotated with linguistic information about the writing system.
R YourDictionary.com: A Web of Online Dictionaries (Robert Beard)
This site references and/or archives some 200+ online dictionaries and some 100+ online grammars, along with information about fonts and other language resources. The site includes an Endangered Language Repository.
 
Last update: 20 March 2002
Steven Bird, sb@ldc.upenn.edu


To be added:
http://www.mpi.nl/world/groups/lcog.html
http://www.carleton.ca/~mojunker/eastcreegrammar/
http://www.ietf.org/rfc/rfc1766.txt
ftp://ftp.isi.edu/in-notes/rfc3066.txt
http://llt.msu.edu/vol4num1/callforpapers.html
http://www.ids-mannheim.de/dsav/
http://www.inalf.cnrs.fr/
http://socrates.berkeley.edu/~autotyp/
http://www.ecai.org/
http://www.lmp.ucla.edu/