LDC also collaborates with like-minded organizations on a range of projects with differing goals.
ANC (American National Corpus)
The ANC project fosters the development of a corpus of American English comparable to the British National Corpus (BNC). The American National Corpus Second Release is available in the LDC Catalog.
DASL (Data and Annotation for Sociolinguistics)
This project investigates best practices in the use of digital speech corpora in the study of language variation. Our pilot study analyzes four large speech corpora for a common sociolinguistic variable and develops a corpus of carefully transcribed and annotated sociolinguistic interviews.
Linguistic Society of America (LSA) Summer Institutes
LSA sponsors biennial summer institutes that focus on themes of interest. The Institutes consist of regular session courses, pre-session courses and institute lectures. LDC has provided data for use in various Summer Institute courses.
SMART (Source Media Authoring Resources and Tools)
SMART combines raw and annotated data sets with software resources for browsing, searching, extracting and preparing material for use in authored works. These are distributed via an efficient infrastructure that provides licensing and computing support as well as an archive.
Syntactic Parsing Project
LDC and Google collaborated to create parsing resources to improve syntactic web searches; annotations were manually performed by specially trained linguists. Resources available through the LDC Catalog from this effort are English Web Treebank and English News Text Treebank: Penn Treebank Revised.
TalkBank was an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.
LDC’s agreements with over 40 data-providing organizations support the development of training and test data for sponsored projects as well as general corpus creation. Those collections are included in many LDC resources, including the multilingual Gigaword series, parallel text collections and broadcast data sets.
External corpus developers have contributed over 50% of the resources in LDC’s Catalog. Just a few of the contributors are:
- Microsoft Research India (Indian language part-of-speech tagsets)
- Brandeis University (TimeBank)
- Indiana University (Nationwide Speech Project)
- University of Georgia (Digital Archive of Southern Speech)
- USC Shoah Foundation Institute (MALACH Interviews and Transcripts)
- MITRE Corp. (aligned transcripts, spatial annotations)
- New York Times (New York Times Annotated Corpus)
- US Military Academy at West Point (multilingual speech databases)
- Charles University Prague (multilingual dependency treebanks)
- Hong Kong University of Science and Technology (China): Collection and annotation of Chinese conversational telephone speech and broadcast speech
- Brno University of Technology (Czech Republic): Collection, annotation and distribution of multilingual speech data, tools and related resources
- Budapest University of Technology and Economics (Hungary): Development of Hungarian and Kurdish NLP technologies and resources
- European Language Resources Association/Evaluations and Language resources Distribution Agency (France): Collection and annotation of Arabic broadcast speech
- Georgetown University Press (USA): Development of Arabic dialectal dictionaries
- Google Inc. (USA): Syntactic structure annotation of English web text
- Institut Royal de la Culture Amazighe (Morocco): Development of language resources for Amazigh
- Al Akhawayn University (Morocco): Enhancements to LDC’s Arabic reading tool and lexical resources development for an Iraqi Arabic WordNet
- Vassar College, Brandeis University (USA): Development and deployment of a Language Application Grid to provide access to NLP tools and resources
- University of Colorado (USA): Development of multilingual treebanks and propbanks
- Columbia University (USA): Development of Arabic tools and language resources
- MediaNet (Tunis): Collection of Arabic broadcast speech
Sister Organizations and Networks
LDC works with ELRA, the Linguistic Data Consortium for Indian Languages, Gengo-Shigen-Kyokai and others on the role of data centers in language resource development and distribution.
LDC collaborates with global networks including the British National Corpus Consortium, E-MELD, European projects such as CLARIN, ENABLER, FLaReNet and META-NET, the Japan-based Language Grid and the US TalkBank project.
LDC is a member of the Open Language Archives Community (OLAC), an international partnership working to create a worldwide virtual library of language resource metadata and to build consensus on best practices for digital archiving. LDC’s Catalog (searchable through OLAC) consistently receives OLAC’s five-star rating for overall metadata quality.
As part of the American National Corpus Consortium, LDC contributed data to this corpus development effort and supports the continued broad availability of the American National Corpus and its progeny through the LDC Catalog and other means.
LDC serves multiple research communities through its representation on funding panels, editorial boards and scientific committees, and through its participation in conferences and workshops.