LDC Overview

In the early 1990s, as Human Language Technologies advanced thanks to improved algorithms and the increased availability of affordable microcomputers, private and public researchers were working in earnest to develop capabilities in speech and text processing. Leaders in the associated disciplines recognized, however, that they lacked the volume and variety of data necessary to build robust, portable and scalable systems. The Linguistic Data Consortium (LDC) was founded to simplify the distribution of critical linguistic data.

Charles L. Wayne, then Technical Officer at ARPA's Software and Intelligent Systems Technology Office, issued a call for proposals to host a consortium devoted to acquiring, archiving, preserving and distributing linguistic corpora. Professor Mark Liberman wrote the response from the University of Pennsylvania, which was selected from among competing proposals. Penn's longstanding linguistics tradition, its reputation in computer science and its more recent work on the Penn Treebank, together with Professor Liberman's work on the ACL/DCI initiative, made for a good fit.

The LDC Catalog was immediately populated with several important data sets, donated by government and private sources, that continue to be widely used by the community, such as TIMIT, ATIS and Switchboard. The Catalog grows by 30-36 corpora each year and contains data developed by LDC and contributed by partners around the globe.

Having begun as a drawer in Professor Liberman's desk, LDC has since grown to 50 full-time employees located within the University City Science Center in Philadelphia.

LDC engages in collaborations with US and foreign researchers, institutions and data centers. The Consortium supports various initiatives that promote language resource development and distribution, such as the Open Language Archives Community (OLAC), the Universal Catalog (ELRA), and the Language Grid. LDC’s work in sponsored programs has grown in size and scope to include consultation, needs analysis, task specification, data collection, development of annotation guidelines and software tools, management of multiple data providers and coordination with program sponsors, research performers and evaluation teams.

The Consortium continues to adapt to, and occasionally anticipate, community needs as technology targets and data requirements change over time. It also continues to promote language resources and to develop collaborations within the research communities focused on linguistic inquiry.