LDC Overview

In the early 1990s, as Human Language Technologies advanced thanks to improved algorithms and the increased availability of affordable microcomputers, private and public researchers were working in earnest to develop capabilities in speech and text processing. Leaders in the associated disciplines recognized, however, that they lacked the volume and variety of data necessary to build robust, portable and scalable systems. The Linguistic Data Consortium (LDC) was founded to simplify the distribution of critical linguistic data.

Charles L. Wayne, then Technical Officer at ARPA's Software and Intelligent Systems Technology Office, issued a call for proposals to host a consortium devoted to acquiring, archiving, preserving and distributing linguistic corpora. Professor Mark Liberman wrote the response from the University of Pennsylvania that was selected from among competing proposals. Penn's longstanding linguistics tradition, reputation in computer science and more recent work on the Penn Treebank [1], as well as Professor Liberman's work on the ACL/DCI initiative, made the university a good fit.

The LDC Catalog [2] was immediately populated with several important data sets donated by government and private sources, such as TIMIT [3], ATIS [4] and Switchboard [5], that continue to be widely used by the community. The Catalog continues to grow each year with new corpus releases developed by LDC and contributed by partners around the globe.

LDC, which began as a drawer in Professor Liberman's desk, has since grown to 50 full-time employees located within the University City Science Center [6] in Philadelphia.

LDC engages in collaborations with US and foreign researchers, institutions and data centers. The Consortium supports various initiatives that promote language resource development and distribution, such as the Open Language Archives Community [7] (OLAC), the Universal Catalog [8] (ELRA [9]), and the Language Grid [10]. LDC’s work in sponsored programs has grown in size and scope to include consultation, needs analysis, task specification, data collection, development of annotation guidelines and software tools, management of multiple data providers and coordination with program sponsors, research performers and evaluation teams.

After three decades as the leader in language resource development and distribution, LDC continues its mission of providing large quantities of diverse data, research program support and high-quality member services. Human language technology development and its related fields are changing rapidly and need effective digital resource delivery, greater language coverage, new data genres, faster and more cost-efficient annotation processes, and flexible tools. The Consortium successfully meets those challenges and will continue to do so with the support of members, licensees, sponsors and collaborators.