Curation and Distribution Services

LDC offers a range of services that meet NSF’s requirements for data management plans and can be customized for a project’s particular needs.

Data Curation

The curation process is composed of four basic steps.

  • Receive content from data provider.
  • Review data. This process includes conducting quality control checks and evaluating documentation.
  • Prepare data for deposit and submit to archive. Steps include developing a corpus description, applying the descriptive metadata schema and reviewing final formats and directory structure.
  • Monitor and manage archive. This is an ongoing activity. As needed, any updates, bug fixes and version control are applied to the corpus and data is migrated to new formats and platforms.

Data archived by LDC has the following characteristics compatible with DMP requirements:  

Persistence. NSF relies on the community to determine how long data should be archived and made available. LDC is committed to the long-term accessibility of all language data. Every corpus deposited at LDC remains available, including those created in the 1980s and placed at LDC at its founding.

Documentation. DMPs require a record of how data was collected. LDC assures compliance with this requirement by working with contributors to include specifications and documentation that describe the collection procedures, document formats for potential users and are comprehensible to an intelligent non-expert.

Data storage and back-up. LDC maintains in-house storage solutions currently accommodating over 200TB with the capability to scale to petabytes rapidly and transparently. Access policies prevent accidental or unauthorized changes to data. During preparation, write access is restricted to those actively preparing the publication. Once the data has been published, it becomes read-only to all. A back-up system includes snapshots, replication on disk, tape robots, cloud storage services and back-up servers. LDC back-up policies assure adequate protection according to the nature of the data. Copies of published data are also stored offsite. LDC ensures that data is migrated to new formats, platforms and storage media as required by best practices in the digital preservation community.
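The read-only-after-publication policy described above can be illustrated with a minimal sketch. It assumes a POSIX file system and a corpus stored as an ordinary directory tree. The function name publish_read_only and the directory layout are hypothetical and are not a description of LDC's actual tooling.

```python
import os
import stat
from pathlib import Path

# Mask covering the write-permission bits for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def _clear_write(path: Path) -> bool:
    """Remove write bits from one path; return True if its mode changed."""
    if path.is_symlink():
        # Leave symbolic links alone rather than changing their targets.
        return False
    mode = stat.S_IMODE(path.stat().st_mode)
    new_mode = mode & ~WRITE_BITS
    if new_mode != mode:
        os.chmod(path, new_mode)
        return True
    return False


def publish_read_only(corpus_dir: Path) -> int:
    """Clear write bits on corpus_dir and everything beneath it.

    Hypothetical helper: returns the number of paths whose mode changed.
    """
    changed = 0
    # Walk bottom-up so directories are tightened after their contents.
    for root, dirs, files in os.walk(corpus_dir, topdown=False):
        for name in files + dirs:
            if _clear_write(Path(root) / name):
                changed += 1
    if _clear_write(corpus_dir):
        changed += 1
    return changed
```

Calling publish_read_only(Path("my_corpus")) on a prepared directory would strip write permission from every file and subdirectory. File-mode bits alone do not stop a privileged account, so a real archive would combine a step like this with the access policies, snapshots and offsite copies mentioned above.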

Data Distribution  

LDC has an established track record for successfully distributing language resources to many users, by numerous methods and under a variety of licensing arrangements.

Over 108,000 copies of data have been distributed to more than 3,500 organizations on media (CD, DVD, BD, HD, USB) and through the cloud and grid. Licensing and distribution are increasingly automated. LDC’s licenses are compatible with the community’s customary uses as well as with intellectual property and human subjects concerns. Comprehensive recordkeeping ensures that users always know their rights to specific data sets.

LDC practices enhance resource usability, preserve contributors’ flexibility and ease the administrative burden. Among those are the following: 

Compatibility. LDC offers guidance on file-naming, metadata conventions, corpus structure and format. Using common conventions increases resource compatibility, an important consideration for follow-on research and scientific advancement in general.

Non-exclusive distribution. LDC does not insist on exclusive distribution rights to contributed data. Data creators may deposit their data at LDC and also distribute it through their institutional site or by other means. However, the corpus remains in LDC’s archives even after a data creator has changed institutions.

Management of property rights, privacy and ethical concerns. By depositing data with LDC, principal investigators, their institutions and third-party data providers retain their rights while licensing LDC to process, store and disseminate the resource to the community. In addition, LDC has vast experience in satisfying the legal and regulatory constraints imposed on data collection, annotation and archiving, and its staff includes experts on intellectual property, human subjects protection and export control. Its skill in research and corpus development translates into practices addressing ethical and privacy matters that can be applied to data management plans in the early stages.

Authorship. NSF expects investigators to properly acknowledge all contributions to research results. Recognizing the need for correct attribution, LDC requires contributors to name resource authors and includes authors as part of the descriptive metadata. In general, LDC considers authors to be those who make decisions on corpus design, structure or content.

Timed/delayed accessibility. The NSF standard of timeliness encourages rapid if not immediate distribution of resources it funds. However, data management plans allow for delayed release of sponsored data when circumstances require it. LDC has long experience in holding and protecting data for a timed or delayed release. For example, most NIST (National Institute of Standards and Technology) evaluation campaigns require that data be developed in advance, provided to campaign participants and then made publicly available only after the evaluation process is completed.

Licensing options. Although NSF expects that research data will be shared at an incremental cost, it also understands the need to protect intellectual property of commercial value. LDC implements procedures to protect the commercial value of language data including research-only licenses and referrals to data owners for commercial licensing.

Extensive distribution network. International research collaborations may require that the resulting data be shared according to the regulations of multiple funding agencies. As a not-for-profit organization deeply committed to the broad availability of language resources, LDC maintains strong, positive relationships with other data centers around the world including the European Language Resources Association, the Linguistic Data Consortium for Indian Languages, the South African Language Resource Management Agency and the US Government Catalog of Language Resources, among others. LDC also has a large network of connections with data contributors and users across the globe.

Costs. Numerous archives and sponsored research offices have responded to the need for DMP cost justifications by assessing a DMP fee on new proposals that is calculated as a percentage of the overall project budget. Standard percentage increments range from 2% to 10%. LDC’s approach is instead based on the actual costs of assuring long-term preservation, which depend on several factors:

  • Is the grant covering archive and distribution costs or are some costs borne by data recipients?
  • Will the data be hosted in the catalog as it is deposited or will LDC perform its standard quality checks and reviews?
  • How will the data be distributed? Costs differ between web-based resources and those delivered on media and, for media, vary with media type and shipping fees.
  • How many copies of the resource are expected to be delivered at no cost to recipients and over what time frame?

In general, the goal is to reach an understanding on the issues above and develop a funding model that meets the actual costs and project goals, for example a budget that covers archive and distribution costs for the life of the grant plus two years.
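To make the contrast between a percentage-based DMP fee and a cost-based budget concrete, the sketch below works through the arithmetic. Only the 2% to 10% range and the "life of the grant plus two years" example come from the text above. The function names, the cost components and every figure passed to them are hypothetical and do not represent LDC pricing.

```python
from decimal import Decimal

# Percentage range cited in the text for fee-based approaches.
LOW_RATE = Decimal("0.02")
HIGH_RATE = Decimal("0.10")


def percentage_fee_range(project_budget: Decimal) -> tuple[Decimal, Decimal]:
    """Return the (low, high) DMP fee for a percentage-of-budget model."""
    if project_budget < 0:
        raise ValueError("project_budget must be non-negative")
    return project_budget * LOW_RATE, project_budget * HIGH_RATE


def itemized_estimate(
    annual_archive_cost: Decimal,
    cost_per_copy: Decimal,
    free_copies: int,
    grant_years: int,
    extra_years: int = 2,
) -> Decimal:
    """Hypothetical cost-based estimate.

    Archive costs accrue for the life of the grant plus extra_years;
    distribution costs cover copies delivered at no charge to recipients.
    """
    if free_copies < 0 or grant_years < 0 or extra_years < 0:
        raise ValueError("counts and durations must be non-negative")
    years = grant_years + extra_years
    return annual_archive_cost * years + cost_per_copy * free_copies


if __name__ == "__main__":
    # Illustrative inputs only; none of these figures come from LDC.
    low, high = percentage_fee_range(Decimal("500000"))
    estimate = itemized_estimate(Decimal("1000"), Decimal("25"), 40, 3)
    print(f"Percentage model: {low} to {high}")
    print(f"Itemized model:   {estimate}")
```

The percentage model depends only on the size of the project budget, while the itemized model changes with the answers to the questions listed above, such as how many copies are delivered free of charge and over what time frame.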