Human language technologies require large amounts of data to train, develop and test models and systems. Data quality and system effectiveness are directly related: good data makes good systems.
LDC ensures that the community has access to high-quality data sets through effective data management practices that cover such matters as accessibility, usability, curation and archiving.
LDC’s data curation process includes intake, data review, quality checks, metadata creation, documentation development and archive management. Data deposited at LDC and published in the Catalog is archived in a logical data tree subject to a specialized backup system, from which it can be migrated to new formats, platforms and storage media in accordance with best practices in the digital preservation community (DiPersio & Cieri, 2019).
These activities add value to data and are critical for data discovery and retrieval, usability, and long-term data sharing and re-use for future research.
LDC’s Catalog is a CoreTrustSeal-certified data repository that meets high standards for data access, rights management, curation, data integrity and authenticity, archival storage and security. The Catalog also consistently receives the highest (five-star) rating for metadata quality from the Open Language Archives Community (OLAC).
Visit Data Management for information on using LDC data, contributing data to the Catalog, data management plans and citing LDC data.