The New York Times Annotated Corpus illustrates how data published in LDC’s Catalog can become an important resource for the community. The New York Times is one of LDC’s earliest data providers; the billions of words of news text it has provided for language resources since the 1990s continue to be used today for research and technology development. Its contribution of the New York Times Annotated Corpus in 2008 opened a new dimension for research with summaries, tags and parsing tools for close to two million news articles spanning a twenty-year period. Researchers immediately recognized the significance of this resource. In its brief history in the Catalog, the corpus has become one of the top ten most distributed data sets and has inspired over 200 research papers.

The top ten list reflects how LDC data contributes to the work of our global community. It includes data sets published over two decades ago that are regarded as benchmark resources essential for new entrants to the field, as well as more recent releases that support users’ ever-growing and changing needs. The papers written about LDC data – over 13,000 unique publications that we’ve found to date – confirm the impact of the Consortium’s archive in supporting continued work and scientific progress.