The ACL Anthology Reference Corpus, Linguistic Data Consortium (LDC) catalog number
LDC2009T29 and ISBN 1-58563-531-6, is a digital archive of 10,921 research papers
in computational linguistics sponsored by the Association for Computational
Linguistics (ACL). This release, which is also available from the ACL,
contains most of the papers that appeared in the web-based ACL Anthology
up to February 2007. The Anthology is
a dynamic repository that currently hosts over 16,500 articles
drawn from a range of conferences and workshops as well as past issues of the
Computational Linguistics journal. The ACL Anthology Reference Corpus is designed
to be a standard, real-world digital collection testbed for experiments in
bibliographic and bibliometric research.
The ACL is the international scientific and professional society for scholars
working on problems involving natural language and computation. Membership includes
the ACL quarterly journal, Computational Linguistics, reduced registration
at most ACL-sponsored conferences, discounts on ACL-sponsored publications and
participation in ACL Special Interest Groups. Since 1974, Computational
Linguistics has been the primary forum for research on computational linguistics
and natural language processing.
The material in the ACL Anthology Reference Corpus was scanned at 600 dpi grayscale
for archival storage, down-sampled to 300 dpi black-and-white, assembled into
articles and stored in the "PDF Image with Hidden Text" format.
Author and title metadata was extracted from the OCRed text and used to build
HTML index pages. Older materials, such as conference proceedings from the 1960s
and early volumes of Computational Linguistics, were manually digitized
from microfiche slides.
The ACL Anthology Reference Corpus includes:
- 10,921 PDF files in the pdf/anthology-PDF tree
- 13,551 files with metadata described in the metadata/anthology-XML tree
- 84,542 pages in the PDF files
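The file counts above can be checked against a local copy with a short script. The following is a minimal Python sketch, not part of the release: the root directory name LDC2009T29 is an assumed unpack location, the helper name count_by_suffix is hypothetical, and because the extension of the metadata files is not stated here, the script counts every file in the metadata tree rather than filtering by extension.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count regular files under root, keyed by lowercased file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the corpus was unpacked into a directory named LDC2009T29.
    corpus_root = Path("LDC2009T29")
    pdf_counts = count_by_suffix(corpus_root / "pdf" / "anthology-PDF")
    meta_counts = count_by_suffix(corpus_root / "metadata" / "anthology-XML")
    print("PDF files:", pdf_counts[".pdf"])
    # All files in the metadata tree are counted, whatever their extension.
    print("Metadata files:", sum(meta_counts.values()))
```

If the printed totals differ from the figures listed above, the copy may be incomplete or may contain extra files such as index pages.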
Portions © 1963-2006 Association for Computational Linguistics, ©
2009 Trustees of the University of Pennsylvania