



ACE-2 Version 1.0

Item Name: ACE-2 Version 1.0
Authors: Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, and Beth Sundheim
LDC Catalog No.: LDC2003T11
ISBN: 1-58563-270-8
Release Date: Sep 02, 2003
Data Type: text
Data Source(s): broadcast news, newswire
Project(s): ACE, GALE, TIDES
Application(s): automatic content extraction, information detection
Language(s): English
Language ID(s): ENG
Distribution: Web Download
Member fee: $0 for 2003 members
Non-member Fee: US$1000.00
Reduced-License Fee: US$500.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Alexis Mitchell, et al. 2003. ACE-2 Version 1.0. Linguistic Data Consortium, Philadelphia.

Introduction

ACE-2 Version 1.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2003T11 and its ISBN is 1-58563-270-8.

This release contains Version 1.0 of the ACE-2 corpus, created and distributed by the LDC to support the Automatic Content Extraction (ACE) program. The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events. There are three main ACE tasks: Entity Detection and Tracking, Relation Detection and Characterization, and Event Detection and Characterization.

Annotations for the ACE-2 corpus were produced by the LDC to support two of these research tasks: Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC).

For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology (NIST), please visit the NIST website.

For information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit the ACE Project page at the LDC.

Data

This publication contains two sets of data: training and devtest. Each of these sets is further divided by source: broadcast news, newspaper, and newswire.

The training set contains data originally developed as training material for the February 2002 evaluation and used again for the September 2002 evaluation. The devtest set contains data originally developed as test data for the February 2002 evaluation and later used as devtest data for the September 2002 evaluation.

The broadcast and newswire source data is drawn from a subset of the TDT2 Multilanguage Text Version 4.0 (LDC2001T57); this has been supplemented with additional newspaper data from the Washington Post. A portion of the training broadcast data was drawn from the 1997 English Broadcast News Transcripts (HUB4) corpus (LDC98T28).

All material comes from the first half of 1998. The sources for the broadcast, newswire, and newspaper data are listed below.

Newswire
    New York Times Newswire Service (NYT)
    Associated Press Worldstream Service (APW)

Broadcast News
    Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
    American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
    Public Radio International, "The World" (PRI)
    Voice of America, English news programs (VOA)
    MSNBC, "The News With Brian Williams" (MNB)
    National Broadcasting Company, "Nightly News" (NBC)

Newspaper
    Washington Post (WAP)


This publication includes the source data files in .sgm format, the annotation files in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD, which was used for the September 2002 ACE Evaluation.

There are 179,007 words of source data, or 519 files, broken down as follows:

Source   # Words train   # Words devtest   # Files train   # Files devtest
NYT      32892           7487              48              9
APW      29144           7037              82              20
CNN      2290            2653              69              11
ABC      1588            2687              24              10
PRI      1272            5284              43              9
VOA      594             2611              24              7
MNB      0               2539              0               6
NBC      0               2633              0               8
WAP      60247           15070             76              17
ea       2019            0                 31              0
ed       1094            0                 25              0
Total    131023          47984             422             97
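
The headline figures quoted above (179,007 words, 519 files) can be cross-checked against the table with a few lines of arithmetic. The following is a minimal sketch, not part of the LDC release: the variable names are illustrative, the per-source file counts are transcribed by hand from the table, and the word totals are taken from the Total row as stated.

```python
# Per-source file counts transcribed from the table: (train, devtest).
# These values are copied by hand, so treat them as an assumption to verify.
file_counts = {
    "NYT": (48, 9),
    "APW": (82, 20),
    "CNN": (69, 11),
    "ABC": (24, 10),
    "PRI": (43, 9),
    "VOA": (24, 7),
    "MNB": (0, 6),
    "NBC": (0, 8),
    "WAP": (76, 17),
    "ea": (31, 0),
    "ed": (25, 0),
}

# Word totals exactly as given in the Total row: (train, devtest).
total_words = (131023, 47984)

# Sum each file column across all sources.
train_files = sum(train for train, _ in file_counts.values())
devtest_files = sum(devtest for _, devtest in file_counts.values())

all_files = train_files + devtest_files
all_words = sum(total_words)

print(f"train files:   {train_files}")
print(f"devtest files: {devtest_files}")
print(f"all files:     {all_files}")
print(f"all words:     {all_words}")
```

The file columns sum to the Total row (422 training files and 97 devtest files), and the two word totals add up to the 179,007 words quoted in the text.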


Updates

There are no updates available at this time.

Content Copyright

Portions © 1998 Los Angeles Times-Washington Post News Service, Inc., © 1998 American Broadcasting Corporation, © 1998 Cable News Network, Inc., © 1998 Press Association, Inc., © 1998 New York Times, © 1998 National Broadcasting Company, Inc., © 1998 Public Radio International, © 2003 Trustees of the University of Pennsylvania

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.