Introduction
This publication contains the complete set of English, Arabic and Chinese
training data for the 2005 Automatic Content Extraction (ACE) technology
evaluation. The corpus consists of data of various types annotated for
entities, relations and events. It was created by the Linguistic Data
Consortium (LDC) with support from the ACE Program, with additional
assistance from LDC. This data was previously distributed as an e-corpus
(LDC2005E18) to
participants in the 2005 ACE evaluation.
The objective of the ACE program is to develop automatic content extraction
technology to support automatic processing of human language in text
form.
In November 2005, sites were evaluated on system performance in five
primary areas: the recognition of entities, values, temporal expressions,
relations, and events. Entity, relation and event mention detection were
also offered as diagnostic tasks. All tasks except the event tasks were
performed for three languages: English, Chinese and Arabic. Event tasks
were evaluated in English and Chinese only. The current
publication comprises the official training data for these evaluation
tasks.
A complete description of the ACE 2005 Evaluation can be found on the ACE
Program website maintained by the National Institute of Standards and
Technology (NIST).
For more information about linguistic resources for the ACE Program,
including annotation guidelines, task definitions, free annotation tools
and other documentation, please visit LDC's ACE website.
Below is information about the amount of data included in the current
release and its annotation status.
- 1P: data subject to first pass (complete) annotation
- DUAL: data also subject to dual first pass (complete) annotation
- ADJ: data also subject to discrepancy resolution/adjudication
- NORM: data also subject to TIMEX2 normalization
English
| | 1P words | DUAL words | ADJ words | NORM words | 1P files | DUAL files | ADJ files | NORM files |
| NW | 60658 | 57807 | 33459 | 48399 | 128 | 124 | 81 | 106 |
| BN | 59239 | 58144 | 52444 | 55967 | 239 | 234 | 217 | 226 |
| BC | 46612 | 46110 | 33874 | 40415 | 68 | 67 | 52 | 60 |
| WL | 45210 | 43648 | 35529 | 37897 | 127 | 122 | 114 | 119 |
| UN | 45161 | 44473 | 26371 | 37366 | 58 | 57 | 37 | 49 |
| CTS | 47003 | 47003 | 34868 | 39845 | 46 | 46 | 34 | 39 |
| Total | 303833 | 297185 | 216545 | 259889 | 666 | 650 | 535 | 599 |
Chinese
Note: Chinese data is expressed in terms of characters. We assume
a correspondence of roughly 1.5 characters/word.
| | 1P chars | DUAL chars | ADJ chars | 1P files | DUAL files | ADJ files |
| NW | 127319 | 124175 | 121797 | 248 | 242 | 238 |
| BN | 134963 | 133696 | 120513 | 332 | 328 | 298 |
| WL | 71839 | 68063 | 65681 | 107 | 101 | 97 |
| Total | 334121 | 325834 | 307991 | 687 | 671 | 633 |
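To compare the Chinese totals with the word-based English and Arabic figures, the character counts can be converted using the stated ratio of roughly 1.5 characters per word. The following is a minimal sketch of that arithmetic only; the names `approx_words` and `CHINESE_TOTAL_CHARS` are illustrative and not part of the release, and the results are rough estimates rather than official counts.

```python
# Character totals for the Chinese data, copied from the Total row of the table above.
CHINESE_TOTAL_CHARS = {"1P": 334121, "DUAL": 325834, "ADJ": 307991}

# Assumption taken from the note above: roughly 1.5 characters per word.
CHARS_PER_WORD = 1.5


def approx_words(chars: int, chars_per_word: float = CHARS_PER_WORD) -> int:
    """Return a rounded word-equivalent estimate for a character count."""
    return round(chars / chars_per_word)


if __name__ == "__main__":
    for status, chars in CHINESE_TOTAL_CHARS.items():
        print(f"{status}: {chars} chars ~ {approx_words(chars)} words")
```

Under this assumption the 1P total of 334121 characters corresponds to about 222747 words.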
Arabic
| | 1P words | DUAL words | ADJ words | 1P files | DUAL files | ADJ files |
| NW | 61287 | 56158 | 53026 | 239 | 226 | 221 |
| BN | 29259 | 27165 | 26907 | 134 | 128 | 127 |
| WL | 21687 | 20181 | 20181 | 60 | 55 | 55 |
| Total | 112233 | 103504 | 100114 | 433 | 409 | 403 |
Content Copyright
Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003
New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network
LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System,
© 2000-2001 China National Radio, © 2000-2001 China Television System, ©
2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, ©
2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania