ACE 2005 Multilingual Training Corpus
LDC2006T06

February 15, 2006
Linguistic Data Consortium


1. Introduction

This file contains documentation on the ACE 2005 Multilingual Training Corpus, Linguistic Data Consortium (LDC) catalog number LDC2006T06 and ISBN 1-58563-376-3.

This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events. It was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks with the exception of event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks.

A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST):

    http://www.nist.gov/speech/tests/ace/

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website:

    http://projects.ldc.upenn.edu/ace/


2. Annotation

2.1 Tasks and Guidelines

Data contained in this release has been annotated for the following tasks:

  - Entities
  - Values (including TIMEX2 objects)
  - Relations between entities, including relation timestamps
  - Events, including event timestamps
  - TIMEX2 normalization for English data

The latest annotation guidelines for each language can be downloaded from LDC's ACE website:

    http://projects.ldc.upenn.edu/ace/annotation

2.2 Annotation Process

Training data files for all languages are dually annotated for all tasks by two annotators working independently. The first pass (complete) annotation is called 1P; the independent dual first pass (complete) annotation is called DUAL. For both 1P and DUAL, a single annotator completes all tasks (entities, values, relations and events) for a file. Files are assigned via an automated Annotation Workflow System (AWS), and file assignment is double-blind.

Discrepancies between the 1P and DUAL versions of each file are then adjudicated by a senior annotator or team leader, resulting in a high-quality gold standard file. The gold standard adjudicated file is known as ADJ.

After adjudication, TIMEX2 values are normalized for English only. This task is known as NORM.

Note that this annotation process differs substantially from previous years, in which data was first passed (1P) by a single junior annotator; that person's work was then second passed (2P), or reviewed for completeness and accuracy, by a senior annotator; and additional quality control (QC) spot-checks were then conducted by the team leader. The 2005 process should result in a final corpus of ADJ data that is of higher quality and more consistent than in previous ACE corpora. Note, however, that due to time and funding constraints, most but not all files have been adjudicated or indeed dually annotated.

The corpus also includes additional quality control checks conducted by team leaders on the ADJ files.
The full annotation process for 2005 is represented below:

    1P: entities            DUAL: entities
        values                    values
        events                    events
        relations                 relations
            |                        |
            |                        |
            |____________|___________|
                         |
                         |
                         |
                         V
                   ADJ: entities
                        values
                        events
                        relations
                         |
                         |
                         |
                         V
                  NORM: TIMEX2 normalization
                        (English only)


3. Source Data Profile

3.1 Data Selection Process

A new feature of the 2005 ACE training corpus is careful, targeted data selection. Rather than choosing files at random for annotation, this year's task requires a certain density of annotation across the corpus. The established target, agreed upon at the Fall 2004 ACE Workshop, is 50 examples of each entity, relation and event type/subtype within the training corpus for each language.

Note that the "50-example threshold" is simply a target and not a hard-and-fast requirement of the corpus. LDC has made a concerted effort to identify at least 50 examples of each type/subtype, but has likely fallen short of the goal in some cases.

What follows is a brief description of our data selection process. First, a pool of documents substantially larger than the target dataset was quickly labeled by ACE annotators as "good" or "bad" for ACE annotation. The good/bad determination was based on the document content, including the number and type of entities, relations and events. Annotators then reviewed the "good" documents and produced a rough estimate of the number of each type/subtype of entity, relation and event mentioned in the document. In practice, for most genres this involved a binary yes/no distinction of whether a given subtype appeared in each document. Documents were then algorithmically selected from this set to maximize the overall count of each type/subtype. This process was supplemented by manual keyword searching focused on the rarest annotation types.
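
This README does not specify the selection algorithm beyond the description above. Purely as an illustration, the Python sketch below shows one way such a selection could work: a greedy loop that repeatedly picks the document contributing the most still-needed examples toward a per-subtype target. The function name select_documents, the input layout (a dict of per-document subtype estimates) and the tie-breaking rule are assumptions made for this example and are not LDC's actual procedure.

```python
from collections import Counter


def select_documents(doc_counts, target=50):
    """Greedily pick documents until no document adds a needed example.

    doc_counts maps a document id to {subtype: estimated count}; a
    binary yes/no estimate can be encoded as a count of 1.
    Returns (selected ids in selection order, per-subtype totals).
    This is a hypothetical sketch, not the procedure used by LDC.
    """
    totals = Counter()
    remaining = dict(doc_counts)
    selected = []

    def gain(counts):
        # Credit only examples of subtypes that are still below target.
        return sum(min(n, max(target - totals[t], 0))
                   for t, n in counts.items())

    while remaining:
        # Ties are broken by the smallest document id (an assumption).
        best = max(sorted(remaining), key=lambda d: gain(remaining[d]))
        if gain(remaining[best]) == 0:
            break  # nothing left would help any subtype below target
        selected.append(best)
        totals.update(remaining.pop(best))
    return selected, totals


# Toy example with made-up documents and a small target.
docs = {
    "a": {"PER.Individual": 3, "Life.Marry": 1},
    "b": {"PER.Individual": 1},
    "c": {"Life.Marry": 2},
}
selected, totals = select_documents(docs, target=3)
```

In the toy example, document "b" is never selected because PER.Individual already meets the target once "a" has been chosen. A greedy pass like this does not guarantee the smallest possible document set; it is only meant to make the idea of selecting to maximize subtype counts concrete.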

3.2 Training Data Sources and Epochs

Below is a description of the data sources and epochs for each language, along with a rough percentage of each data type to be expected in the complete (final) training corpus. Some of the source data is drawn from previous LDC publications, including TDT4 Multilingual Text and Annotations corpus (LDC2005T16) and English Gigaword Second Edition (LDC2005T12).

English

  * Newswire (NW): 20%
      sources: AFP (Agence France Presse - English), APW (Associated Press), NYT (New York Times), XIN (Xinhua News Agency - English)
      training epoch: March-June 2003
      test epoch: July-Aug 2003

  * Broadcast News (BN): 20%
      sources: CNN (Cable News Network), CNNHL (CNN Headline News)
      training epoch: March-June 2003
      test epoch: July-Aug 2003

  * Broadcast Conversation (BC): 15%
      sources: CNN_CF (CNN CrossFire), CNN_IP (CNN Inside Politics), CNN_LE (CNN Late Edition)
      training epoch: March-June 2003
      test epoch: July-Aug 2003

  * Weblog (WL): 15%
      sources: various internet weblogs (shared online journals)
      training epoch: Nov 2004-Feb 2005
      test epoch: March-April 2005

  * Usenet Newsgroups/Discussion Forum (UN): 15%
      sources: various internet discussion forums/bulletin boards
      training epoch: Nov 2004-Feb 2005
      test epoch: March-April 2005

  * Conversational Telephone Speech (CTS): 15%
      sources: EARS Fisher 2004 Telephone Speech Collection Supplement
      training epoch: Nov-Dec 2004
      test epoch: Nov-Dec 2004

Chinese

  * Newswire (NW): 40%
      sources: XIN (Xinhua News Agency), ZBN (Zaobao News Agency)
      training epoch: Oct-Dec 2000
      test epoch: Jan 2001

  * Broadcast News (BN): 40%
      sources: CBS (China Broadcasting System), CNR (China National Radio), CTS (China Television System), CTV (China Central TV), VOM (Voice of America - Mandarin)
      training epoch: Oct-Dec 2000
      test epoch: Jan 2001

  * Broadcast Conversation (BC): 0%
      Not targeted for Chinese

  * Weblog (WL): 20%
      sources: various internet weblogs (shared online journals)
      training epoch: Nov 2004-Feb 2005
      test epoch: March-April 2005

  * Usenet Newsgroups/Discussion Forum (UN): 0%
      Not targeted for Chinese

  * Conversational Telephone Speech (CTS): 0%
      Not targeted for Chinese

Arabic

  * Newswire (NW): 40%
      sources: AFA (Agence France Presse - Arabic), ALH (Al Hayat), ANN (An Nahar)
      training epoch: Oct-Dec 2000
      test epoch: Jan 2001

  * Broadcast News (BN): 40%
      sources: NTV (Nile TV), VAR (Voice of America - Arabic)
      training epoch: Oct-Dec 2000
      test epoch: Jan 2001

  * Broadcast Conversation (BC): 0%
      Not targeted for Arabic

  * Weblog (WL): 20%
      sources: various internet weblogs (shared online journals)
      training epoch: Nov 2004-Feb 2005
      test epoch: March-April 2005

  * Usenet Newsgroups/Discussion Forum (UN): 0%
      Not targeted for Arabic

  * Conversational Telephone Speech (CTS): 0%
      Not targeted for Arabic


4. Annotation Data Profile

Below is information about the amount of data included in the current release and its annotation status.

  1P:   data subject to first pass (complete) annotation
  DUAL: data also subject to dual first pass (complete) annotation
  ADJ:  data also subject to discrepancy resolution/adjudication
  NORM: data also subject to TIMEX2 normalization

English

         ==============words==============      =========files==========
             1P     DUAL      ADJ     NORM       1P   DUAL    ADJ   NORM
NW        60658    57807    33459    48399      128    124     81    106
BN        59239    58144    52444    55967      239    234    217    226
BC        46612    46110    33874    40415       68     67     52     60
WL        45210    43648    35529    37897      127    122    114    119
UN        45161    44473    26371    37366       58     57     37     49
CTS       47003    47003    34868    39845       46     46     34     39
         ---------------------------------      ------------------------
Total    303833   297185   216545   259889      666    650    535    599

Chinese

Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.

         =========chars==========      ======files======
             1P     DUAL      ADJ       1P   DUAL    ADJ
NW       127319   124175   121797      248    242    238
BN       134963   133696   120513      332    328    298
WL        71839    68063    65681      107    101     97
         ------------------------      -----------------
Total    334121   325834   307991      687    671    633

Arabic

         =========words==========      ======files======
             1P     DUAL      ADJ       1P   DUAL    ADJ
NW        61287    56158    53026      239    226    221
BN        29259    27165    26907      134    128    127
WL        21687    20181    20181       60     55     55
         ------------------------      -----------------
Total    112233   103504   100114      433    409    403


5. Data Directory Structure

The data are organized by language, data type and annotation status as follows:

  fp1:        data subject to first pass (complete) annotation
  fp2:        data also subject to dual first pass (complete) annotation
  adj:        data also subject to discrepancy resolution/adjudication
  timex2norm: data also subject to TIMEX2 normalization

So, for instance, if a source file has been dually annotated, you will find an .apf.xml annotation file in each of "fp1" and "fp2".

The "FileList" files contain information about the word (for English and Arabic) or character (for Chinese) counts and annotation status for each file in the release.


6. File Format Description

Each directory contains files of the following formats. For most users, the most important files are the .sgm files and .apf.xml files.

  Source Text (.sgm) Files
    - These files contain the source text in an SGML format. They use UNIX-style line endings, and all .sgm files are encoded in UTF-8.

  ACE Program Format (APF) (.apf.xml) Files
    - These files are in the official ACE annotation file format. See section 9 (Notes About APF) for more details.

  AG (.ag.xml) Files
    - These are annotation files created with the LDC's annotation toolkit. These files have been converted to the corresponding .apf.xml files.

  ID table (.tab) Files
    - These files store mapping tables between the IDs used in the ag.xml files and their corresponding apf.xml files.


7. Data Validation

Below is a description of the sanity checks and other format validation steps applied to annotation files created by LDC. (Note that files created by Valorem have had only two sanity checks applied: validation of .xml, and self-scoring using NIST's ACE scorer, ace05-eval-v11.pl.)

  -- Extents stripped of all spaces and punctuation at front and back
  -- GPE mentions without roles were fixed
  -- For non-GPE mentions with roles, roles were removed
  -- All non-complex entity mentions have heads. For APF, this means that all entity mentions have heads
  -- No English passages are annotated in non-English files
  -- All relation mentions have exactly two non-timex2 arguments
  -- All relation arguments are contained in the extent of the relation mention
  -- All event mention arguments are contained in the extent of the event mention
  -- All NAMPRE and NOMPRE GPE mentions have GPE as their role
  -- No relations have mentions from the same entity as their only non-timex2 arguments
  -- All files have exactly one timex2 annotation in the DATETIME field
  -- No annotation extents overlap without nesting (entity mention, relation mention, event mention, value mention, entity mention head, event mention anchor)
  -- There are no annotations inside of sgm tags
  -- There are no instances where an entity and an event share exactly the same head/anchor
  -- All relation arguments have types that are allowed for their argument position based on their entity/value type
  -- All event arguments have types that are allowed for their role based on their entity/value type
  -- All entities, values, relations, events have permissible type-subtype pairs
  -- All files successfully convert to APF
  -- All APF files validate against DTD
  -- All APF files can be scored against themselves
  -- All instances of cross-type metonymy manually reviewed
  -- All instances of co-extensive entity mentions with the same heads manually reviewed
  -- Check for event mentions whose anchor is the full extent of the mention
  -- Manual scan of all PRO extents for outliers in adjudicated files
  -- Manual scan of all NOM heads with different entity type/subtype values in different parts of the corpus (adjudicated files only)
  -- Manual scan of all NAM heads with different entity type/subtype values in different parts of the corpus (adjudicated files only)
  -- Manual scan of all relation mentions by relation type/subtype and argument type/subtype for outliers in adjudicated files
  -- Manual and automatic scans of mention extents by patterns to identify inconsistencies in adjudicated files
  -- Search for untagged pronouns (English, Arabic)
  -- Search for English Building-Grounds mentions containing "Airport" or "Airfield"
  -- Search for untagged relative clauses (English)
  -- Search for demonstratives tagged as WHQ (Arabic)
  -- Search for relation arguments in violation of the Blocking Rule as defined in the annotation guidelines
  -- Search for non-Verbal, non-Other, co-extensive relation mentions
  -- Search for relations with frequently-confused types based on argument types (in particular, PHYS.Located vs. ART.UOIM and ORG-AFF.Employment vs. GPE-AFF.CRRE)
  -- Search for co-extensive, co-related event mentions
  -- Scan entities whose mentions appear more than once in the argument structure of a single event mention
  -- Scan all clitic pronoun mentions that are not participants in the event whose anchor they are attached to (Arabic)
  -- Scan all unannotated common TIMEX2 triggers (English)
  -- Manually examine and correct or describe all fatal errors and warnings generated by the most recent version of the scorer


8. Notes About Scorer Warnings

The current version of the ACE scorer (v11) still generates warnings for a handful of the files included in this distribution for both Chinese and English. The LDC has manually reviewed these annotations and determined that the annotation is as correct as possible given this year's annotation guidelines.
The warnings that remain are as follows:

English

  Warning: "appear to be redundant mentions"

  Description: Two mentions of the same event in the same extent may have access to the same arguments by the Typeable Reader Rule. Generally, at least one of the mentions is a nominal mention of the event.

    AFP_ENG_20030320.0722.apf.xml       EV5-12 and EV5-11
    AFP_ENG_20030320.0722.apf.xml       EV6-3 and EV6-1
    AFP_ENG_20030527.0616.apf.xml       EV1-3 and EV1-2
    AFP_ENG_20030509.0345.apf.xml       EV4-2 and EV4-1
    APW_ENG_20030326.0190.apf.xml       EV10-5 and EV10-4
    APW_ENG_20030416.0581.apf.xml       EV5-2 and EV5-1
    APW_ENG_20030424.0532.apf.xml       EV13-2 and EV13-1
    CNN_ENG_20030417_063039.0.apf.xml   EV2-2 and EV2-1
    CNN_ENG_20030526_183538.3.apf.xml   EV7-2 and EV7-1
    CNN_ENG_20030610_123040.9.apf.xml   EV7-2 and EV7-1
    CNN_IP_20030402.1600.02-1.apf.xml   EV8-2 and EV8-1
    CNN_IP_20030404.1600.00-1.apf.xml   EV9-2 and EV9-1

  Description: A subset of this type of warning is generated by specific compound expressions like "suicide bomber" and "release on parole" that always contain two mentions of the same event.

    AFP_ENG_20030305.0918.apf.xml       EV37-2 and EV37-1
    APW_ENG_20030416.0581.apf.xml       EV2-2 and EV2-1
    rec.arts.sf.written.robert-jordan_20050208.1350.apf.xml   EV2-11 and EV2-10

  Warning: equivalent event elements

  Description: This warning is raised when two non-coreferenced event mentions in the same extent share the same arguments. It seems to occur when one event is part of another event mentioned in the extent and when two or more events of the same type with the same arguments appear in a coordination structure.

    CNN_ENG_20030612_173004.10.apf.xml      EV10 and EV11
    MARKBACKER_20041220.0919.apf.xml        EV1 and EV6
    misc.survivalism_20050210.0232.apf.xml  EV3, EV4 and EV5

Chinese

  Warning: equivalent event elements

  Description: This warning is raised when two non-coreferenced event mentions in the same extent share the same arguments. It seems to occur when one event is part of another event mentioned in the extent and when two or more events of the same type with the same arguments appear in a coordination structure.

    XIN20001121.0200.0021.apf.xml   EV2 and EV3
    CNR20001019.1700.0858.apf.xml   EV3 and EV4

  Warning: "appear to be redundant mentions"

  Description: Two mentions of the same event in the same extent may have access to the same arguments by the Typeable Reader Rule. Generally, at least one of the mentions is a nominal mention of the event.

    XIN20001225.0800.0058.apf.xml   EV3-5 and EV3-2

Arabic

  Warning: equivalent event elements

  Description: The two events are not coreferenced, but they share the same arguments. The text roughly translates to something along the lines of "she hit him and attacked him".

    ANN20001124.1500.0090.apf.xml   EV11 and EV4


9. Notes About APF

- Offsets

  APF uses the offset counting method traditionally used in previous ACE evaluation programs:

    1) Each (UTF-8) character, not byte, is counted as one.
    2) Each newline character is counted as one. (The .sgm files use the UNIX-style end of line characters.)
    3) SGML tags are *not* counted towards offsets. (Please note that the AG files included in this release do count SGML tags in offsets.)

- TIMEX2 (new this year)

  The timex2 element represents TIMEX2 temporal expression annotations. Its optional attributes, such as "VAL" and "MOD", represent the TIMEX2 normalization values. Note that LDC is creating timex2 annotation on all files, but not performing timex2 normalization of all of the files.

  Timestamps in relations and events are represented as references to timex2 annotations in relation_arguments and event_arguments (and as references to timex2_mention annotations in relation_mention_arguments and event_mention_arguments). These timestamp arguments have roles that start with "Time-".
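
The three offset rules listed under "Offsets" in this section can be illustrated with a short Python sketch. It decodes the source as text (so that characters, not bytes, are counted), removes SGML tags while keeping newlines, and then slices the result. The helper names strip_sgml_tags and extract are hypothetical, the sample document is made up, and treating the end offset as inclusive is an assumption of this sketch; character entity references are not handled.

```python
import re

# Matches a single SGML tag such as <DOC> or </TEXT>.
TAG_RE = re.compile(r"<[^>]*>")


def strip_sgml_tags(sgm_text):
    """Drop SGML tags but keep every other character, including newlines."""
    return TAG_RE.sub("", sgm_text)


def extract(sgm_text, start, end):
    """Return the characters at offsets start..end (end assumed inclusive)."""
    return strip_sgml_tags(sgm_text)[start:end + 1]


# Made-up sample; a real file would be read with
# open(path, encoding="utf-8").read() so that offsets count characters.
sample = "<DOC>\n<TEXT>\nZürich voted.\n</TEXT>\n</DOC>\n"

city = extract(sample, 2, 7)    # the two leading newlines occupy offsets 0-1
verb = extract(sample, 9, 13)
```

Here "ü" occupies one offset position even though it is two bytes in UTF-8, and the newlines that follow the removed tags still count. The same approach would not work for the AG files, whose offsets include SGML tags.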

- Extent and Scope in Event Mentions (new this year)

  In response to requests made at the 2005 mid-course correction workshop, the extent of an event_mention is now an automatically generated minimum string that includes its anchor and its event_mention_arguments. The "ldc_scope" element stores the scope marked in the LDC's annotation process.

- REFID (new this year)

  The REFID attributes used in relation_argument and event_argument refer to entity, value or timex2 IDs. The REFID attributes used in relation_mention_argument and event_mention_argument refer to entity_mention, value_mention or timex2_mention IDs.

- TYPE, LDCTYPE and LDCATR in entity_mention

  The TYPE attribute of entity_mention stores the official ACE entity mention types, and the LDCTYPE and LDCATR attributes store the attributes used in the LDC's annotation process.

- "Unspecified" TENSE for "Other" MODALITY in relations (new this year)

  If the MODALITY attribute of a relation is set to "Other", the TENSE attribute is automatically set to "Unspecified". This is not true for events.

- name in entity_attributes

  The "name" element in entity_attributes stores the heads of "NAM"-type mentions, as in previous years. In response to George Doddington's request, we have added the NAME attribute to the "name" element. The NAME attribute stores slightly normalized versions of the names where:

    - \n is replaced with a space
    - multiple spaces are reduced to one space
    - " (double quote) is removed

  Example: United States

- Nickname metonymy

  Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in entity_mentions. "NAM"-type entity mentions marked as nickname metonymy do not give rise to name elements.

- Cross-type metonymy

  "Cross-type" metonyms are represented with relations of the type METONYMY. The METONYMY type relations do not have relation_mentions.

- For more details, please refer to the APF V5.1.1 DTD.


10. DTDs

The following DTDs are in the dtd subdirectory.

  apf.v5.1.1.dtd             - XML DTD for APF files (Updated from apf.v5.0.0.dtd; please see section "0. NEWS" in this document.)
  ace_source_sgml.v1.0.2.dtd - SGML DTD for .sgm files
  ag-1.1.dtd                 - XML DTD for AG files


11. Copyright Information

Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania


12. Contact Information

If you have questions about this data release, please contact the following personnel at the LDC.

  Christopher Walker - ACE Project Manager
  Stephanie Strassel - LDC Annotation Group Director/ACE Consultant
  Julie Medero       - ACE Lead Developer
  Kazuaki Maeda      - Technical Consultant/Manager

README Created January 9, 2006 Julie Medero
       Updated January 9, 2006 Stephanie Strassel