TIDES Extraction (ACE) 2003 Multilingual Training Data
| Item Name: | TIDES Extraction (ACE) 2003 Multilingual Training Data |
| Authors: | Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, and Beth Sundheim |
| LDC Catalog No.: | LDC2004T09 |
| ISBN: | 1-58563-292-9 |
| Release Date: | Feb 16, 2004 |
| Data Type: | text |
| Data Source(s): | broadcast news, newswire, transcribed speech |
| Project(s): | ACE, GALE, TIDES |
| Application(s): | automatic content extraction, information detection |
| Language(s): | English, Mandarin Chinese, Modern Standard Arabic |
| Language ID(s): | arb, eng |
| Distribution: | Web Download |
| Member fee: | $0 for 2004 members |
| Non-member Fee: | US $3000.00 |
| Reduced-License Fee: | US $1500.00 |
| Extra-Copy Fee: | N/A |
| Non-member License: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Alexis Mitchell, et al. 2004. TIDES Extraction (ACE) 2003 Multilingual Training Data. Linguistic Data Consortium, Philadelphia. |
Introduction
TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by
Linguistic Data Consortium (LDC). It is distributed under catalog number
LDC2004T09 and ISBN 1-58563-292-9.
This corpus was created and previously distributed by Linguistic Data Consortium as an e-corpus (catalog number LDC2003E18)
to support the September 2003 TIDES Extraction (ACE) program evaluation.
For information regarding the ACE program and ACE technology evaluations
administered by the National Institute of Standards and Technology, please
visit the
NIST website.
For more information about ACE annotation and ongoing ACE corpus
development, including annotation guidelines, task definitions, annotation
tools and other project documentation, please visit LDC's
ACE Project page.
The source material for this corpus consists of broadcast and newswire data
drawn from October 2000 through the end of December 2000.
The sources are listed below.
Newswire:
- Arabic
- Agence France-Presse (AFA)
- Al Hayat (ALH)
- An-Nahar (ANN)
- Chinese
- Xinhua Newswire (XIN)
- Zaobao (ZBN)
- English
- New York Times Newswire Service (NYT)
- Associated Press Worldstream Service (APW)
Broadcast News:
- Arabic
- Voice of America, Arabic news programs (VAR)
- Nile TV (NTV)
- Chinese
- China National Radio (CNR)
- China Television System (CTS)
- Voice of America, Chinese news programs (VOM)
- China TV Program Agency (CTV)
- China Broadcasting System (CBS)
- English
- Cable News Network, "Headline News" (CNN)
- American Broadcasting Co., "World News Tonight" (ABC)
- Public Radio International, "The World" (PRI)
- Voice of America, English news programs (VOA)
- MSNBC, "The News With Brian Williams" (MNB)
- National Broadcasting Company, "Nightly News" (NBC)
Data
Annotations for this corpus were produced by Linguistic Data Consortium to
support the following tasks broken down by language:
Arabic
- Entity Detection and Tracking (EDT)
Chinese
- Entity Detection and Tracking (EDT)
- Relation Detection and Characterization (RDC)
English
- Entity Detection and Tracking (EDT)
- Relation Detection and Characterization (RDC)
This publication includes both the source data files in .sgm format and the
annotation files in ACE Pilot Format (APF), as well as the ACE DTD and
supporting documentation.
The data files for each language are divided by source
type (bnews, nwire). For Chinese, the annotation files (.apf.xml) are
encoded in UTF-8, and the source files (.sgm) are included in both GB and
UTF-8 encodings. The following tables outline the word and file counts by
language and source.
Arabic
| Source | Words | Files |
| AFA | 11154 | 66 |
| ALH | 7437 | 20 |
| ANN | 7734 | 20 |
| VAR | 8360 | 57 |
| NTV | 7512 | 43 |
| Total | 42197 | 206 |
Chinese
| Source | Characters | Files |
| XIN | 28157 | 57 |
| ZBN | 25591 | 42 |
| CNR | 4758 | 21 |
| CTS | 7160 | 22 |
| VOM | 18160 | 42 |
| CTV | 6017 | 18 |
| CBS | 8130 | 19 |
| Total | 97973 | 221 |
English
| Source | Words | Files |
| NYT | 18983 | 24 |
| APW | 38222 | 81 |
| CNN | 5706 | 54 |
| ABC | 4453 | 15 |
| PRI | 9785 | 27 |
| VOA | 4203 | 28 |
| MNB | 4356 | 8 |
| NBC | 4976 | 15 |
| Total | 90684 | 252 |
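Because the Chinese source files (.sgm) are supplied in both GB and UTF-8, a reader script may need to handle either encoding. The following is a minimal sketch, not part of the corpus tooling. It assumes that "GB" refers to a GB2312-family encoding, which Python's `gb18030` codec can decode as a superset; the function names `decode_sgm` and `read_sgm` are hypothetical.

```python
from pathlib import Path
from typing import Tuple

# Assumption: "GB" means a GB2312-family encoding. gb18030 is used here
# because it is a superset that also decodes GB2312 and GBK bytes.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_sgm(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("bytes are neither valid UTF-8 nor GB-family text")


def read_sgm(path) -> Tuple[str, str]:
    """Read a source file from disk and decode it with decode_sgm."""
    return decode_sgm(Path(path).read_bytes())
```

UTF-8 is tried first because GB-encoded Chinese text is very unlikely to form valid UTF-8 byte sequences, whereas the reverse check is much less reliable. If the directory layout of the release already separates the two encodings, decoding with the known encoding directly is the safer choice.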
Updates
There are no updates available at this time.
Content Copyright
© 2000 American Broadcasting Corporation
© 2000 Cable News Network, Inc.
© 2000 Press Association, Inc.
© 2000 New York Times
© 2000 National Broadcasting Company, Inc.
© 2000 Public Radio International
© 2000 Agence France-Presse
© 2000 Al Hayat
© 2000 An-Nahar
© 2000 Nile TV
© 2000 Xinhua News
© 2000 SPH AsiaOne Ltd.
© 2000 China National Radio
© 2000 China Television System
© 2000 China TV Program Agency
© 2000 China Broadcasting System
"The World" is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.