Introduction
2009 CoNLL Shared Task Part 2, LDC Catalog Number LDC2012T04
and ISBN 1-58563-611-8, contains the Chinese and English trial corpora, training corpora,
development and test data for the 2009
CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation.
The 2009 Shared Task developed syntactic dependency annotations together with
semantic dependencies that model the roles of both verbal and nominal predicates.
The Conference on
Computational Natural Language Learning (CoNLL) is accompanied every year
by a shared task intended to promote natural language processing applications
and evaluate them in a standard setting. The 2004 and 2005 CoNLL shared tasks
were dedicated to semantic role labeling (SRL) in a monolingual setting (English).
In 2006 and 2007, the shared tasks were devoted to the parsing of syntactic
dependencies and used corpora from up to thirteen languages. In 2008, the shared
task focused on English, employed a unified dependency-based formalism, and
merged syntactic dependency parsing with the task of identifying
semantic arguments and labeling them with semantic roles; that data has been
released by LDC as 2008 CoNLL Shared Task Data
(LDC2009T12).
The 2009 task extended the 2008 task
to several languages (English plus Catalan, Chinese, Czech, German, Japanese
and Spanish). New features included a comparison of time and space complexity
based on participants' input and a learning curve comparison for languages with
large datasets.
The 2009 shared task was divided into two subtasks:
- parsing syntactic dependencies
- identification of arguments and assignment of semantic roles for each
predicate
2009 CoNLL Shared Task Part 1
(LDC2012T03)
contains the Catalan, Czech, German and Spanish task data and is also available through LDC.
Data
The materials in this release consist of excerpts from the following corpora:
- Penn Treebank II (LDC95T7)
(English): over one million words of annotated English newswire and other
text developed by the University of Pennsylvania
- PropBank
(LDC2004T14)
(English): semantic annotation of newswire text from Treebank-2 developed
by the University of Pennsylvania
- NomBank (LDC2008T23)
(English): argument structure for instances of common nouns in Treebank-2
and Treebank-3
(LDC99T42) texts developed by New York University
- Chinese Treebank 6.0
(LDC2007T36)
(Chinese): 780,000 words (over 1.28 million characters) of annotated Chinese
newswire, magazine and administrative texts and transcripts from various broadcast
news programs developed by the University of Pennsylvania and the University
of Colorado
- Chinese Proposition Bank
2.0 (LDC2008T07)
(Chinese): predicate-argument annotation on 500,000 words from Chinese Treebank
6.0 developed by the University of Pennsylvania and the University of Colorado
In addition, an archive of all of the uploaded data from the participants is
included in the eval-data folder. Users should note that not all data referenced
in the individual READMEs is included in this release, nor are some of
the corresponding DTDs for the XML files. Additionally, all data is presented in
its uncompressed form for ease of use. Within the user eval-data folder, the
two folders marked "bad" contain references to data from languages
included in Part 1 of this release as well as to Japanese data. Japanese data
is not included in this release.
Updates
None at this time.
Content Copyright
Portions © 2000-2001 China Broadcasting System, © 2000-2001 China
Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television
System, © 1987-1989 Dow Jones & Company, Inc., © 1997 The Government
of the Hong Kong Special Administrative Region, © 1996-2001 Sinorama Magazine,
© 1994-1998 Xinhua News Agency, © 1995, 1999, 2001, 2004, 2005, 2007,
2008, 2012 Trustees of the University of Pennsylvania