Introduction
The Unified Linguistic Annotation Text Collection, Linguistic Data Consortium (LDC) catalog number LDC2009T07 and ISBN 1-58563-511-1, consists of two separate corpora: the Language Understanding Annotation Corpus (LDC2009T10) and REFLEX Entity Translation Training/DevTest (LDC2009T11).
Most recent annotation efforts for language have focused on small pieces of
the larger problem of semantic annotation rather than producing a single unified
representation. The Unified Linguistic Annotation (ULA) project, sponsored by the National
Science Foundation, seeks to integrate into one framework different layers
of annotation (e.g., semantics, discourse, temporal, opinions) using various
existing resources, including PropBank, NomBank, TimeBank, the Penn Discourse
Treebank, and coreference and opinion annotations. The project
represents a concerted effort of researchers from several institutions to develop
a large word corpus with balanced and annotated data.
The ULA Text Collection is provided as a resource for the ULA effort. It consists
of two datasets, the Language Understanding Annotation Corpus from the Johns
Hopkins Center of Excellence in Human Language Technology and ACE
REFLEX Entity Translation Training/DevTest, developed by LDC.
The Language Understanding Annotation Corpus (LDC2009T10). The Language Understanding Annotation Corpus
consists of over 9000 words of English text (6949 words) and Arabic text (2183
words) annotated for committed belief, event and entity coreference, dialog acts
and temporal relations. The materials were chosen from various sources to represent "informal input," that is, text that contains colloquial forms. The documents in the corpus include excerpts from newswire stories, telephone conversation transcripts, emails, contracts and written instructions.
REFLEX Entity Translation Training/DevTest (LDC2009T11). REFLEX Entity Translation Training/DevTest is the complete set
of training data and development test data for the 2007 REFLEX Entity Translation evaluation sponsored by the National Institute of Standards and Technology (NIST). It contains approximately 67.5k words of newswire and weblog text for each of English, Chinese and Arabic (or approximately 22.5k words in each language) translated into each of the other two languages. The data is annotated for entities and TIMEX2 extents and normalization.
Content Copyright
Portions © 1998-2000, 2003 Agence France Presse, © 2000 Al Hayat,
© 2000, 2003 The Associated Press, © 2000, 2002 An Nahar, © 2003,
2005 Cable News Network, LP, LLLP, © 1987-1989 Dow Jones & Company,
Inc., © 2003 Indiana Center for Intercultural Communication, © 2000
New York Times, © 1994-1998, 2000-2003 Xinhua News Agency, © 1992-2009
Trustees of the University of Pennsylvania