Language Understanding Annotation Corpus
| Item Name: | Language Understanding Annotation Corpus |
| Authors: | Mona Diab, Bonnie Dorr, Lori Levin, Teruko Mitamura, Rebecca Passonneau, Owen Rambow, Lance Ramshaw |
| LDC Catalog No.: | LDC2009T10 |
| ISBN: | 1-58563-513-8 |
| Release Date: | Mar 17, 2009 |
| Data Type: | text |
| Data Source(s): | broadcast conversation, broadcast news, email, newswire, telephone speech, varied |
| Application(s): | pragmatics |
| Language(s): | Arabic, English |
| Language ID(s): | arb, eng |
| Distribution: | Web Download |
| Member fee: | $0 for 2009 members |
| Non-member Fee: | US $0.00 |
| Reduced-License Fee: | US $0.00 |
| Extra-Copy Fee: | N/A |
| Non-member License: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Mona Diab, et al. 2009. Language Understanding Annotation Corpus. Linguistic Data Consortium, Philadelphia. |
Introduction
The Language Understanding Annotation Corpus, Linguistic Data Consortium (LDC)
catalog number LDC2009T10 and ISBN 1-58563-513-8, emerged from a series of interdisciplinary
meetings on semantics and pragmatics hosted by the Human
Language Technology Center of Excellence at Johns Hopkins University. The
participants were researchers from BBN Technologies, Carnegie
Mellon University and Columbia University
who were developing representations of text semantics, machine translation and
summarization systems. The resulting corpus contains over 9000 words of text,
6949 in English and 2183 in Arabic, annotated for committed belief,
event and entity coreference, dialog acts and temporal relations. The source
materials were chosen from various genres to represent "informal input,"
that is, text that contains colloquial forms. The documents in the corpus include
excerpts from newswire stories, telephone conversation transcripts, emails,
contracts and written instructions.
The problem was modeled as an extended exercise in extracting information
elements from a "document" (that is, from discrete language records
in written or spoken forms). The goal was to answer two broad questions:
- What are the elements of knowledge that can be derived from a document?
- Can the representation, and hence, the annotation, be laid out in terms
of iterative layers, the accumulation of which would represent the sum of
the knowledge?
The annotations attempted to resolve these questions in the following ways:
- Belief/Opinion/Confidence. Committed belief annotation distinguishes between
statements which assert belief or opinion, those which contain speculation,
and statements which convey facts or otherwise do not convey belief. The goal
is to be able to determine automatically from a given text what beliefs can
be ascribed to the author and with what strength the author holds those beliefs.
- Dialog Acts. Dialog act annotation seeks to determine the forward and backward
links between pairs of dialog acts.
- Coreference (entities and events). Event coreferences indicate which events
are related to other events at the document level. Entity relations within
these related events provide further information about, for example, the main actors,
targets and causes of the events.
- Temporal relations. Temporal annotations mark the temporal relationship
between the different events and time anchors mentioned in a document; that
is, they highlight what the text says about the timeline of the events and times it mentions.
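The idea of layered annotation described above can be illustrated with a minimal Python sketch. It is not the corpus's actual file format or schema: the class names (`Span`, `Annotation`, `Document`), the layer names, the labels and the example sentence are all hypothetical, chosen only to show how independent annotation layers over the same tokens can accumulate into a single body of knowledge about a document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A token range: start is inclusive, end is exclusive (hypothetical convention)."""
    start: int
    end: int


@dataclass
class Annotation:
    """One annotation in a named layer.

    A single span models a tag on a stretch of text; two or more spans
    model a link or relation (for example coreference or temporal order).
    """
    layer: str
    label: str
    spans: tuple


@dataclass
class Document:
    """Tokens plus any number of annotation layers stored side by side."""
    tokens: list
    layers: dict = field(default_factory=dict)

    def add(self, ann):
        # Each layer is kept separately, so layers can be added one at a time.
        self.layers.setdefault(ann.layer, []).append(ann)

    def text(self, span):
        return " ".join(self.tokens[span.start:span.end])

    def knowledge(self):
        # The accumulated annotations of every layer, flattened into one list.
        return [ann for anns in self.layers.values() for ann in anns]


# Hypothetical example sentence and labels, for illustration only.
doc = Document(tokens="I think the meeting ended before noon".split())
ended = Span(4, 5)
noon = Span(6, 7)

doc.add(Annotation(layer="belief", label="speculation", spans=(ended,)))
doc.add(Annotation(layer="temporal", label="before", spans=(ended, noon)))

for ann in doc.knowledge():
    covered = [doc.text(s) for s in ann.spans]
    print(ann.layer, ann.label, covered)
```

Keeping each layer as a separate list over shared token offsets means a new layer, such as dialog acts, can be added without touching the existing ones, which is one plausible reading of "iterative layers" in the questions above.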
Content Copyright
Portions © 2000, 2002 Agence France Presse, © 2000 Al Hayat,
© 2000 The Associated Press, © 2003, 2005 Cable News Network, LP, LLLP,
© 1987-1989 Dow Jones & Company, Inc., © 2003 Indiana Center for Intercultural
Communication, © 2000 New York Times, © 2000 Xinhua News Agency,
© 1992, 1993, 1997, 2009 Trustees of the University of Pennsylvania