Introduction
GALE Phase 1 Distillation Training, Linguistic
Data Consortium (LDC) catalog number LDC2007T20 and ISBN 1-58563-452-2,
constitutes the final release of training
data created by LDC for the DARPA GALE Program Phase 1 Distillation technology
evaluation. Distillation is one of three primary technology components for the
DARPA GALE Program, along with Transcription and Translation. Distillation engines
respond to queries from English-speaking users, delivering pertinent, consolidated
information in easy-to-understand forms. The distillation engine processes English
and foreign language material, both speech and text, from multiple sources and
documents, removing redundancy and presenting an integrated response to the
user.
This release consists of 248 English, Chinese and/or Arabic queries and
their responses, created by LDC annotators. Queries conform to one of ten
template types. Query responses may include document and snippet relevance
judgments, nuggets, nugs and supernugs. Of the 248 queries, 158 have been
annotated for all features, while the remainder are labeled for only some
features. In addition, because of resource constraints during corpus
development, not all queries have been exhaustively annotated for a given feature.
The table below indicates the number of queries that have been labeled for
each template in each source language.
| | English | Chinese | Arabic |
| Template 1 | 15/28 | 9/17 | 12/16 |
| Template 3 | 16/29 | 9/29 | 13/29 |
| Template 4 | 15/23 | 7/18 | 11/18 |
| Template 5 | 21/39 | 10/39 | 20/36 |
| Template 6 | 15/20 | 7/19 | 7/20 |
| Template 8 | 12/14 | 6/13 | 5/14 |
| Template 9 | 14/23 | 7/21 | 10/21 |
| Template 11 | 11/22 | 8/15 | 2/14 |
| Template 15 | 12/21 | 8/11 | 5/11 |
| Template 16 | 13/24 | 10/12 | 8/12 |
| Total | 144/243 | 81/194 | 93/191 |
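As a quick consistency check, the Total row can be recomputed from the per-template rows. The Python sketch below is not part of any corpus tooling. It copies the cell values from the table above and sums each side of the slash separately. What the two numbers in each cell denote is not spelled out here, so the code treats each cell as an opaque pair.

```python
# Per-template cells copied from the table above, as (first, second) pairs.
# The meaning of each side of the slash is not assumed; values are summed as-is.
ROWS = {
    "Template 1": {"English": (15, 28), "Chinese": (9, 17), "Arabic": (12, 16)},
    "Template 3": {"English": (16, 29), "Chinese": (9, 29), "Arabic": (13, 29)},
    "Template 4": {"English": (15, 23), "Chinese": (7, 18), "Arabic": (11, 18)},
    "Template 5": {"English": (21, 39), "Chinese": (10, 39), "Arabic": (20, 36)},
    "Template 6": {"English": (15, 20), "Chinese": (7, 19), "Arabic": (7, 20)},
    "Template 8": {"English": (12, 14), "Chinese": (6, 13), "Arabic": (5, 14)},
    "Template 9": {"English": (14, 23), "Chinese": (7, 21), "Arabic": (10, 21)},
    "Template 11": {"English": (11, 22), "Chinese": (8, 15), "Arabic": (2, 14)},
    "Template 15": {"English": (12, 21), "Chinese": (8, 11), "Arabic": (5, 11)},
    "Template 16": {"English": (13, 24), "Chinese": (10, 12), "Arabic": (8, 12)},
}

# Values from the Total row of the table.
STATED_TOTALS = {"English": (144, 243), "Chinese": (81, 194), "Arabic": (93, 191)}


def column_totals(rows):
    """Sum both numbers of every cell, per language column."""
    totals = {}
    for cells in rows.values():
        for lang, (first, second) in cells.items():
            prev_first, prev_second = totals.get(lang, (0, 0))
            totals[lang] = (prev_first + first, prev_second + second)
    return totals


if __name__ == "__main__":
    for lang, total in column_totals(ROWS).items():
        status = "matches" if total == STATED_TOTALS[lang] else "differs from"
        print(f"{lang}: {total[0]}/{total[1]} {status} the Total row")
```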
Annotation
The annotation task involves responding to a series of user queries. For
each query, annotators first find relevant documents and identify snippets
(strings of contiguous text that answer the query) in the Arabic, Chinese
or English source document. Annotators then create a nugget for each fact
expressed in the snippet. Semantically equivalent nuggets are grouped into
cross-language, cross-document "supernugs". Finally, judges at BAE Systems
provide relevance weights for each supernug.
Queries in this release have been annotated for the following tasks:
- searching for relevant documents and providing yes/no judgments
- extracting snippets
- resolving pronouns and certain types of temporal and locative expressions contained in the snippets
- creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query
- building nugs, i.e. clusters of semantically equivalent nuggets for each language
- building supernugs, i.e. clusters of semantically equivalent nugs across languages
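The nugget, nug and supernug hierarchy listed above can be pictured as nested containers. The following Python sketch is illustrative only: the class names, field names and grouping function are assumptions made for this example and do not reflect the corpus's actual file format or annotation tools. In particular, the hypothetical `meaning_key` string stands in for the annotator's human judgment of semantic equivalence.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Nugget:
    """An atomic piece of information drawn from a snippet."""
    text: str
    language: str
    doc_id: str
    meaning_key: str  # hypothetical stand-in for a human equivalence judgment


@dataclass
class Nug:
    """Semantically equivalent nuggets within one language."""
    language: str
    nuggets: List[Nugget] = field(default_factory=list)


@dataclass
class Supernug:
    """Semantically equivalent nugs across languages."""
    nugs: List[Nug] = field(default_factory=list)
    weight: Optional[float] = None  # relevance weight, assigned later by judges


def build_supernugs(nuggets):
    """Group nuggets into per-language nugs, then nugs into supernugs."""
    by_meaning = defaultdict(dict)  # meaning_key -> {language: Nug}
    for nugget in nuggets:
        nugs = by_meaning[nugget.meaning_key]
        if nugget.language not in nugs:
            nugs[nugget.language] = Nug(nugget.language)
        nugs[nugget.language].nuggets.append(nugget)
    return [Supernug(nugs=list(nugs.values())) for nugs in by_meaning.values()]
```

In this sketch a supernug holds at most one nug per language, and its `weight` stays unset until a judge supplies one, mirroring the order of the annotation steps.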
Samples
For an example of the data contained in this corpus, please review this sample.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 2003 Agence France Presse, © 2000, 2001 American Broadcasting
Company, © 2000, 2001, 2003 The Associated Press, © 2000, 2001 Cable
News Network, LP, LLLP, © 2003 Los Angeles Times-Washington Post News Service,
Inc., © 2000 National Broadcasting Company, Inc., © 2000, 2001 New
York Times, © 2000, 2001 Public Radio International, © 2000 SPH AsiaOne
Ltd, © 2003 Ummah Press Service, © 2003 Xinhua News Agency, ©
2006, 2007 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the British
Broadcasting Corporation and is produced at WGBH Boston.