Introduction
Annotated English Gigaword was developed by Johns
Hopkins University's Human Language Technology Center of Excellence. It
adds automatically generated syntactic and discourse structure annotation to
English Gigaword Fifth Edition (LDC2011T07)
and also contains an API and tools for reading the dataset's XML files. The
goal of the annotation is to provide a standardized corpus for knowledge extraction
and distributional semantics, enabling more researchers to take part in
large-scale knowledge-acquisition efforts.
Data
Annotated English Gigaword contains the nearly ten million documents (over
four billion words) of the original English Gigaword Fifth Edition from seven
news sources:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng)
- Washington Post/Bloomberg Newswire Service (wpb_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
The following layers of annotation were added:
- Tokenized and segmented sentences
- Treebank-style constituent parse trees
- Syntactic dependency trees
- Named entities
- In-document coreference chains
The annotation was performed in a three-step process: (1) the data was preprocessed
and sentences were selected for annotation (sentences with more than 100 tokens
were excluded); (2) syntactic parses were derived; and (3) the parsed output was
post-processed to derive syntactic dependencies, named entities and coreference
chains. Over 183 million sentences were parsed.
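The sentence selection in step (1) can be illustrated with a minimal Python sketch. It is not the code used to build the corpus; the function name and the list-of-tokens representation are assumptions made for illustration, and only the 100-token threshold comes from the description above.

```python
# Illustrative sketch only: the function name and data layout are hypothetical.
MAX_TOKENS = 100  # threshold stated in the corpus description


def select_for_annotation(sentences, max_tokens=MAX_TOKENS):
    """Keep tokenized sentences that have at most max_tokens tokens.

    Each sentence is assumed to be a list of token strings.
    """
    return [tokens for tokens in sentences if len(tokens) <= max_tokens]


short_sentence = ["The", "market", "rose", "."]
long_sentence = ["word"] * 101  # one token over the limit, so it is dropped

kept = select_for_annotation([short_sentence, long_sentence])
```

A sentence of exactly 100 tokens is kept under this reading, since only sentences with more than 100 tokens were excluded.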
The data is stored in a form similar to the Gigaword SGML format, with XML annotations
containing the additional markup. The included API provides object representations
for the contents of the XML files.
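As a rough idea of what reading such a file involves, the sketch below parses a small XML fragment with Python's standard library. The element and attribute names (DOC, sentence, token, word) and the document ID are assumptions chosen for illustration and may not match the actual schema; the included API remains the supported way to read the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names and the document ID are illustrative
# assumptions, not taken from the corpus documentation.
SAMPLE = """<DOC id="EXAMPLE_ENG_20100101.0001" type="story">
  <sentences>
    <sentence id="1">
      <tokens>
        <token id="1"><word>Stocks</word></token>
        <token id="2"><word>rose</word></token>
        <token id="3"><word>.</word></token>
      </tokens>
    </sentence>
  </sentences>
</DOC>"""


def read_sentences(xml_text):
    """Return the document ID and a list of sentences, each a list of words."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        words = [token.findtext("word") for token in sentence.iter("token")]
        sentences.append(words)
    return root.get("id"), sentences


doc_id, sentences = read_sentences(SAMPLE)
```

Because the corpus is very large, a streaming approach such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading whole files into memory.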
Samples
Please follow the link for a
sample.
Additional Licensing Information
Any 2011 member organization that licensed English Gigaword Fifth Edition
(LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition
may request a copy of Annotated English Gigaword for a $200 media fee. Please contact
ldc@ldc.upenn.edu for licensing or
with any additional questions.
Updates
None at this time.
Content Copyright
Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated
Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009
Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York
Times, © 2010 The Washington Post News Service with Bloomberg News, ©
1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees
of the University of Pennsylvania