Introduction
This release of Spanish newswire contains data from the following
sources:
Agence France Presse (January 13, 1996--December 13, 1998)
Associated Press Worldstream (December 1, 1995--August 31, 1998)
El Norte (January 1, 1997--December 31, 1998)
Data
The consistent format chosen for release consists of SGML tagging and
the ISO-8859-1 (Latin1) 8-bit character set. Our general strategy for
SGML tagging is as follows:
All document units (articles) are bounded by the tags <DOC> and
</DOC>, and within these units, the text content of each article is
bounded by <TEXT> and </TEXT>. Following each <DOC> tag is a <DOCID>
tag that provides a unique identifying string for that article. Other
tags within the DOC unit (but external to TEXT) provide additional
information that was received with the article (e.g. headline,
dateline, byline, keywords, etc.), but the inventory and nature of
additional information varies from one source to the next (and in some
cases, from one article to the next), and this variability is
reflected in the SGML tags that are used to preserve the information.
Within the TEXT units, tagging is kept to a minimum, typically
consisting only of paragraph tags.
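
The layout described above can be read with a few lines of Python. The
following is only a minimal sketch, not a tool shipped with this release.
It assumes that the tags appear in the data as <DOC>, <DOCID>, and <TEXT>,
that paragraphs are marked with a <P> tag, and that the identifying string
is the first whitespace-delimited token after <DOCID>. The function names
iter_docs and read_docs are hypothetical. Source-specific tags outside
<TEXT> (headline, byline, and so on) are ignored here because they vary
from one source to the next.

```python
import re

# Assumed tag spellings; adjust if a given source differs.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCID_RE = re.compile(r"<DOCID>\s*([^<\s]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
P_RE = re.compile(r"</?P>", re.IGNORECASE)  # paragraph tag name is an assumption


def iter_docs(sgml):
    """Yield one dict per DOC unit found in an SGML string."""
    for match in DOC_RE.finditer(sgml):
        unit = match.group(1)
        docid = DOCID_RE.search(unit)
        text = TEXT_RE.search(unit)
        body = text.group(1) if text else ""
        # Split on paragraph tags, then collapse line breaks and extra spaces.
        paragraphs = [" ".join(part.split()) for part in P_RE.split(body)]
        yield {
            "docid": docid.group(1) if docid else None,
            "paragraphs": [p for p in paragraphs if p],
        }


def read_docs(path):
    """Read one data file as ISO-8859-1 (Latin1) and yield its DOC units."""
    with open(path, encoding="iso-8859-1") as handle:
        yield from iter_docs(handle.read())
```

Decoding as ISO-8859-1 up front means accented Spanish characters arrive
as ordinary Python strings, so no further conversion is needed before the
text is processed.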
Updates
There are no updates at this time.