Introduction
This file contains documentation on the Spanish Gigaword First Edition, Linguistic
Data Consortium (LDC) catalog number LDC2006T12 and ISBN 1-58563-393-3.
The Spanish Gigaword Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania. This is the
first edition of the Spanish Gigaword Corpus, though some of the data
included here has been released previously in other LDC corpora.
The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as
follows:
- Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2005
- Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2005
- Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2005
The seven-character codes in the parentheses above combine a
three-character source name abbreviation and the three-character
language code ("spa"), separated by an underscore ("_") character. The
three-letter language code conforms to LDC's new internal convention
based on the new ISO 639-3 standard.
The seven-character codes are used both in the names of the directories
where the data files are found and in the prefix that appears at the
beginning of every data file name. They are also used (in all UPPER
CASE) as the initial portion of the DOC "id" strings that uniquely
identify each news story.
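The naming convention described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes nothing about the part of a file name or DOC "id" that follows the code, since that part is not described here.

```python
import re

# Three-character source abbreviation, underscore, three-character
# language code (for example "afp_spa"), as described above.
CODE_RE = re.compile(r"^([a-z]{3})_([a-z]{3})$")

def split_code(code):
    """Split a code such as 'afp_spa' or 'AFP_SPA' into (source, language)."""
    match = CODE_RE.match(code.lower())
    if match is None:
        raise ValueError(f"not a source_language code: {code!r}")
    return match.group(1), match.group(2)

def code_for_doc_id(doc_id):
    """Return the lower-case code that a DOC "id" string starts with.

    Only the leading seven characters are examined; the remainder of
    the id is ignored because its layout is an unknown here.
    """
    prefix = doc_id[:7]
    split_code(prefix)  # raises ValueError if the prefix is malformed
    return prefix.lower()
```

For example, split_code("xin_spa") yields the source "xin" and the language "spa", and the lower-case result of code_for_doc_id matches the directory name in which the corresponding data file would be found.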
Data
The overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. approximately 5 gigabytes, total); the "Gzip-MB"
column shows totals for compressed file sizes as stored on the
DVD-ROM; the "K-wrds" numbers are simply the number of
whitespace-separated tokens (of all types) after all SGML tags are
eliminated.
| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
| AFP_SPA | 139 | 926 | 2731 | 393354 | 1382679 |
| APW_SPA | 144 | 600 | 1806 | 263225 | 886998 |
| XIN_SPA | 52 | 212 | 648 | 94459 | 388561 |
| TOTAL | 335 | 1738 | 5185 | 751038 | 2658238 |
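As a quick consistency check on the summary table, the short Python sketch below re-adds the per-source rows and compares them with the TOTAL row. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Rows copied from the summary table:
# (#Files, Gzip-MB, Totl-MB, K-wrds, #DOCs)
SOURCES = {
    "AFP_SPA": (139, 926, 2731, 393354, 1382679),
    "APW_SPA": (144, 600, 1806, 263225, 886998),
    "XIN_SPA": (52, 212, 648, 94459, 388561),
}
TOTAL = (335, 1738, 5185, 751038, 2658238)

def column_sums(rows):
    """Add up each column across all source rows."""
    return tuple(sum(column) for column in zip(*rows.values()))

def compression_ratio(row):
    """Uncompressed size divided by compressed size for one row."""
    gzip_mb, total_mb = row[1], row[2]
    return total_mb / gzip_mb

# The per-source rows add up to the TOTAL row in every column.
assert column_sums(SOURCES) == TOTAL
```

Calling compression_ratio(TOTAL) shows that the uncompressed data is roughly three times the size of the compressed files.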
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.
| Source | Text-MB | K-wrds | #DOCs |
| type="advis": |
| AFP_SPA | 40 | 15505 | 40580 |
| APW_SPA | 11 | 6173 | 11112 |
| XIN_SPA | 0 | 0 | 0 |
| TOTAL | 51 | 21678 | 51692 |
| type="multi": |
| AFP_SPA | 12 | 10282 | 12514 |
| APW_SPA | 30 | 12519 | 30892 |
| XIN_SPA | 32 | 17773 | 32463 |
| TOTAL | 74 | 40574 | 75869 |
| type="other": |
| AFP_SPA | 126 | 28305 | 126530 |
| APW_SPA | 153 | 39038 | 153932 |
| XIN_SPA | 26 | 3325 | 26828 |
| TOTAL | 305 | 70668 | 307290 |
| type="story": |
| AFP_SPA | 2166 | 339271 | 1202785 |
| APW_SPA | 1287 | 205501 | 691062 |
| XIN_SPA | 463 | 73360 | 329270 |
| TOTAL | 3916 | 618132 | 2223117 |
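The definitions of "K-wrds" (whitespace-separated tokens after SGML tags are eliminated) and "Text-MB" (characters, including whitespace, after tags are eliminated) can be approximated with the Python sketch below. It is not the procedure used to produce the tables. It assumes that each story is wrapped in a DOC element carrying a type attribute, that any text between angle brackets is a tag, and that characters are counted as Unicode code points rather than bytes; the function name and the sample ids are hypothetical.

```python
import re
from collections import defaultdict

# Assumed markup: <DOC id="..." type="..."> ... </DOC>
DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
TYPE_RE = re.compile(r'type="([^"]*)"')
TAG_RE = re.compile(r"<[^>]+>")

def tally_by_type(sgml):
    """Count documents, tokens and characters for each DOC type."""
    stats = defaultdict(lambda: {"docs": 0, "words": 0, "chars": 0})
    for attrs, body in DOC_RE.findall(sgml):
        type_match = TYPE_RE.search(attrs)
        doc_type = type_match.group(1) if type_match else "unknown"
        text = TAG_RE.sub("", body)          # eliminate SGML tags
        entry = stats[doc_type]
        entry["docs"] += 1
        entry["words"] += len(text.split())  # whitespace-separated tokens
        entry["chars"] += len(text)          # whitespace is included
    return dict(stats)

# Hypothetical two-document sample, not taken from the corpus.
sample = (
    '<DOC id="AFP_SPA_X1" type="multi">\n'
    "<TEXT>\n<P>\nHola mundo.\n</P>\n</TEXT>\n</DOC>\n"
    '<DOC id="AFP_SPA_X2" type="advis">\n'
    "<TEXT>uno dos tres</TEXT>\n</DOC>\n"
)
result = tally_by_type(sample)
```

Dividing the "words" count by 1,000 gives a figure comparable to "K-wrds". Counts produced this way may differ from the published ones if the original tooling treated entities, encodings, or the DOC tags themselves differently.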
Samples
For an example of the data in this publication, please examine this sample file.
Content Copyright
Portions © 1994-2005 Agence France Presse, © 1993-2005 The Associated Press, © 2001-2005 Xinhua News Agency, © 2006 Trustees of the University of Pennsylvania