Introduction
Spanish Gigaword Second Edition is a comprehensive archive of newswire text
data that LDC has acquired over several years. This second edition updates
Spanish Gigaword First Edition (LDC2006T12) and adds data collected from
January 1, 2006 through December 31, 2008.
The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as
follows:
- Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
- Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
- Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008
The seven-character codes in the parentheses above consist of a three-character
source name abbreviation and the three-character language code ("spa"), separated
by an underscore ("_") character. The three-letter language code conforms to
LDC's internal convention based on the ISO 639-3 standard. These codes are used
in the directory names where the data files are found and in the prefix that
appears at the beginning of every data file name. They are also used (in all
UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify
each news story.
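As a minimal sketch of the naming convention just described, the Python below splits a source code into its two parts and derives the upper-case form used at the start of DOC "id" strings. The function names are hypothetical, and nothing here assumes the layout of the rest of a file name or "id" string, which this description does not specify.

```python
from __future__ import annotations


def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_spa' into (source, language)."""
    source, sep, language = code.partition("_")
    # Both parts are three characters long, joined by a single underscore.
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return source, language


def doc_id_prefix(code: str) -> str:
    """Return the upper-case form that begins each DOC 'id' string."""
    split_source_code(code)  # validate the shape first
    return code.upper()


for code in ("afp_spa", "apw_spa", "xin_spa"):
    print(code, split_source_code(code), doc_id_prefix(code))
```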
Data
The overall totals for each source are summarized below. Note that the "Totl-MB"
numbers show the amount of data obtained when the files are uncompressed (i.e.
approximately 7 gigabytes, total); the "Gzip-MB" column shows totals for compressed
file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number
of whitespace-separated tokens (of all types) after all SGML tags are eliminated.
| Source  | #Files | Gzip-MB | Totl-MB | K-wrds  | #DOCs   |
|---------|--------|---------|---------|---------|---------|
| AFP_SPA | 175    | 1182    | 3512    | 506562  | 1748787 |
| APW_SPA | 180    | 886     | 2721    | 402718  | 1244811 |
| XIN_SPA | 88     | 405     | 1238    | 182543  | 734356  |
| TOTAL   | 443    | 2453    | 7471    | 1091823 | 3727954 |
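The "K-wrds" definition (whitespace-separated tokens after SGML tags are eliminated) can be approximated with a short Python sketch. This is not LDC's actual counting tool: the regular expression is a simplified notion of a tag, replacing each tag with a space rather than an empty string is an assumption, and the sample markup is purely illustrative.

```python
import re

# Simplified assumption: a tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens once SGML tags are removed."""
    # Tags become a space so that text on either side is not fused together.
    without_tags = TAG_RE.sub(" ", sgml_text)
    return len(without_tags.split())


# Illustrative markup only; not copied from the corpus.
sample = '<DOC id="X" type="story"><TEXT><P>Hola mundo.</P></TEXT></DOC>'
print(count_tokens(sample))
```

Dividing such a count by 1000 would give a figure comparable to the "K-wrds" column.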
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by
source and DOC type; "Text-MB" represents the total number of characters (including
whitespace) after SGML tags are eliminated.
type="advis":

| Source  | Text-MB | K-wrds | #DOCs |
|---------|---------|--------|-------|
| AFP_SPA | 144     | 20520  | 45446 |
| APW_SPA | 41      | 6173   | 11112 |
| XIN_SPA | 0       | 0      | 0     |
| TOTAL   | 185     | 26693  | 56558 |

type="multi":

| Source  | Text-MB | K-wrds | #DOCs  |
|---------|---------|--------|--------|
| AFP_SPA | 84      | 12711  | 15346  |
| APW_SPA | 351     | 55758  | 107224 |
| XIN_SPA | 189     | 29970  | 56372  |
| TOTAL   | 624     | 98439  | 178942 |

type="other":

| Source  | Text-MB | K-wrds | #DOCs  |
|---------|---------|--------|--------|
| AFP_SPA | 275     | 38665  | 160815 |
| APW_SPA | 296     | 40517  | 162448 |
| XIN_SPA | 44      | 6376   | 50168  |
| TOTAL   | 615     | 85558  | 373431 |

type="story":

| Source  | Text-MB | K-wrds | #DOCs   |
|---------|---------|--------|---------|
| AFP_SPA | 2771    | 434677 | 1527180 |
| APW_SPA | 1875    | 300274 | 964027  |
| XIN_SPA | 911     | 146199 | 627816  |
| TOTAL   | 5557    | 881150 | 3119023 |
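A per-type breakdown like the one above could be tallied along the following lines. This Python sketch rests on assumptions: that each story is wrapped in a DOC element carrying a type attribute written as type="...", and that a closing DOC tag ends it. The sample markup and the function name are hypothetical. Characters are counted after deleting tags outright, including whitespace, to mirror the "Text-MB" definition; the conversion from characters to MB is left out because the unit is not defined here.

```python
import re
from collections import Counter

# Assumed markup: <DOC ... type="story" ...> ... </DOC>
DOC_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>(.*?)</DOC>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def tally_by_type(sgml_text: str):
    """Return (#DOCs per type, character count per type)."""
    docs = Counter()
    chars = Counter()
    for doc_type, body in DOC_RE.findall(sgml_text):
        docs[doc_type] += 1
        # Tags are deleted outright; remaining whitespace is counted.
        chars[doc_type] += len(TAG_RE.sub("", body))
    return docs, chars


# Illustrative markup only; not copied from the corpus.
sample = (
    '<DOC id="A" type="story"><TEXT>Hola mundo.</TEXT></DOC>'
    '<DOC id="B" type="advis"><TEXT>Aviso</TEXT></DOC>'
)
docs, chars = tally_by_type(sample)
print(dict(docs), dict(chars))
```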
Samples
Portions © 1994-2008 Agence France Presse, © 1993-2008 The Associated
Press, © 2001-2008 Xinhua News Agency, © 2006, 2009 Trustees of the
University of Pennsylvania