Spanish Gigaword Second Edition

Item Name: Spanish Gigaword Second Edition
Authors: Angelo Mendonca, David Graff, Denise DiPersio
LDC Catalog No.: LDC2009T21
ISBN: 1-58563-518-9
Release Date: Jul 17, 2009
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): Spanish
Language ID(s): spa
Distribution: 1 DVD
Member fee: $0 for 2009 members
Non-member Fee: US$4000.00
Reduced-License Fee: N/A
Extra-Copy Fee: US$200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Angelo Mendonca, David Graff, Denise DiPersio. 2009. Spanish Gigaword Second Edition. Linguistic Data Consortium, Philadelphia.

Introduction

Spanish Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates Spanish Gigaword First Edition (LDC2006T12) and adds data collected from January 1, 2006 through December 31, 2008.

The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

  • Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
  • Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
  • Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008

Each seven-character code in parentheses above consists of a three-character source name abbreviation and the three-character language code ("spa"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the names of the directories where the data files are found and as the prefix of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC "id" strings that uniquely identify each news story.
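The naming convention described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the sample DOC "id" used in the usage note is an assumed shape, since this page only states that the id begins with the upper-case code.

```python
def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_spa' into (source, language)."""
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a source_language code: {code!r}")
    return source, language


def doc_id_prefix(code: str) -> str:
    """Return the upper-case form used at the start of DOC ids."""
    return code.upper()


def source_code_of_doc_id(doc_id: str) -> str:
    """Recover the lower-case code from the first seven characters of a DOC id."""
    code = doc_id[:7].lower()
    split_source_code(code)  # raises ValueError if the prefix is malformed
    return code
```

For example, `split_source_code("afp_spa")` returns `("afp", "spa")`, and `source_code_of_doc_id("APW_SPA_19931101.0001")` returns `"apw_spa"`; everything after the seven-character prefix in that sample id is an assumption made for illustration.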

Data

The overall totals for each source are summarized below. "Totl-MB" is the amount of data obtained when the files are uncompressed (approximately 7 gigabytes in total); "Gzip-MB" is the total compressed file size as stored on the DVD-ROM; "K-wrds" is the number, in thousands, of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFP_SPA      175     1182     3512   506562  1748787
APW_SPA      180      886     2721   402718  1244811
XIN_SPA       88      405     1238   182543   734356
TOTAL        443     2453     7471  1091823  3727954
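The token count behind the "K-wrds" column can be approximated with a short Python sketch. It assumes that every SGML tag has the form `<...>` and that tags are deleted outright before splitting on whitespace; the exact procedure LDC used is not given on this page, and the tag names in the usage note are illustrative only.

```python
import re

# Assumption: a tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    text = TAG_RE.sub("", sgml)
    return len(text.split())
```

Applied to a hypothetical document such as `<DOC id="AFP_SPA_19940512.0001" type="story">` wrapping a single paragraph "Hola mundo, esto es una prueba.", the function counts six tokens, because punctuation attached to a word is not split off.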

The following tables present "Text-MB", "K-wrds" and "#DOCs" broken down by source and DOC type; "Text-MB" represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source    Text-MB  K-wrds    #DOCs

type="advis":
AFP_SPA       144   20520    45446
APW_SPA        41    6173    11112
XIN_SPA         0       0        0
TOTAL         185   26693    56558

type="multi":
AFP_SPA        84   12711    15346
APW_SPA       351   55758   107224
XIN_SPA       189   29970    56372
TOTAL         624   98439   178942

type="other":
AFP_SPA       275   38665   160815
APW_SPA       296   40517   162448
XIN_SPA        44    6376    50168
TOTAL         615   85558   373431

type="story":
AFP_SPA      2771  434677  1527180
APW_SPA      1875  300274   964027
XIN_SPA       911  146199   627816
TOTAL        5557  881150  3119023
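A per-type breakdown like the one above could be tallied along the following lines. This is a minimal sketch under stated assumptions: each document is taken to be wrapped in `<DOC ... type="..."> ... </DOC>`, tags are deleted outright, and characters are counted as Python string characters. The function name and regular expressions are hypothetical and are not LDC's actual tooling.

```python
import re
from collections import Counter

# Assumption: DOC elements are not nested and carry a type="..." attribute.
DOC_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>(.*?)</DOC>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def tally_by_type(sgml: str) -> tuple[Counter, Counter, Counter]:
    """Return (characters, tokens, documents) counters keyed by DOC type."""
    chars, words, docs = Counter(), Counter(), Counter()
    for match in DOC_RE.finditer(sgml):
        doc_type, body = match.group(1), match.group(2)
        text = TAG_RE.sub("", body)      # drop remaining tags inside the DOC
        chars[doc_type] += len(text)     # includes whitespace, as in "Text-MB"
        words[doc_type] += len(text.split())
        docs[doc_type] += 1
    return chars, words, docs
```

Dividing the character counts by one million and the token counts by one thousand would give figures in the units of the "Text-MB" and "K-wrds" columns, assuming "MB" here means millions of characters.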

Portions © 1994-2008 Agence France Presse, © 1993-2008 The Associated Press, © 2001-2008 Xinhua News Agency, © 2006, 2009 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.