Spanish News Text

Item Name: Spanish News Text
Authors: David Graff and Gustavo Gallegos
LDC Catalog No.: LDC95T9
ISBN: 1-58563-056-X
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES, Tipster, TREC
Application(s): information retrieval, language modeling
Language(s): Spanish
Distribution: 1 CD
Member fee: $0 for 1995, 1996 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: US $150.00
Member License: yes
Readme File: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff and Gustavo Gallegos. 1995. Spanish News Text. Linguistic Data Consortium, Philadelphia.

The Spanish News Corpus consists of journalistic text data from one newspaper (El Norte, Mexico) and from the Spanish-language services of three newswire sources: Agence France Presse, Associated Press Worldstream, and Reuters. (The Reuters collection comprises two distinct services: Reuters Spanish Language News Service and Reuters Latin American Business Report.)

All text data are stored on one CD-ROM in a standard compressed form. The four sets of newswire data (AFP, APWS and two Reuters services) are each organized as one data file per day of collection. The period covered by these collections runs from December 1993 (for APWS and Reuters) or May 1994 (AFP) through December 1995. (The El Norte data, provided to us by INFOSEL Mexico, are arbitrarily grouped into files of about 1 megabyte each when uncompressed; date information is not available for individual articles, but the general period of the collection is 1993.)

The approximate amount of data per source (when uncompressed) is indicated below, in total megabytes (MB) and millions of words of text (MW):

       Source   MB      MW
       -------------------
        AFP     345     44
        APWS    253     33
        REUSL   333     41
        REULA   233     23
        INFOSEL 209     31
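
As a rough cross-check of the figures in the table, the following Python sketch sums the per-source sizes and estimates the average number of bytes per word. It assumes 1 MB = 10**6 bytes, which the description does not state; the names `SIZES`, `totals` and `bytes_per_word` are illustrative only.

```python
# Approximate uncompressed sizes copied from the table:
# source -> (megabytes, millions of words)
SIZES = {
    "AFP": (345, 44),
    "APWS": (253, 33),
    "REUSL": (333, 41),
    "REULA": (233, 23),
    "INFOSEL": (209, 31),
}


def totals(sizes):
    """Return (total MB, total millions of words) across all sources."""
    total_mb = sum(mb for mb, _ in sizes.values())
    total_mw = sum(mw for _, mw in sizes.values())
    return total_mb, total_mw


def bytes_per_word(mb, mw):
    """Average bytes per word, ASSUMING 1 MB = 10**6 bytes."""
    return (mb * 10**6) / (mw * 10**6)


if __name__ == "__main__":
    total_mb, total_mw = totals(SIZES)
    print(f"Total: {total_mb} MB, {total_mw} million words")
    for source, (mb, mw) in SIZES.items():
        print(f"{source:8s} {bytes_per_word(mb, mw):.2f} bytes/word")
```

Because the table values are themselves approximate, the resulting ratios are only indicative.
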
The presentation of text data in these collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging is used (1) to mark article boundaries, (2) to delimit the text portion within each article and (3) to label various pieces of information about the article that are external to the text content (e.g. headlines, bylines and so on).
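
The three uses of SGML tagging described above can be illustrated with a minimal Python sketch. The tag names used here (`DOC`, `DOCNO`, `HEADLINE`, `TEXT`) are assumptions in the TIPSTER style and are not confirmed by this description; the sample data and the helper names `field` and `parse_articles` are hypothetical. Consult the corpus documentation for the actual tag inventory.

```python
import re

# Hypothetical sample in a TIPSTER-like layout; tag names are ASSUMED,
# not taken from the corpus documentation.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HEADLINE> Titular de ejemplo </HEADLINE>
<TEXT>
Primer artículo de ejemplo.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Segundo artículo, sin titular.
</TEXT>
</DOC>
"""

# (1) Article boundaries: everything between <DOC> and </DOC>.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_articles(data):
    """Split a data file into articles and pull out text and external fields."""
    return [
        {
            "docno": field(doc, "DOCNO"),        # (3) external information
            "headline": field(doc, "HEADLINE"),  # (3) may be absent
            "text": field(doc, "TEXT"),          # (2) the text portion
        }
        for doc in DOC_RE.findall(data)
    ]


if __name__ == "__main__":
    for article in parse_articles(SAMPLE):
        print(article["docno"], "|", article["headline"], "|", article["text"])
```

Regular expressions are used here only because this kind of markup is flat and non-nested; a full SGML parser would be needed for anything more elaborate.
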

The copyright holders of this text have requested that it be made available to LDC members only. Because of its release date, this corpus is available to 1995 and 1996 members. To obtain this corpus, current LDC members must submit a signed User Agreement Form.

Pricing

The Reduced Licensing Fee for this corpus is US$150.


Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.