Introduction
French Gigaword Third Edition is a comprehensive archive of newswire text data that has been
acquired over several years by the Linguistic Data Consortium (LDC) at the University
of Pennsylvania. This third edition updates
French Gigaword Second Edition (LDC2009T28) and adds material collected from
January 1, 2009 through December 31, 2010.
The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as
follows:
- Agence France-Presse (afp_fre) May 1994 - Dec. 2010
- Associated Press French Service (apw_fre) Nov. 1994 - Dec. 2010
The seven-character codes in parentheses consist of a three-character
source name abbreviation and the three-character language code
(fre), separated by an underscore (_) character. The three-letter
language code conforms to the ISO 639-2/B standard.
Data
Each data file name consists of the 7-letter prefix plus another
underscore character, followed by a 6-digit date (representing the
year and month during which the file contents were generated by the
respective news source), followed by a .gz file extension,
indicating that the file contents have been compressed using the GNU
gzip compression utility (RFC 1952).
Each file therefore contains all the usable data that LDC received from the
given news source for the given month.
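The naming scheme described above can be unpacked mechanically. The following is a minimal sketch, assuming that the 6-digit date is written year first (YYYYMM) and that source abbreviations are lowercase letters; the function name parse_file_name and the sample file name are illustrative and do not come from the corpus documentation.

```python
import re
from datetime import date

# Assumed layout: 3-char source, "_", 3-char language code, "_", YYYYMM, ".gz"
FILE_NAME_RE = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


def parse_file_name(name):
    """Split a corpus file name into (source, language, first day of the month)."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, language, year, month = match.groups()
    # date() raises ValueError for an impossible month such as 13.
    return source, language, date(int(year), int(month), 1)


if __name__ == "__main__":
    # Illustrative name built from the documented pattern.
    print(parse_file_name("afp_fre_199405.gz"))
```

Returning a date pinned to the first of the month is only a convenience for sorting and filtering; the file itself covers the whole month.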
All text data are presented in SGML form, using a very simple, minimal
markup structure. All text consists of printable ASCII, white space,
and printable code points in the Latin-1 Supplement character table,
as defined by the Unicode Standard (ISO 10646), for the accented
characters used in French. The Latin-1 Supplement (accented) characters
are presented in UTF-8 encoding.
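One way to check decoded text against this character inventory is sketched below. The exact code point ranges are assumptions made for illustration (U+0021 to U+007E for printable ASCII, U+00A1 to U+00FF for printable Latin-1 Supplement, plus space, tab, newline and carriage return as white space); the documentation does not spell them out, and the function name is hypothetical.

```python
# Assumed white space set; the documentation only says "white space".
ALLOWED_WHITESPACE = {" ", "\t", "\n", "\r"}


def unexpected_characters(text):
    """Return the set of characters outside the assumed allowed ranges."""
    unexpected = set()
    for ch in text:
        code = ord(ch)
        if ch in ALLOWED_WHITESPACE:
            continue
        if 0x21 <= code <= 0x7E:  # printable ASCII (assumed range)
            continue
        if 0xA1 <= code <= 0xFF:  # printable Latin-1 Supplement (assumed range)
            continue
        unexpected.add(ch)
    return unexpected


if __name__ == "__main__":
    raw = "Le président est arrivé à Genève.\n".encode("utf-8")
    # Decoding as UTF-8 fails loudly if the bytes are not valid UTF-8.
    print(unexpected_characters(raw.decode("utf-8")))
```

An empty result means every character falls inside the assumed ranges; anything else is worth inspecting before further processing.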
The file dtd/gigaword_f.dtd provides the formal Document Type
Definition (DTD) for parsing the SGML content. The corpus
has been fully validated by a standard SGML parser utility (nsgmls),
using this DTD file.
The SGML structure for this release differs notably from the markup
strategy used in early (pre-Gigaword) LDC publications of newswire
data; the differences are intended to facilitate bulk processing of
the present corpus. The major differences are:
- Early corpora usually organized the data as one file per day, or
limited the average file size to one megabyte (MB).
Typical compressed file sizes in the current corpus range from about
0.1 MB to about 10 MB; this equates to a range of about 0.5 to 30 MB
per file when the data are uncompressed. In general, these files are
not intended for use with interactive text editors or word processing
software (though many such programs are likely to work reasonably well
with these files). Rather, it is expected that the files will be used
as input to programs that are geared to dealing with data in such
quantities, for filtering, conditioning, indexing, statistical
summary, etc.
- Early corpora tended to use different markup outlines (different tag sets)
depending on the data source; the structural properties of each source's data
were generally preserved to the extent possible (even though many elements of
the delivered structure may have been meaningless for research use).
The present corpus uses only the information structure that is common
to all sources and serves a clear function: headline, dateline, and
core news content (usually containing paragraphs). The dateline is
a brief string typically found at the beginning of the first paragraph
in each news story, giving the location the report is coming from, and
sometimes the news service and/or date. Since this content is not part
of the initial sentence, we separate it from the first paragraph (this
was not done prior to the Gigaword corpora).
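Because the files are meant to be consumed by programs rather than opened in an editor, a streaming reader is a natural starting point. The following is a minimal sketch using Python's standard gzip module; the function name iter_lines and the path shown are illustrative assumptions, not part of the corpus distribution.

```python
import gzip


def iter_lines(path):
    """Yield decoded lines from one gzip-compressed file without loading it whole."""
    # Text mode with an explicit encoding decodes the UTF-8 content on the fly.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py path/to/afp_fre_199405.gz (illustrative path)
    line_count = sum(1 for _ in iter_lines(sys.argv[1]))
    print(line_count)
```

Reading line by line keeps memory use flat even for the largest files, which reach roughly 30 MB uncompressed.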
For all of the documents in this corpus, we have applied a rudimentary (and
approximate) categorization of DOC units into four distinct types. The classification
is indicated by the type=string attribute that is included in each opening
DOC tag. The four types are:
- story : This is by far the most frequent type, and it represents the
most typical newswire item: a coherent report on a particular topic
or event, consisting of paragraphs and full sentences.
- multi : This type of DOC contains a series of unrelated blurbs,
each of which briefly describes a particular topic or event; this is
typically applied to DOCs that contain "summaries of today's news",
"news briefs in ..." (some general area like finance or sports), and
so on.
- advis : (short for advisory) These are DOCs which the news service
addresses to news editors -- they are not intended for publication
to the end users (the populations who read the news).
- other : This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for broad circulation
(they are not advisories), they may be topically coherent (unlike multi
type DOCs), and they typically do not contain paragraphs or sentences (they
are not really stories); these are things like lists of sports scores, stock
prices, temperatures around the world, and so on.
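The type attribute makes it straightforward to tally or filter DOC units by category. The sketch below assumes that opening tags look like `<DOC id="..." type="story">` with double-quoted attribute values; that exact shape, the id values in the sample, and the function name count_doc_types are assumptions for illustration and should be checked against the DTD.

```python
import re
from collections import Counter

# Assumed shape of an opening tag: <DOC ... type="story">
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(sgml_text):
    """Count DOC units by the value of their type attribute."""
    return Counter(m.group(1).lower() for m in DOC_TYPE_RE.finditer(sgml_text))


if __name__ == "__main__":
    # Hypothetical sample; the id values are invented for the example.
    sample = (
        '<DOC id="EXAMPLE.0001" type="story">\n</DOC>\n'
        '<DOC id="EXAMPLE.0002" type="advis">\n</DOC>\n'
        '<DOC id="EXAMPLE.0003" type="story">\n</DOC>\n'
    )
    print(count_doc_types(sample))
```

A regular expression is used here only because the markup is described as minimal; a full SGML parser driven by the DTD would be the more robust choice.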
The overall totals for each source are summarized below. Note that the Totl-MB
numbers show the amount of data when the files are uncompressed (i.e. approximately
5.7 gigabytes, total); the Gzip-MB column shows totals for compressed file
sizes as stored on the DVD-ROM; the K-wrds numbers are simply the number of
white space-separated tokens (of all types) after all SGML tags are eliminated.
| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
| --- | --- | --- | --- | --- | --- |
| afp_fre | 195 | 1503 | 4255 | 641381 | 2356888 |
| apw_fre | 194 | 489 | 1446 | 221470 | 801075 |
| TOTAL | 389 | 1992 | 5701 | 862851 | 3157963 |
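The token counting rule behind the K-wrds column (white space-separated tokens after all SGML tags are eliminated) can be approximated as follows. This is a sketch under stated assumptions, not the procedure LDC actually used: tags are matched with a simple regular expression and replaced by a space, and the sample markup is hypothetical.

```python
import re

# Anything between "<" and the next ">" is treated as a tag (an assumption
# that holds only for simple markup without ">" inside attribute values).
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count white space-separated tokens after removing SGML tags."""
    # Replacing each tag with a space keeps adjacent words from merging.
    return len(TAG_RE.sub(" ", sgml_text).split())


if __name__ == "__main__":
    sample = '<DOC type="story">\n<P>\nLe président est arrivé.\n</P>\n</DOC>\n'
    print(count_tokens(sample))
```

Dividing such a count by 1000 gives a figure on the same scale as the K-wrds column; exact agreement with the published totals is not guaranteed, since the handling of tags and entities may differ.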
Content Copyright
Portions © 1994-2010 Agence France-Presse, © 1994-2010 The Associated
Press, © 2006, 2009, 2011 Trustees of the University of Pennsylvania