Introduction
French Gigaword Second Edition is a comprehensive archive of newswire text
data that has been acquired over several years by LDC. This second edition updates
French Gigaword First Edition (LDC2006T17) and adds material collected from August
1, 2006 through December 31, 2008.
The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as
follows:
- Agence France-Presse (afp_fre) May 1994 - Dec 2008
- Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008
The seven-character codes in parentheses include the three-character source name
abbreviations and the three-character language code ("fre") separated by an
underscore ("_") character. The three-letter language code conforms to LDC's
internal convention based on the ISO 639-3 standard. These codes are used in
the directory names where the data files are found and in the prefix that appears
at the beginning of every data file name. They are also used (in all UPPER CASE)
as the initial portion of the DOC "id" strings that uniquely identify each news
story.
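As a minimal sketch of how these codes decompose, the following Python splits a code into its source and language parts and derives the upper-case form used at the start of DOC "id" strings. The function name `split_source_code` is hypothetical and is not part of the corpus distribution; nothing is assumed here about the layout of an "id" string beyond this prefix.

```python
def split_source_code(code):
    """Split a code such as 'afp_fre' into (source, language, id_prefix).

    Hypothetical helper for illustration only.
    """
    source, sep, language = code.partition("_")
    # Each code is a three-character source, an underscore, and a
    # three-character language code.
    if sep != "_" or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    # The upper-case code is the initial portion of each DOC "id" string.
    return source, language, code.upper()


for code in ("afp_fre", "apw_fre"):
    print(split_source_code(code))
```

The same lower-case code names the data directory and prefixes each data file name, so a single helper of this kind can serve both file lookup and "id" matching.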
Data
The overall totals for each source are summarized below. The "Totl-MB" numbers
show the amount of data obtained when the files are uncompressed (i.e., approximately
5.8 gigabytes, total); the "Gzip-MB" column shows totals for compressed file
sizes as stored on the DVD-ROM; and the "K-wrds" numbers are the number of whitespace-separated
tokens (of all types) after all SGML tags are eliminated.
| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
|---|---|---|---|---|---|
| AFP_FRE | 172 | 2408 | 4079 | 560000 | 2060803 |
| APW_FRE | 171 | 2280 | 1719 | 241324 | 872573 |
| TOTAL | 343 | 4688 | 5789 | 801324 | 2933376 |
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.
| Source | Text-MB | K-wrds | #DOCs |
|---|---|---|---|
| type="advis": | | | |
| AFP_FRE | 88 | 11788 | 48712 |
| APW_FRE | 14 | 2303 | 9235 |
| TOTAL | 103 | 14091 | 57947 |
| type="multi": | | | |
| AFP_FRE | 59 | 8411 | 10269 |
| APW_FRE | 194 | 29828 | 52240 |
| TOTAL | 253 | 38239 | 62509 |
| type="other": | | | |
| AFP_FRE | 178 | 58514 | 8411 |
| APW_FRE | 82 | 193981 | 29828 |
| TOTAL | 260 | 38239 | 38239 |
| type="story": | | | |
| AFP_FRE | 1824 | 198440 | 27216 |
| APW_FRE | 729 | 87662 | 13006 |
| TOTAL | 2553 | 286102 | 40222 |
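The "K-wrds" and "Text-MB" definitions can be illustrated with a short Python sketch that removes SGML tags and then counts whitespace-separated tokens and characters. This is an approximation under stated assumptions, not LDC's actual counting procedure: tags are matched with a simple regular expression, the function name `text_stats` is hypothetical, and the sample document is invented rather than taken from the corpus.

```python
import re

# Assumption: tags are simple and contain no '>' inside attribute values.
TAG_RE = re.compile(r"<[^>]+>")


def text_stats(sgml):
    """Return (token_count, char_count) for text with SGML tags removed.

    Hypothetical helper for illustration only.
    """
    text = TAG_RE.sub("", sgml)
    # Tokens are whitespace-separated strings of any type;
    # the character count includes whitespace.
    return len(text.split()), len(text)


# Invented sample, not taken from the corpus.
sample = '<DOC id="EXAMPLE" type="story">Bonjour le monde .</DOC>'
tokens, chars = text_stats(sample)
print(tokens, chars)
```

Summing these counts over all documents of a given source and type would give figures comparable in kind to the columns above, once scaled to thousands of tokens and megabytes of characters.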
The data has undergone a consistent level of quality control to eliminate
out-of-band content and other obvious forms of corruption. Since the source
data is generated manually on a daily basis, there will be a small percentage
of human errors common to all sources: missing whitespace, incorrect or variant
spellings, badly formed sentences, and so on, as are normally seen in newspapers.
No attempt has been made to address this property of the data.
Samples
For an example of the data in this corpus, please view this image of the text of French Gigaword.
Content Copyright
Portions © 1994-2008 Agence France-Presse, © 1994-2008 The Associated
Press, © 2006, 2009 Trustees of the University of Pennsylvania