LDC94T4A - Complete UN Parallel Text corpus
LDC94T4B-1 - English text only
LDC94T4B-2 - French text only
LDC94T4B-3 - Spanish text only
This set of three compact discs contains documents provided to the LDC by
the United Nations, for use in research on machine translation technology. The
documents come from the Office of Conference Services at the UN in New York and
are drawn from archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives, with
data from each language stored on a separate disc in the set. Care has been
taken to arrange the document files in a parallel directory structure for each
language, so that corresponding translations of a document are found directly
by means of the directory paths and file names.
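
As an illustration of how the parallel directory structure might be used, the following Python sketch maps a file on the English disc to its counterpart on another disc. It assumes that a translation has exactly the same relative path and file name under each disc's root, which this description suggests but does not spell out. The function names and the disc root paths are hypothetical.

```python
from pathlib import Path


def parallel_path(english_file: Path, english_root: Path, other_root: Path) -> Path:
    """Map a file on the English disc to the same relative path on another disc.

    Assumption: the relative path and file name are identical across discs.
    """
    relative = english_file.relative_to(english_root)
    return other_root / relative


def find_parallels(english_file: Path, english_root: Path, other_roots: dict) -> dict:
    """Return {language: path} for each language whose counterpart file exists."""
    found = {}
    for language, root in other_roots.items():
        candidate = parallel_path(english_file, english_root, root)
        if candidate.is_file():
            found[language] = candidate
    return found
```

Because an English file may have a French counterpart, a Spanish counterpart, or both, the lookup returns only the languages actually found rather than failing when one is missing.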
All parallel files in this corpus are English-based: for every file on the
English disc, there will be a corresponding file on either the French or
Spanish disc, or both. Tables are included on all discs to assist in
determining which parallels are present. The total content by language is
summarized below (values are approximate):
                    No. of      Millions
Language            documents   of words
-----------------------------------------
English             22,000      59
French              20,000      58
Spanish             14,400      48
French/Spanish
  parallel data     12,700      38 (per language)
-----------------------------------------
In preparing the text for publication, we have applied SGML (Standard
Generalized Markup Language) tagging that preserves all typographic and
meta-information that was present in the UN archival files. For those
researchers who use SGML, a working DTD (Document Type Definition) is provided
on each disc. For those who do not need SGML markup, a simple script is
included, for use with the sed (stream-editor) utility, that will filter
out the SGML-specific material and meta-information, leaving only the plain
text. (Sed is a standard utility on Unix systems, and is also available as
free software for MS-based systems.) The character set used is the 8-bit ISO
8859-1 (Latin-1), in which accented letters and some other non-ASCII characters
occupy the upper 128 entries of the character table.
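
For readers working outside sed, the following Python sketch shows one way to decode a file's bytes as ISO 8859-1 and drop anything that looks like an SGML tag. It is not the sed script supplied on the discs: it removes only the tags themselves, and any meta-information stored as element content is left in place. The function name and the tag pattern are assumptions made for illustration.

```python
import re

# Assumption: every tag is a single "<...>" span containing no ">" character.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(raw: bytes) -> str:
    """Decode ISO 8859-1 bytes and remove SGML-style tags."""
    # Every byte value is valid in ISO 8859-1, so this decode cannot fail.
    text = raw.decode("iso-8859-1")
    return TAG_PATTERN.sub("", text)
```

Decoding explicitly matters here because the accented letters occupy byte values above 127, which a UTF-8 decoder would reject or misread.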
Parallel samples of the three languages in this publication are listed
below.
Based on the combined usage of title strings and document numbers, it was
possible to identify parallel sets amounting to over 60% of the data in the
archive (a total of 56,684 files in 21,986 parallel sets). We have yet to find a
reasonable method for doing a more careful search for parallels in the
remaining 40%. Part of this residue is due to the fact that this corpus
contains only English-based parallel sets; parallel sets that included only
French and Spanish versions have not been included in this release.
Users of this corpus must be warned that the parallel sets identified by
this automatic method will include errors. We have observed a number of cases
(over 700 in the corpus as a whole) where the members of a parallel set show a
serious discrepancy in quantity of text. Also, we must expect that at least
some of these sets (and perhaps some less obvious cases) constitute a complete
mismatch. The reftable files in the tables directory give an indication of
the relative consistency among members of each parallel set in terms of overall
size. From these tables, the least likely candidates for parallelism can be
easily identified.
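
The kind of size comparison described above can be sketched in Python as follows. The layout of the reftable files is not described here, so this sketch does not parse them; it assumes the sizes of each set's members have already been collected into a dictionary. The function names, the input structure and the 0.5 threshold are all hypothetical.

```python
def size_ratio(sizes: dict) -> float:
    """Return smallest/largest member size for one parallel set (1.0 = equal)."""
    smallest = min(sizes.values())
    largest = max(sizes.values())
    if largest == 0:
        return 0.0  # a set whose members are all empty is treated as suspect
    return smallest / largest


def suspicious_sets(parallel_sets: dict, threshold: float = 0.5) -> list:
    """List the set names whose size ratio is below threshold, worst first."""
    ranked = sorted((size_ratio(sizes), name) for name, sizes in parallel_sets.items())
    return [name for ratio, name in ranked if ratio < threshold]
```

A low ratio marks a set only as a candidate for inspection. Translations differ naturally in length, so any threshold would need to be tuned against the actual data.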
Content Copyright
Portions © 1988-1993 United Nations, © 1994 Trustees of the University of Pennsylvania