LDC94T4A - Complete UN Parallel Text corpus
LDC94T4B-1 - English text only
LDC94T4B-2 - French text only
LDC94T4B-3 - Spanish text only
This set of three compact discs contains documents provided
to the LDC by the United Nations, for use in research on machine
translation technology. The documents come from the Office of
Conference Services at the UN in New York and are drawn from
archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set.
Care has been taken to arrange the document files in a parallel
directory structure for each language, so that corresponding
translations of a document are found directly by means of the
directory paths and file names.
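
As a minimal sketch of how the parallel layout can be used, the following Python function maps a file on the English disc to the matching location on another disc. The mount points and the file name shown here are hypothetical, and the sketch assumes that the relative path of a document is identical on each disc; consult the tables on the discs for the actual layout.

```python
from pathlib import PurePosixPath


def parallel_path(english_path, english_root, other_root):
    """Return the path under other_root that mirrors english_path.

    Assumes (not confirmed by the corpus documentation) that a document
    has the same relative path on every disc.
    """
    relative = PurePosixPath(english_path).relative_to(PurePosixPath(english_root))
    return PurePosixPath(other_root) / relative


# Hypothetical mount points and file name, for illustration only.
french = parallel_path(
    "/cdrom/english/1990/doc0001.sgm", "/cdrom/english", "/cdrom/french"
)
print(french)
```

The function only computes a candidate path; whether the file actually exists must still be checked, since not every English file has a parallel in both languages.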
All parallel files in this corpus are English-based: for every file on
the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to
assist in determining which parallels are present. Due to the nature
and organization of UN translation services and the original
electronic text archives, the process of finding and sorting out
parallel documents yielded numerous gaps, with many files in each
language having no parallel in other languages.
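
The format of the tables on the discs is not described here, so the sketch below does not read them. Instead it shows, under the assumption that parallel files share the same relative path, how one might compute which parallels are present from three lists of relative paths (for example, gathered by walking each disc). The function name and the sample paths are illustrative only.

```python
def parallel_coverage(english_files, french_files, spanish_files):
    """Map each English relative path to the languages that have a parallel."""
    french = set(french_files)
    spanish = set(spanish_files)
    coverage = {}
    for path in english_files:
        languages = []
        if path in french:
            languages.append("french")
        if path in spanish:
            languages.append("spanish")
        coverage[path] = languages
    return coverage


# Hypothetical relative paths, for illustration only.
coverage = parallel_coverage(
    ["1990/a.sgm", "1990/b.sgm"],
    ["1990/a.sgm", "1990/b.sgm"],
    ["1990/b.sgm"],
)
for path, languages in coverage.items():
    print(path, languages)
```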
In preparing the text for publication, we have applied a
fully compliant SGML (Standard Generalized Markup Language) format.
For those researchers who use SGML, a working DTD (Document Type
Definition) is provided on each disc. For those who do not need SGML
markup, a simple script is included that can be used to filter out the
SGML-specific material and leave only the plain text. The character
set used is the 8-bit ISO 8859-1 (Latin-1), in which accented letters and
some other non-ASCII characters occupy the upper 128 entries of the
character table.
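
The filter script supplied on the discs is not reproduced here. As a rough stand-in, the Python sketch below decodes a file's bytes as ISO 8859-1 and removes anything that looks like an SGML tag. The tag name in the sample is hypothetical, and the regular expression is a simplification: it does not expand entity references and it assumes no tag contains a ">" inside an attribute value.

```python
import re

# Matches anything from "<" up to the next ">"; a simplification of SGML syntax.
TAG_PATTERN = re.compile(r"<[^>]*>")


def sgml_to_plain_text(raw_bytes):
    """Decode ISO 8859-1 bytes and strip tag-like markup."""
    text = raw_bytes.decode("iso-8859-1")
    return TAG_PATTERN.sub("", text)


# 0xE9 is "é" in ISO 8859-1; the <p> tag is a made-up example.
sample = b"<p>R\xe9solution adopt\xe9e</p>"
print(sgml_to_plain_text(sample))  # Résolution adoptée
```

Decoding explicitly as ISO 8859-1 matters because the accented letters sit in the upper half of the character table; reading the files as ASCII or UTF-8 would fail or garble them.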