


UN Parallel Text (Complete)

Item Name: UN Parallel Text (Complete)
Authors: David Graff
LDC Catalog No.: LDC94T4A
ISBN: 1-58563-038-1
Data Type: text
Data Source(s): government documents
Application(s): machine translation
Language(s): English, French, Spanish
Language ID(s): eng, fra, spa
Distribution: 1 DVD
Member fee: $0 for 1994 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $450.00
Non-member License: yes
Member License: yes
Readme File: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff. 1994. UN Parallel Text (Complete). Linguistic Data Consortium, Philadelphia.

LDC94T4A - Complete UN Parallel Text corpus

LDC94T4B-1 - English text only
LDC94T4B-2 - French text only
LDC94T4B-3 - Spanish text only

This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993.

This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.
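As an illustration of how the parallel directory structure can be used, the following minimal sketch maps an English file to the paths where its French and Spanish counterparts would be expected. The mount points and the example file name are hypothetical, and the sketch assumes that relative paths and file names are identical across discs; the actual naming on the discs may differ in detail.

```python
from pathlib import Path

# Hypothetical mount points, one per disc. These are assumptions made
# for illustration and are not taken from the corpus documentation.
DISC_ROOTS = {
    "english": Path("/mnt/un_english"),
    "french": Path("/mnt/un_french"),
    "spanish": Path("/mnt/un_spanish"),
}


def parallel_candidates(english_file, roots=DISC_ROOTS):
    """Return the expected French and Spanish paths for an English file.

    Assumes the same relative path is used under every language root.
    """
    relative = Path(english_file).relative_to(roots["english"])
    return {
        language: root / relative
        for language, root in roots.items()
        if language != "english"
    }


def existing_parallels(english_file, roots=DISC_ROOTS):
    """Keep only the candidate paths that are actually present on disk."""
    candidates = parallel_candidates(english_file, roots)
    return {lang: path for lang, path in candidates.items() if path.is_file()}
```

Because not every English file has both a French and a Spanish counterpart, `existing_parallels` filters the candidates by checking whether each file exists.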

All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. The total content by language is summarized below (values are approximate):

                     No. of      Millions
Language             documents   of words
-------------------------------------------
English              22,000      59
French               20,000      58
Spanish              14,400      48

French/Spanish
parallel data        12,700      38 (per language)
-------------------------------------------

In preparing the text for publication, we have applied SGML (Standard Generalized Markup Language) tagging that preserves all typographic and meta-information that was present in the UN archival files. For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included, for use with the sed (stream editor) utility, that will filter out the SGML-specific material and meta-information, leaving only the plain text. (Sed is a standard utility on Unix systems, and is also available as free software for MS-based systems.) The character set used is the 8-bit ISO 8859-1 (Latin-1), in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
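For readers without sed, the same general idea can be sketched in Python. This is not the script distributed on the discs: it only decodes the ISO 8859-1 bytes and removes anything that looks like an SGML tag. It does not drop meta-information elements, since their names are not given here, and the function name is an illustrative assumption.

```python
import re

# Matches anything between "<" and the next ">", including tags that
# span several lines. This is a crude approximation of SGML markup.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(raw_bytes):
    """Decode ISO 8859-1 bytes and remove tag-like material.

    Returns the remaining non-empty lines joined by newlines.
    """
    text = raw_bytes.decode("iso-8859-1")
    without_tags = TAG_PATTERN.sub("", text)
    lines = [line.strip() for line in without_tags.splitlines()]
    return "\n".join(line for line in lines if line)
```

Decoding explicitly as ISO 8859-1 matters because the accented letters are stored as single bytes in the upper half of the character table; reading the files as UTF-8 would fail or garble them.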

Parallel samples of the three languages in this publication are listed below.

Based on the combined usage of title strings and document numbers, it was possible to identify parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets). We have yet to find a reasonable method for doing a more careful search for parallels in the remaining 40%. Part of this residue is due to the fact that this corpus contains only English-based parallel sets; parallel sets that included only French and Spanish versions have not been included in this release.

Users of this corpus must be warned that the parallel sets identified by this automatic method will include errors. We have observed a number of cases (over 700 in the corpus as a whole) where the members of a parallel set show a serious discrepancy in quantity of text. Also, we must expect that at least some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The reftable files in the tables directory give an indication of the relative consistency among members of a parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be easily identified.
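The kind of size-based screening described above can be sketched as follows. The format of the reftable files is not described here, so this sketch does not parse them; it assumes the member sizes of each parallel set have already been collected into a dictionary. The set identifiers, the sizes, and the 0.5 threshold are hypothetical values chosen for illustration.

```python
def size_ratio(sizes):
    """Ratio of the smallest to the largest member size (1.0 = identical)."""
    largest = max(sizes.values())
    if largest == 0:
        return 1.0
    return min(sizes.values()) / largest


def flag_suspect_sets(parallel_sets, threshold=0.5):
    """Return (set_id, ratio) pairs whose ratio falls below the threshold.

    The result is sorted so the least likely parallels come first.
    """
    suspects = [
        (set_id, size_ratio(sizes))
        for set_id, sizes in parallel_sets.items()
        if size_ratio(sizes) < threshold
    ]
    return sorted(suspects, key=lambda item: item[1])


# Hypothetical sizes in bytes, keyed by a made-up set identifier.
example_sets = {
    "doc_a": {"english": 10000, "french": 11200},
    "doc_b": {"english": 9000, "french": 2500, "spanish": 9400},
}
suspect_ids = [set_id for set_id, _ in flag_suspect_sets(example_sets)]
```

A low ratio only indicates a discrepancy in quantity of text; it does not by itself prove that the members are mismatched, and a suitable threshold would need to be chosen by inspecting the data.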

Content Copyright

Portions © 1988-1993 United Nations, © 1994 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.