Introduction
1993-2007 United Nations Parallel Text was developed by Google
Research. It consists of United Nations (UN) parliamentary documents from
1993 through 2007 in the official languages of the UN: Arabic, Chinese, English,
French, Russian, and Spanish. There are 673,670 raw text documents and 520,283
word alignment documents.
UN parliamentary documents are available from the UN Official Document System
(UN ODS) at http://ods.un.org/. UN ODS, in
its main UNDOC database, contains the full text of all types of UN parliamentary
documents. It has complete coverage dating from 1993 and variable coverage before
that. Documents exist in one or more of the official languages of the UN: Arabic,
Chinese, English, French, Russian, and Spanish. UN ODS also contains a large
number of German documents, marked with the language "other", but these are
not included in this dataset.
For more information, see the UN ODS documentation at http://documents.un.org/help_E.htm.
For more details of the UN bibliographic systems, see http://www.un.org/depts/dhl/unbisref_manual/.
LDC has released parallel UN parliamentary documents in English, French and
Spanish spanning the period 1988-1993 as UN
Parallel Text (Complete) (LDC94T4A).
Data
The data is presented as raw text and word-aligned text. The raw text is very
close to what was extracted from the original word processing documents in UN
ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding.
The word-aligned text was normalized, tokenized, aligned at the sentence-level,
further broken into sub-sentential chunk-pairs, and then aligned at the word level.
The sentence, chunk, and word alignment operations were performed separately
for each individual language pair.
The files are packaged as tar archives and compressed using the bzip2 compression
utility. The bzip2 utility is standard in
most Linux distributions. For Windows users, there are a variety of decompression
software options; 7-Zip, for example, will decompress
both tar and bzip2 formats.
Note that in the data/aligned folder, the en-zh-1993.tar.bz2 and en-zh-1994.tar.bz2
archives decompress into empty folders. This is intentional as there is no Chinese
aligned data for those two years.
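For users who prefer to unpack the archives programmatically, the following is a minimal sketch using Python's standard tarfile module. The function name extract_archive and the output directory are illustrative assumptions, not part of the release; only the archive path in the usage comment is taken from the description above. The function returns the names of the regular files in the archive, so an empty list indicates one of the intentionally empty archives.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a bzip2-compressed tar archive into dest_dir.

    Returns the names of the regular files contained in the archive.
    An empty list means the archive holds no data files (only folders).
    The function name and return convention are illustrative assumptions.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, mode="r:bz2") as tar:
        members = tar.getmembers()
        # Newer Python versions provide extraction filters that reject
        # unsafe member paths; use the "data" filter when it is available.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)

    return [m.name for m in members if m.isfile()]


# Example usage (the output directory name is hypothetical):
#
#   files = extract_archive("data/aligned/en-zh-1993.tar.bz2", "unpacked")
#   if not files:
#       print("archive contains no aligned data")
```

Checking the returned list rather than the presence of the output folder distinguishes an empty archive from a failed extraction, since empty archives still create their folders.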
Samples
Please view these samples: raw English, raw French, and aligned English-French.
Updates
None at this time.
Content Copyright
Portions © 2012 Google Inc., © 1993-2007 United Nations, © 2013
Trustees of the University of Pennsylvania