One of the Linguistic Data Consortium's major goals for next year is the acquisition of multilingual text to support research in machine translation and other areas. Parallel multilingual texts are especially valuable, but are extremely difficult to find (see Multilingual Parallel Text Corpora, Susan Armstrong-Warwick, LDC Newsletter, Vol. 1, No. 2, pp. 3-4). A significant step forward in this effort has resulted from negotiations with the United Nations in New York. The UN has agreed to make its electronic text archives available for language research, and the LDC has taken on the task of making these archives accessible to the research community. Initial negotiations with the UN were conducted by Dragon Systems, Inc. beginning in 1990, and were continued by the LDC in December 1992.
The electronic archives consist of all UN documents of public record dating from 1988 to the present. The documents include the proceedings, resolutions and reports of the General Assembly, the Security Council, UNICEF, the Economic and Social Council, and numerous other committees, commissions and councils within the UN. The majority of archival material represents parallel text in the six official languages of the UN: English, French, Spanish, Russian, Arabic and Chinese.
So far, the LDC has received copies on tape of only the English, French and Spanish archives. The amount of text delivered to date is in the neighborhood of 2.5 gigabytes. The full extent of parallelism in these texts is not clear at present; it appears that some portion of the archives is made up of material that exists in only one or two of the three languages.
Obtaining data in the other three official languages will likely require somewhat greater effort, because the UN's archiving practices were not consistent across all languages. While the English, French and Spanish archives exist on removable 80 megabyte disk packs, the Chinese, Arabic and Russian data are only found on tape cartridges and/or 5 1/4-inch floppy disks. Since no texts from these languages have been sent to the LDC yet, it is uncertain what additional effort will be required to transform the text to an accessible format, and how much data (and parallel text) actually exist in these languages.
The UN texts were created and archived on Wang VS computer systems, using the Wang WP word processing program. The tapes delivered to the LDC were copied from the archived disk packs by means of Wang BACKUP. Each of these Wang programs uses its own file formatting scheme, which had to be reverse-engineered at the LDC so that programs could be written to extract the actual text data from the tapes. The LDC's efforts to decipher the WP character encoding, format control codes and file structure were helped substantially by Dominique Petitpierre of ISSCO, as well as by the technical support staff at Wang Office Systems.
The English, French and Spanish texts are being converted to the ISO 8859-1 (Latin1) character set, an 8-bit encoding system in which accented characters of European languages (and some other specialized symbols) are provided in the upper half of the 256-character table. Common 7-bit ASCII, or ISO 646, occupies the lower half of the table. In addition, the various WP text formatting control codes (such as line-centering, underlining, indentation, tab-stop settings, etc.) are being preserved in the form of Standard Generalized Markup Language (SGML) tags. Considerable care is being taken to ensure that the resulting text files are fully SGML compatible and parsable. In this regard, we are especially grateful to David McKelvie of the HCRC in Edinburgh, Scotland, for providing a critique and verification of some extracted samples, and for creating a complete SGML Document Type Definition (DTD) and character set specification, which will be distributed with the data when it is published.
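The general shape of such a conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LDC's extraction software: the source byte values, the tag names and the decision to escape markup characters as entities are all hypothetical, since neither the real Wang WP codes nor the DTD are reproduced in this article.

```python
# Illustrative sketch only. The byte values and tag names below are
# HYPOTHETICAL; they are not the real Wang WP codes or the LDC's DTD.
CONTROL_TAGS = {
    0x01: "<center>",
    0x02: "</center>",
    0x03: "<underline>",
    0x04: "</underline>",
}

# Hypothetical source codes for two accented letters (e-acute, n-tilde).
ACCENT_MAP = {0x80: "\u00e9", 0x81: "\u00f1"}

# Assumption: literal markup characters in the running text are escaped
# so that they cannot be mistaken for SGML tags or entity references.
SGML_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def convert(raw: bytes) -> bytes:
    """Map source bytes to Latin-1 text with formatting kept as tags."""
    out = []
    for b in raw:
        if b in CONTROL_TAGS:
            out.append(CONTROL_TAGS[b])
        elif b in ACCENT_MAP:
            out.append(ACCENT_MAP[b])
        elif b == 0x0A or 0x20 <= b <= 0x7E:
            # Printable ASCII and newline pass through; in ISO 8859-1 the
            # lower half of the table is identical to 7-bit ASCII.
            ch = chr(b)
            out.append(SGML_ESCAPES.get(ch, ch))
        else:
            # Fail loudly instead of silently dropping an unknown code.
            raise ValueError(f"unmapped source byte 0x{b:02X}")
    # Accented letters land in the upper half of the table (>= 0x80).
    return "".join(out).encode("latin-1")
```

Raising an error on any unmapped byte is a deliberate choice in this sketch: when a format is being reverse-engineered, an unknown code is more useful as a visible failure than as a silently dropped character.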
It is not clear at present what solution will be adopted for character encoding in the three other languages, once these become available from the UN. We would welcome suggestions from LDC members as to the most widely used methods for encoding Arabic, Russian and Chinese, bearing in mind that the resulting files must accommodate some Roman alphabetic and decimal numeric strings interspersed with the running text, as well as, presumably, SGML tags. The LDC would like to develop (or approximate) a consensus on this issue.
The initial publication of parallel text data from the UN is expected to be ready for release in the fall of 1993. It will consist of one CD-ROM each for English, French and Spanish (approximately 650 megabytes per disc). The directory and file structure on each disc will directly reflect the parallel relations among texts (i.e., a given document will have the same path and file name on each disc, with the exception of a single component that describes the language of each file). While there will be no attempt to insert tags specifically to mark alignment points within parallel documents, there will be an abundance of cues for alignment within the text itself, owing to the fairly consistent use of typographic formatting (chapter headings, etc.), and frequent use of sequence numbers assigned to each paragraph.
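To illustrate how such a layout lets a program locate the parallel versions of a document, here is a minimal Python sketch. The language component names ("english", "french", "spanish") and the sample path are assumptions made purely for the example; the actual directory and file names on the discs have not been announced here.

```python
from pathlib import PurePosixPath

# HYPOTHETICAL names for the single path component that identifies the
# language; the real names used on the discs may differ.
LANGUAGES = ("english", "french", "spanish")


def parallel_paths(path):
    """Return the path of the same document in every language.

    Assumes exactly one component of `path` names a language and that
    all other components are identical across the three discs.
    """
    parts = PurePosixPath(path).parts
    hits = [i for i, part in enumerate(parts) if part in LANGUAGES]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one language component in {path!r}")
    i = hits[0]
    # Swap only the language component; everything else stays the same.
    return {
        lang: str(PurePosixPath(*parts[:i], lang, *parts[i + 1:]))
        for lang in LANGUAGES
    }
```

For example, given the hypothetical path "english/1990/ga/doc001.sgm", the function returns the same path with "french" and "spanish" substituted for the language component. Whether a document actually exists in all three languages would still have to be checked against the discs, since some material appears in only one or two of them.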
The UN corpus promises to be an invaluable resource for researchers in machine translation. The UN will benefit from this work as well, in that the archives will be returned to them in a converted form that will be portable to their now PC-based word processing systems.