Introduction
Hong Kong Parallel Text was produced by the Linguistic Data Consortium (LDC). Its catalog number is
LDC2004T08 and its ISBN is 1-58563-290-2.
To support the research and development of automatic machine translation
systems, LDC was sponsored to create English-Chinese parallel text collected
from the Hong Kong Special Administrative Region (HKSAR).
Hong Kong Parallel Text
comprises three sub-corpora: Hong Kong Hansards, Hong Kong Laws
and Hong Kong News. Hong Kong Hansards contains excerpts
from the Official Record of Proceedings of the Legislative Council of the
HKSAR. Hong Kong Laws contains law codes acquired from the
Department of Justice of the HKSAR. Hong Kong News contains press releases
from the Information Services Department of the HKSAR.
Three similar corpora, Hong Kong Hansards Parallel Text, Hong Kong Laws Parallel Text and
Hong Kong News Parallel Text, were published in 2000.
The 2000 versions of Hong Kong Hansards Parallel Text and Hong Kong News Parallel Text
are aligned at document level, while the 2004 versions are aligned at sentence level.
The 2000 and 2004 versions of Hong Kong News Parallel Text are aligned using different
sentence alignment algorithms; as a result, the 2004 version has better sentence alignment
and slightly more data than the 2000 version.
Data
Hong Kong Hansards
Hong Kong Hansards contains excerpts from the Official Record of
Proceedings (hansards) of the Legislative Council of the HKSAR from October 1985 to
April 2003. The LDC downloaded the hansards, which were in PDF format, from the
official website of the HKSAR. A total of 1,428 files (714 in Chinese, 714 in
English) were downloaded. One-to-one correspondence between the English hansards
and the Chinese hansards was indicated by the file names. The LDC converted the
PDF files into plain text files using automatic conversion software and segmented
the files at sentence boundaries. Efforts were made to remove tables
from all files.
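The file-name correspondence described above can be sketched as follows. This is a minimal illustration only: the `_en`/`_zh` suffixes and the sample file names are hypothetical, since the actual naming scheme of the downloaded hansards is not given here.

```python
from pathlib import PurePath

# Hypothetical language markers; the real corpus may use a different scheme.
LANG_SUFFIXES = {"en": "_en", "zh": "_zh"}


def pair_by_name(filenames):
    """Return {stem: {lang: filename}} for names that share a stem."""
    pairs = {}
    for name in filenames:
        stem = PurePath(name).stem  # drop the extension
        for lang, suffix in LANG_SUFFIXES.items():
            if stem.endswith(suffix):
                key = stem[: -len(suffix)]  # drop the language marker
                pairs.setdefault(key, {})[lang] = name
    return pairs


def complete_pairs(pairs):
    """Keep only stems that have both an English and a Chinese file."""
    return {k: v for k, v in pairs.items() if set(v) == set(LANG_SUFFIXES)}
```

With 714 files per language, every stem would be expected to survive `complete_pairs`; any stem that does not points to a missing counterpart.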
Hong Kong Laws
Hong Kong Laws contains statute laws of Hong Kong, downloaded in 2000 from the Bilingual Laws Information System (BLIS, http://www.justice.gov.hk/),
a searchable electronic database of the statute laws of Hong Kong, established
and updated by the Department of Justice of the HKSAR.
The original BLIS database contains statute laws of Hong Kong
in English and Chinese, constitutional instruments, national laws and other
relevant instruments, collections of terms and expressions used in the laws of
Hong Kong and subject indices of Ordinances. This corpus contains only statute
laws of Hong Kong in English and Chinese, constitutional instruments, national
laws and other relevant instruments published up to the year 2000.
The original files were in HTML format, and document level alignment was
indicated by file names. The LDC converted the
HTML files into plain text files using automatic conversion software, and segmented
the files at sentence boundaries. Efforts were made to remove tables
from all files.
Hong Kong News
Hong Kong News contains press releases from July 1997 to October 2003 from
the government of the HKSAR. The HKSAR publishes press releases in both Chinese and
English on a daily basis. Most press releases are available in both languages:
some were translated from English to Chinese, and others were translated from
Chinese to English.
The original files were in HTML format. The LDC converted the
HTML files into plain text files using automatic conversion software. Efforts
were made to remove tables from all files.
The original files do not indicate document level alignment in any way. The
document level alignment was done at the LDC, using an automatic document
aligner. The document-aligned files were then segmented at sentence boundaries.
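As an illustration of what segmenting at sentence boundaries involves, the sketch below splits text on sentence-final punctuation. It is an assumption-laden toy, not the segmenter used by the LDC: the function name and the punctuation sets are chosen here for illustration, and the English rule will wrongly split after abbreviations such as "Mr.".

```python
import re

# Split after ASCII sentence-final punctuation followed by whitespace.
_EN_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Chinese text has no spaces, so split directly after full-width punctuation.
# Splitting on a zero-width match requires Python 3.7 or later.
_ZH_BOUNDARY = re.compile(r"(?<=[。！？])")


def split_sentences(text, lang):
    """Naively split text into sentences; lang is "en" or "zh"."""
    pattern = _EN_BOUNDARY if lang == "en" else _ZH_BOUNDARY
    # Drop the empty piece left when the text ends with punctuation.
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]
```

Because the two languages are segmented independently, the resulting segment counts generally differ, which is why a separate sentence alignment step is needed afterwards.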
Sentence alignment was performed on all data using champollion, a parallel text sentence aligner developed at
the LDC. Please see
http://champollion.sourceforge.net for more information about champollion.
Final Data Format and Validation
The Chinese data contains approximately 49 million words, while the
English data contains approximately 59 million words in total and
466K unique words. The following table shows the number of documents,
paragraphs, segments, words and characters for each source.
| Source | #Documents | #Paragraphs (English/Chinese) | #Segments (English/Chinese) | #English Words | #Chinese Characters |
|---|---|---|---|---|---|
| Hong Kong Hansards | 714 | 642,008/632,173 | 1,688,278/1,414,573 | 36,140,737 | 56,618,181 |
| Hong Kong Laws | 42,255 | 423,192/462,283 | 451,884/491,719 | 8,396,243 | 14,868,621 |
| Hong Kong News | 44,621 | 605,183/603,118 | 811,638/775,019 | 14,798,671 | 26,677,514 |
| Total | 87,590 | 1,670,383/1,697,574 | 2,951,800/2,681,311 | 59,335,651 | 98,164,316 |
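The Total row can be cross-checked against the three per-source rows. The sketch below copies the figures from the table above; the tuple layout and the names `COUNTS`, `TOTAL` and `column_sums` are choices made for this example only.

```python
# Columns: documents, English paragraphs, Chinese paragraphs,
# English segments, Chinese segments, English words, Chinese characters.
COUNTS = {
    "Hong Kong Hansards": (714, 642008, 632173, 1688278, 1414573, 36140737, 56618181),
    "Hong Kong Laws": (42255, 423192, 462283, 451884, 491719, 8396243, 14868621),
    "Hong Kong News": (44621, 605183, 603118, 811638, 775019, 14798671, 26677514),
}
TOTAL = (87590, 1670383, 1697574, 2951800, 2681311, 59335651, 98164316)


def column_sums(rows):
    """Sum each column across the given rows."""
    return tuple(sum(column) for column in zip(*rows))


if __name__ == "__main__":
    sums = column_sums(COUNTS.values())
    print("totals match" if sums == TOTAL else f"mismatch: {sums}")
```

The same structure makes it easy to derive further figures, such as the ratio of English to Chinese segments per source.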
Updates
There are no updates available at this time.
Copying and Distribution
Permission is granted to the Linguistic Data Consortium to make and distribute
copies of the laws, press releases and news of Hong Kong Special Administrative
Region provided this copyright notice and permission notices are distributed
with all copies.
Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news
articles from the Hong Kong Special Administrative Region Government website
for research, education, and technology development.
Content Copyright
Portions © 1985-2003 The Government of the Hong Kong Special
Administrative Region, © 2004 Trustees of the University of Pennsylvania
Pricing
The Reduced Licensing Fee for this corpus is US$200.