Introduction
This FTP publication was obtained during January 1999 from the
bilingual website of the Department of Justice of the Hong Kong
Special Administrative Region (HKSAR) of the People's Republic of
China. The retrieved files have been processed and sentence aligned.
LDC wishes to thank the Hong Kong Special Administrative Region of the People's
Republic of China for granting the LDC permission to distribute this data to
the research community.
DATA
This corpus is organized into 19 parallel file pairs for a total of
38 files. Each parallel file pair is named hklaws.nn.[ec] where:
- nn = two-digit sequence number (01 to 19), and
- [ec] = file extension, where c = Cantonese and e = English
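
As a minimal sketch of the naming scheme above, the following Python snippet enumerates the expected file pairs. The function name hklaws_file_pairs is hypothetical and not part of the corpus distribution; the two-digit zero padding is inferred from the names hklaws.01.e and hklaws.19.e mentioned in this document.

```python
def hklaws_file_pairs(num_pairs=19):
    """Return (english, chinese) file-name pairs following hklaws.nn.[ec]."""
    pairs = []
    for n in range(1, num_pairs + 1):
        stem = f"hklaws.{n:02d}"  # two-digit sequence number, e.g. "hklaws.01"
        pairs.append((f"{stem}.e", f"{stem}.c"))
    return pairs


pairs = hklaws_file_pairs()
file_count = sum(len(pair) for pair in pairs)
print(pairs[0], pairs[-1], file_count)
```
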
Each file holds up to 2,000 sequentially numbered sentences tagged with a sentence
index and sequence number as described below for a total of 37,807 sentence
indices across all 19 file pairs. The sentence numbering spans the file
pairs such that the initial sentence index (in files hklaws.01.e and
hklaws.01.c) is "1", and the last sentence index (in files hklaws.19.e and
hklaws.19.c) is "37807". The sentence numbering establishes the sentence
parallelism; two sentences having the same index and sequence number are
purported to be parallel in content.
Each sentence index may contain one or more sequentially numbered sentences,
with corresponding files in English and Chinese containing the corresponding
sets of sentences. Within each sentence index, sequence numbers start at "1". The
sentence sequence number plus the sentence index number is sufficient to
uniquely identify parallel sentences. There are 313,659 sentences in the
corpus.
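
One way to use this identification scheme is to key each sentence by its (sentence index, sequence number) pair and join the two languages on that key. The sketch below assumes the sentences have already been read into dictionaries; the function name pair_sentences and the toy data are hypothetical and are not taken from the corpus.

```python
def pair_sentences(english, chinese):
    """Join two {(index, seq): text} mappings on the keys they share."""
    shared = sorted(english.keys() & chinese.keys())
    return [(key, english[key], chinese[key]) for key in shared]


# Hypothetical placeholder data keyed by (sentence index, sequence number).
english = {(1, 1): "E-1-1", (1, 2): "E-1-2", (2, 1): "E-2-1"}
chinese = {(1, 1): "C-1-1", (1, 2): "C-1-2", (2, 1): "C-2-1", (2, 2): "C-2-2"}

aligned = pair_sentences(english, chinese)
for key, en_text, zh_text in aligned:
    print(key, en_text, zh_text)
```

Keys present on only one side, such as (2, 2) in the toy data, are dropped here; a stricter variant could report them instead, since the alignment was produced automatically.
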
Each sentence is of the form:
...
...
...
...
where "#" represents a one- to five-digit sentence index or sequence number.
Automatic sentence alignment was done at the LDC.
The example.c and example.e files
contain sample corresponding Chinese and English law files from the corpus.
The Chinese files are encoded in BIG5 with user-defined characters by
HKSAR. See http://www.info.gov.hk/gccs for details.
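
A minimal sketch of decoding the Chinese text in Python follows. It assumes that Python's built-in big5hkscs codec is an acceptable stand-in for the HKSAR user-defined characters, which this document does not confirm; bytes that cannot be decoded are replaced with U+FFFD rather than raising an error. The function name decode_chinese is hypothetical.

```python
def decode_chinese(raw_bytes, encoding="big5hkscs"):
    """Decode Big5 bytes; undecodable bytes become U+FFFD (assumed codec)."""
    return raw_bytes.decode(encoding, errors="replace")


# Round-trip a short sample that is not taken from the corpus.
sample = "香港".encode("big5")
decoded = decode_chinese(sample)
print(decoded)
```

A corpus file could be handled the same way by reading it in binary mode and passing the bytes to this function.
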
Copying and distribution
Permission has been granted to the Linguistic Data Consortium to make and
distribute copies of the laws, press releases and news of Hong Kong Special
Administrative Region, provided this copyright notice and permission notice are
distributed with all copies.
Permission has been given to the Linguistic Data Consortium to
reproduce the laws, press releases, and/or news articles from the
Hong Kong Special Administrative Region Government website
for research, education, and technology development.
Updates
There are no updates at this time.
COPYRIGHT
Portions © 1999 The Government of the Hong Kong Special
Administrative Region, © 2000 Trustees of the University of Pennsylvania
Pricing
The Reduced Licensing Fee for this corpus is US$100.