Introduction
This FTP publication was created when the LDC collected parallel
Chinese-English news articles from the Information Services
Department of the Hong Kong Special Administrative Region (HKSAR) of the
People's Republic of China.
LDC wishes to thank the Hong Kong Special Administrative Region of the People's
Republic of China for granting the LDC permission to distribute this data to
the research community.
Data
This corpus contains 18,147 aligned article pairs released by HKSAR from July
1, 1997 to April 30, 2000. Automatic article alignment was done at the
LDC.
The data directory contains 36,294 articles. Each article is a separate file,
so there are 18,147 article pairs. The files are named using the convention
yyyymmdd_nnn.[ce], where
- yyyy = year
- mm = month
- dd = day
- nnn = article date sequence number
- c = Chinese, and e = English.
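The naming convention above can be checked mechanically. The following is a minimal sketch in Python, assuming that nnn is always exactly three digits and that the date fields form a valid calendar date. The names NAME_RE, ArticleName and parse_article_name are hypothetical helpers introduced here for illustration and are not part of the release.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Pattern for the documented convention yyyymmdd_nnn.[ce].
# Assumption: nnn is exactly three digits.
NAME_RE = re.compile(r"(\d{4})(\d{2})(\d{2})_(\d{3})\.([ce])")


class ArticleName(NamedTuple):
    day: date        # yyyy, mm and dd combined
    sequence: int    # nnn, the article date sequence number
    language: str    # "c" or "e", taken from the file extension


def parse_article_name(filename: str) -> Optional[ArticleName]:
    """Split a file name into its parts; return None if it does not fit."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        return None
    year, month, day, sequence, language = match.groups()
    try:
        parsed_day = date(int(year), int(month), int(day))
    except ValueError:
        # The digits are present but do not form a real calendar date.
        return None
    return ArticleName(parsed_day, int(sequence), language)
```

A name that does not follow the convention, such as example.c, yields None, so the same function can be used to filter a directory listing.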
The example.c and example.e files
contain a corresponding pair of sample news articles from the corpus.
The articles were collected by an automated system from the internet. Incoming
data was spooled directly to a "raw collection" file and the raw files were
then processed to produce the following format for release by the LDC.
Table.txt maps the Chinese files (*.c) to the corresponding English files
(*.e).
The Chinese files are encoded in BIG5 with user-defined characters added by
HKSAR.
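Because of the user-defined characters, a strict Big5 decoder may reject some byte sequences in the Chinese files. The sketch below shows one cautious way to read them in Python. It assumes that Python's built-in big5hkscs codec is a reasonable first attempt, which this document does not confirm, and it replaces anything the codec cannot decode with U+FFFD instead of failing. The function names are hypothetical.

```python
from pathlib import Path


def decode_article_bytes(raw: bytes, encoding: str = "big5hkscs") -> str:
    """Decode article bytes, replacing undecodable sequences with U+FFFD.

    Assumption: "big5hkscs" is only a guess at a suitable codec; pass
    encoding="big5" or another codec if it fits the data better.
    """
    return raw.decode(encoding, errors="replace")


def read_chinese_article(path: str, encoding: str = "big5hkscs") -> str:
    """Read one *.c file from disk and decode it as described above."""
    return decode_article_bytes(Path(path).read_bytes(), encoding)


def count_undecoded(text: str) -> int:
    """Count replacement characters left behind by the decoder."""
    return text.count("\ufffd")
```

Counting the replacement characters per file gives a rough indication of how many user-defined characters the chosen codec failed to cover.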
Copying and Distribution
Permission has been granted to the Linguistic Data Consortium to make and distribute
copies of the laws, press releases and news of Hong Kong Special Administrative
Region provided that this copyright notice and the following permission notice are distributed
with all copies.
Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news
articles from the Hong Kong Special Administrative Region Government website
for research, education and technology development.
Updates
There are no updates at this time.
Copyright
Portions © 1997-2000 The Government of the Hong Kong Special
Administrative Region, © 2000 Trustees of the University of Pennsylvania
Pricing
The Reduced Licensing Fee for this corpus is US$100.