Introduction
This corpus is a collection of Korean Press Agency news articles from June 2,
1994 to March 20, 2000. The collection includes articles from the date ranges
listed below. Not all dates in each interval are represented by files or
articles:
Year   Date range            Files       Size
1994   Jun. 2 to Dec. 31        87     8.6 MB
1995   Jan. 1 to Dec. 31       179    16.9 MB
1996   Jan. 1 to Mar. 29        83    10.6 MB
1997   Jul. 28 to Dec. 31      245    48.9 MB
1998   Jan. 2 to Dec. 31       285    64.2 MB
1999   Jan. 3 to Dec. 31       216    56.7 MB
2000   Jan. 3 to Mar. 20        56    13.6 MB
Total                        1,151   219.5 MB
Data
The articles provided here have been collected by means of a continuous feed
from the news provider over a modem connection. Incoming data from the modem
was spooled directly to a "raw collection" file on a daily basis and the raw
files were then processed to produce the format for release by the LDC. There
are approximately 143,137 articles in this corpus, and some of them are
probably duplicates.
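Since duplicates are likely, users may want to filter them before counting or
training on the articles. The following is a minimal sketch, not part of the
LDC release process: it drops exact duplicates after collapsing whitespace. The
function name drop_exact_duplicates is hypothetical, and near-duplicates (for
example, corrected re-issues of a story) would not be caught by this approach.

```python
import hashlib
from typing import Iterable, List


def drop_exact_duplicates(articles: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each article, comparing whitespace-normalized text.

    Assumption: articles are already decoded to Python strings.
    """
    seen = set()
    unique = []
    for article in articles:
        # Collapse runs of whitespace so line-wrapping differences do not matter.
        normalized = " ".join(article.split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```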
We have taken steps to remove articles that were corrupted by failures or noise
in modem transmission. The kinds of corruption that we were able to eliminate
include truncated articles (a valid end-of-article sequence is not observed
before a valid start-of-article) and invalid character codes within the text
segment of articles. Some corruption may have occurred that did not produce
these symptoms (e.g. service interruptions that might cause partial loss of
data within or across articles or corruptions that garble the content but
happen not to produce any invalid character codes). At present we have no
means for detecting these more subtle problems in the data, but we expect that
they are relatively infrequent.
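The two checks described above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the procedure actually used to
build the corpus: the start and end markers (<DOC> and </DOC>) are
hypothetical, since the raw feed's article delimiters are not documented here,
and the character check only tests the byte structure of EUC-KR style KSC-5601
text (ASCII bytes, or two-byte sequences with both bytes in 0xA1-0xFE), not
whether each code point is actually assigned.

```python
from typing import Iterable, Iterator, List, Optional

# Hypothetical markers; the real start/end-of-article sequences are not given.
START = b"<DOC>"
END = b"</DOC>"


def is_valid_euc_kr(data: bytes) -> bool:
    """Structural check: ASCII bytes, or byte pairs with both bytes in 0xA1-0xFE."""
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if byte < 0x80:
            i += 1
        elif 0xA1 <= byte <= 0xFE and i + 1 < n and 0xA1 <= data[i + 1] <= 0xFE:
            i += 2
        else:
            return False
    return True


def complete_articles(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield articles that have both markers and pass the byte-structure check."""
    current: Optional[List[bytes]] = None
    for line in lines:
        stripped = line.strip()
        if stripped == START:
            # A new start while an article is still open means the open
            # article was truncated, so it is silently discarded.
            current = [line]
        elif current is not None:
            current.append(line)
            if stripped == END:
                body = b"".join(current[1:-1])
                if is_valid_euc_kr(body):
                    yield b"".join(current)
                current = None
```

As the text notes, corruption that leaves both markers intact and produces
only valid byte sequences would pass through a filter of this kind undetected.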
The format chosen for release consists of SGML tagging (since this gives a
fairly simple and self-explanatory presentation of the data) and the KSC-5601
Korean character encoding.
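For readers who want to load the files, a minimal sketch follows. It assumes
the KSC-5601 text is stored in the common EUC-KR byte form, which Python
exposes as the "euc_kr" codec, and that each article is wrapped in a <DOC>
element. The tag name and the function read_articles are assumptions made for
illustration; consult the released files for the actual SGML tag set.

```python
import re
from typing import List

# Hypothetical article tag; the actual tag names are not listed in this document.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def read_articles(raw: bytes) -> List[str]:
    """Decode raw file bytes and return the text inside each <DOC> element."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    text = raw.decode("euc_kr", errors="replace")
    return [match.strip() for match in DOC_RE.findall(text)]
```

A file could then be read with open(path, "rb").read() and passed to
read_articles, where path is whatever file name the release uses.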
Updates
There are no updates at this time.
Copyright
Portions Copyright 1994-2000, Korean Press Agency, All Rights Reserved