Introduction
Korean Newswire Second Edition, Linguistic Data Consortium (LDC)
catalog number LDC2010T19 and ISBN 1-58563-564-2, is an archive of
Korean newswire text that LDC acquired from the Korean Press Agency
over several years (1994-2009). This release includes all of the
content of Korean Newswire LDC2000T45 (June 1994-March 2000) as well
as newly collected data.
New in the Second Edition
The second edition adds all data collected by LDC from April
2000 through December 2009.
All material, including that from the first release, has been
converted to UTF-8 (except for more recent data already in UTF-8
format) and processed in LDC's gigaword format. The gigaword format
classifies newswire content into three types: story, multi and other
where "story" refers to an article containing information
pertaining to a particular event on a day; "multi" refers
to an article that contains more than one story relating to
different topics; and "other" refers to articles
containing lists, tables or numerical data, such as sports scores.
A word break error in the original release and in data collected
from January 2002 through February 2005 has been corrected in the
second edition, so all Korean text should display correctly. The
error was a line break in the middle of a word, which split the
affected word into segments on two lines. This problem was resolved
using word histograms and a few common rules based on heuristics
from the data, yielding a 90%-95% word break correction rate.
Further information about the word break correction procedure is
available in Word_Break_Correction_Procedure.txt.
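The histogram idea can be illustrated with a minimal sketch. This is
not the procedure documented in Word_Break_Correction_Procedure.txt;
the function names and the joining rule below are hypothetical and
only show how token counts might be used to decide whether a
line-final and a line-initial fragment belong to one word.

```python
from collections import Counter
from typing import List


def build_histogram(lines: List[str]) -> Counter:
    """Count space-separated tokens across reference text."""
    return Counter(tok for line in lines for tok in line.split())


def should_join(left: str, right: str, hist: Counter) -> bool:
    """Hypothetical rule: join when the merged token is attested and is
    at least as frequent as the rarer of the two fragments."""
    joined = hist[left + right]
    return joined > 0 and joined >= min(hist[left], hist[right])


def rejoin_lines(lines: List[str], hist: Counter) -> List[str]:
    """Move a line-initial fragment back onto the previous line when the
    histogram suggests the two fragments form a single word."""
    tokens = [line.split() for line in lines]
    for i in range(len(tokens) - 1):
        cur, nxt = tokens[i], tokens[i + 1]
        if cur and nxt and should_join(cur[-1], nxt[0], hist):
            cur[-1] += nxt.pop(0)
    return [" ".join(t) for t in tokens]
```

With a histogram built from the single reference line "한국 경제 성장",
rejoin_lines(["한국 경", "제 성장"], hist) returns ["한국 경제", "성장"].
A rule this simple will miss or mis-join some cases, which is
consistent with a correction rate below 100%.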
The following table shows, for each gigaword classification, the
number of documents in the classification (# DOCS), the number of
space-separated word tokens in the text, in thousands (K-WORDS), and
the uncompressed file size in kilobytes (TextKB):
      | # DOCS | K-WORDS | TextKB
story | 217052 |   37546 | 371722
multi |     31 |      21 |    239
other |   7318 |    1034 |   8375
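As a quick check on the figures above, the following sketch sums the
columns and computes the share of documents per classification. The
names COUNTS, totals and share_of_docs are illustrative only; the
numbers are copied from the table.

```python
# (# DOCS, K-WORDS, TextKB) per gigaword classification, from the table.
COUNTS = {
    "story": (217052, 37546, 371722),
    "multi": (31, 21, 239),
    "other": (7318, 1034, 8375),
}


def totals(counts):
    """Column-wise sums: (documents, K-words, text kilobytes)."""
    return tuple(sum(col) for col in zip(*counts.values()))


def share_of_docs(counts, kind):
    """Fraction of all documents that fall into one classification."""
    return counts[kind][0] / totals(counts)[0]
```

Here totals(COUNTS) gives (224401, 38601, 380336), and "story"
accounts for roughly 96.7% of all documents.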
Data
The directory structure of the corpus is as follows:
.
|-common_files
|---docs
|---dtd
|-kor_nw_p1v2
|---data
data: This directory contains the corpus files. Each file contains
data collected during the course of a month. For example, the
filename kpa_kor_199406 contains data collected in June 1994. Each
document in a file has a fixed SGML structure governed by a DTD.
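A small sketch of how the monthly filename convention could be
decoded. The function name is hypothetical, and tolerating a
trailing extension (for example a compression suffix) is an
assumption; the description above only gives the bare name
kpa_kor_199406.

```python
import re
from typing import Tuple

# kpa_kor_yyyymm, optionally followed by an extension (assumed, not documented).
FILENAME_RE = re.compile(r"^kpa_kor_(\d{4})(\d{2})(?:\..*)?$")


def parse_month_file(name: str) -> Tuple[int, int]:
    """Return (year, month) encoded in a monthly corpus filename."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected corpus filename: {name!r}")
    year, month = int(m.group(1)), int(m.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in filename: {name!r}")
    return year, month
```

For the example above, parse_month_file("kpa_kor_199406") returns
(1994, 6).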
The SGML tagging is defined by the DTD; consult it for more
information regarding the SGML structure of a single article.
Not all articles have information in all the tag fields. The DTD
mandates that every article must have a DOC tag and a BODY tag. The
HEADLINE, DATELINE and P tags are optional. Within the article body,
tagging is kept to a minimum, typically consisting only of P tags to
mark paragraph boundaries.
The unique KPA_KOR_yyyymmdd.nnnn string in the DOC tag is
interpreted in the manner described below.
yyyy = Year
mm = Month
dd = Day
nnnn = Sequence Number
For all articles that share the same yyyymmdd docid string, the nnnn
substring ensures that the docid is unique in the corpus.
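The docid layout can be unpacked with a short sketch like the one
below. The names DocId and parse_docid are hypothetical, and the
pattern assumes exactly four digits for the sequence number, as the
nnnn placeholder suggests.

```python
import re
from datetime import date
from typing import NamedTuple

# KPA_KOR_yyyymmdd.nnnn, assuming a four-digit sequence number.
DOCID_RE = re.compile(r"^KPA_KOR_(\d{4})(\d{2})(\d{2})\.(\d{4})$")


class DocId(NamedTuple):
    day: date      # yyyymmdd portion
    sequence: int  # nnnn portion


def parse_docid(docid: str) -> DocId:
    """Split a docid into its date and sequence number."""
    m = DOCID_RE.match(docid)
    if m is None:
        raise ValueError(f"not a KPA_KOR docid: {docid!r}")
    year, month, day, seq = (int(g) for g in m.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(date(year, month, day), seq)
```

For a made-up id such as "KPA_KOR_19940601.0001", parse_docid returns
a DocId with day 1994-06-01 and sequence 1.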
docs: Contains corpus documentation.
dtd: Contains the DTD for the corpus.
Samples
For an example of the data in this corpus, please review this sample file.
Updates
Additional information, updates and bug fixes may be available in
the LDC catalog entry for this corpus at LDC2010T19.
Content Copyright
Portions © 1994-2009 Korean Press Agency, © 2000, 2010
Trustees of the University of Pennsylvania