



Tagged Chinese Gigaword

Item Name: Tagged Chinese Gigaword
Authors: Chu-Ren Huang
LDC Catalog No.: LDC2007T03
ISBN: 1-58563-409-3
Release Date: Jun 20, 2007
Data Type: text
Data Source(s): newswire
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: 1 DVD
Member fee: $0 for 2007 members
Non-member Fee: US$4000.00
Reduced-License Fee: US$2000.00
Extra-Copy Fee: US$200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Chu-Ren Huang. 2007. Tagged Chinese Gigaword. Linguistic Data Consortium, Philadelphia.

Introduction

Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition (LDC2005T14). It contains all of the data in Chinese Gigaword Second Edition (from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao), annotated with full part-of-speech tags.

In order to avoid any problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
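
As a rough illustration of the single-byte versus multi-byte distinction, the sketch below counts the characters of a string by their UTF-8 encoded width. The function name `utf8_width_counts` and the sample string are invented for this example and are not taken from the corpus or its readme.

```python
def utf8_width_counts(text):
    """Count characters by the number of bytes each occupies in UTF-8."""
    counts = {}
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 for ASCII, 3 for most Chinese characters
        counts[width] = counts.get(width, 0) + 1
    return counts


# Illustrative sample only (not corpus data): two Chinese characters,
# one space, and four ASCII letters.
sample = "中文 text"
counts = utf8_width_counts(sample)
print(counts)
```

A width other than 1 or 3 in real corpus text would point to one of the exceptions that the readme file describes.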

Every DOC, from all sources, has been categorized into one of four distinct "types":

  • story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
  • multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
  • advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
  • other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
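
A minimal sketch of tallying documents by type is shown below. It assumes that each document opens with a tag of the form `<DOC id="..." type="story">`; that layout, the helper name `count_doc_types`, and the sample ids are assumptions made for illustration, so the readme file should be consulted for the actual markup.

```python
import re
from collections import Counter

# Assumed markup: a DOC opening tag carrying a type="..." attribute.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="(\w+)"')


def count_doc_types(text):
    """Return a Counter mapping each DOC type to its number of occurrences."""
    return Counter(DOC_OPEN.findall(text))


# Hypothetical sample text, not taken from the corpus.
sample = """<DOC id="EXAMPLE_0001" type="story">
</DOC>
<DOC id="EXAMPLE_0002" type="advis">
</DOC>
<DOC id="EXAMPLE_0003" type="story">
</DOC>"""

type_counts = count_doc_types(sample)
print(type_counts)
```

The same tally restricted to `story` documents would be the usual starting point for work that needs coherent running text.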

Data

    The table below lists, for each source, the number of files, their compressed and uncompressed sizes, the number of words, and the number of documents. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.
    Source    #Files  Rzip-MB  Totl-MB   K-wrds    #DOCs
    CNA_CMN      168      994     7363   792195  1769953
    XIN_CMN      168      615     4535   471110   992261
    ZBN_CMN       10       40      223    28066    41418
    TOTAL        346     1648    12121  1291371  2803632

    The following tables present the number of documents ("#DOCs") and the number of words in thousands ("K-wrds"), divided by source and DOC type:


                     #DOCs   K-wrds
    type="advis":
    CNA_CMN           8160      751
    XIN_CMN           6553      711
    ZBN_CMN              0        0
    TOTAL            14713     1462

    type="multi":
    CNA_CMN          30552    23429
    XIN_CMN          11329     7516
    ZBN_CMN             55       41
    TOTAL            41936    30986

    type="other":
    CNA_CMN         100758    40258
    XIN_CMN          31255     9999
    ZBN_CMN            279      130
    TOTAL           132292    50387

    type="story":
    CNA_CMN        1630483   727748
    XIN_CMN         943132   452878
    ZBN_CMN          41084    27898
    TOTAL          2614691  1208524

    The performance of the CKIP segmentation and POS tagging system was tested in Bakeoff 2005 and Bakeoff 2006.

    The test results are as follows:


                    Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
    Bakeoff 2005     190    116509     116443      112091        96.2           96.3         96.2
    Bakeoff 2006     148     90405      90327       87332        96.6           96.7         96.6

    Note:

    Recall=MatchWord# / RefWord#

    Precision=MatchWord# / TestWord#

    F-Score=2 * Recall * Precision / (Recall + Precision)
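
The three formulas can be checked against the word counts in the Bakeoff rows above. The sketch below is one way to do so; the function name `segmentation_scores` is chosen for this example, and the counts are read from the RefWord#, TestWord#, and MatchWord# columns of the results table.

```python
def segmentation_scores(ref_words, test_words, match_words):
    """Apply the Recall, Precision, and F-Score formulas from the note."""
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# (RefWord#, TestWord#, MatchWord#) as given in the results table.
bakeoff_counts = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

for name, word_counts in bakeoff_counts.items():
    r, p, f = segmentation_scores(*word_counts)
    # Report as percentages with one decimal place, as in the table.
    print(f"{name}: recall={100 * r:.1f} precision={100 * p:.1f} f-score={100 * f:.1f}")
```

Rounded to one decimal place, the computed percentages agree with the Recall, Precision, and F-Score columns reported for both evaluations.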

    Samples

    For an example of the data contained in this corpus, see the screen capture (jpg) of the annotated text.

    Content Copyright

    Portions © 2005-2007 Academia Sinica, © 1991-2004 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.