Introduction
CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium
(LDC) catalog number LDC2008T17 and ISBN 1-58563-485-7, was developed at Lancaster
University, United Kingdom.
LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated
transcripts and a lexicon. CALLHOME
Mandarin Chinese Speech consists of 120 unscripted telephone conversations
between native speakers of Mandarin Chinese. All calls, which lasted up to thirty
minutes, originated in North America and were placed to locations overseas;
most participants called family members or close friends. CALLHOME
Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment
from each of the telephone speech files. The transcripts are in tab-delimited
format with GB2312 encoding, are timestamped by speaker turn for alignment with
the speech signal and are provided in standard orthography. CALLHOME
Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME
Mandarin transcripts.
CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to
this collection, presents the entire original corpus of 120 transcripts in XML
format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging.
The retokenization and POS information were supplied using the Chinese Lexical
Analysis System (ICTCLAS) developed by the Institute
of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims
to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown
word recognition into a single theoretical framework using multi-layered hierarchical
hidden Markov models.
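As a rough illustration of the encoding change described above (GB2312 in the
original transcripts, UTF-8 in the XML version), the following Python sketch
decodes GB2312 bytes and re-encodes them as UTF-8. The function name
gb2312_to_utf8 and the sample string are assumptions made for illustration; this
is not the conversion procedure used to build the corpus.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and return the same text as UTF-8 bytes.

    Hypothetical helper for illustration only; not part of the corpus tooling.
    """
    text = raw.decode("gb2312")   # bytes -> Unicode string
    return text.encode("utf-8")   # Unicode string -> UTF-8 bytes


# Toy sample (not taken from the corpus): two Chinese characters.
sample_gb = "你好".encode("gb2312")
sample_utf8 = gb2312_to_utf8(sample_gb)

# GB2312 uses two bytes per Chinese character; UTF-8 uses three for these.
print(len(sample_gb), len(sample_utf8))
```

Decoding to a Unicode string first, rather than mapping bytes directly, keeps the
sketch independent of the details of either byte encoding.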
In addition to the original applications for Mandarin Chinese CALLHOME data
(e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version
will be useful in the grammatical study of spoken Mandarin.
Data
This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken
features and proper nouns) from the original transcripts release, but the mnemonics
used in the original release were migrated into XML markup following the mapping
rules described below:
All analyses in the original release were retained, at some cost to tokenization
and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features
may split a word, which can affect the tagging accuracy). However, the results
of the automated processing were substantially post-edited. For example, four
aspect markers in Chinese (-le, -guo, -zhe and zai)
were disambiguated and corrected by hand; all of the classifiers (also called
"measure words") were re-tagged using a more fine-grained annotation
scheme developed in the Lancaster University project. In addition, a large number
of obvious typographical errors in the original release were corrected in the
process of post-editing.
Number of unique words: 6,895
Total number of words: 300,767
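The two figures above correspond to word types (unique words) and word tokens
(total words). A minimal Python sketch of how such counts could be derived from
a list of already tokenized words follows; the function name type_token_counts
and the toy input are hypothetical, and the counting procedure actually used for
the corpus is not described here.

```python
from collections import Counter


def type_token_counts(tokens):
    """Return (number of unique words, total number of words).

    Hypothetical helper; assumes `tokens` is an iterable of word strings
    that have already been segmented.
    """
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


# Toy input (not taken from the corpus): three tokens, two distinct words.
toy_tokens = ["我", "是", "我"]
n_types, n_tokens = type_token_counts(toy_tokens)
print(n_types, n_tokens)
```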
Samples
Content Copyright
Portions © 2004-2008 Lancaster University, © 1996, 2008 Trustees of the University of Pennsylvania