Corpus Documentation
This short guide identifies documentation and meta-data that are typically
provided with corpora. This document has been used internally at LDC to
help staff prepare complete and consistent catalog entries and corpus
documentation. A corpus journal serves as an invaluable tool for monitoring
and reviewing the development of corpora. A journal that is kept current
acts as a reference both for the corpus creators and for those who may wish
to develop similar corpora.
Meta-Data
- Title - unique name
- LDC Catalog Number - unique identifier of the form LDCyyyy[STL]nn,
where:
  - yyyy is the year of release
  - [STL] is a single letter giving the data type: S=speech, T=text, L=lexicon
  - nn is a counter that increments with each release
- ISBN - unique ISBN for each corpus
- Data Type - { speech | text | lexicon }
- Data Sources - broadcast, conversation, microphone, mobile-radio,
newswire, parallel, pronunciation, telephone, varied
- Projects - zero or more of the currently 11 projects that have
sponsored data distributed through LDC
- Recommended Applications - one or more of the 18 applications
for which LDC corpora are recommended: discourse analysis, information
retrieval, language identification, language modeling, machine translation,
message understanding, natural language processing, parsing, pronunciation
modeling, prosody, speaker identification, speaker verification, speech
recognition, speech synthesis, spoken dialogue systems, tagging, text
retrieval, topic detection & tracking
- Languages - one or more of the currently 36 languages represented in
LDC corpora: Albanian, Bulgarian, Canadian French, Chinese, Czech, Danish,
Dutch, Egyptian Arabic, English, Estonian, Farsi, French, Gaelic, German,
Greek, Hindi, Italian, Japanese, Korean, Latin, Lithuanian, Malay, Mandarin,
Mandarin Chinese, Norwegian, Portuguese, Russian, Serbian, Spanish,
Swedish, Taiwanese Putonghua, Tamil, Tibetan, Turkish, Uzbek, Vietnamese
- Distribution - either the number of CDs on which the data is distributed,
or FTP
- Membership year(s) - the membership year(s) in which the corpus was released.
The data is free to those who join LDC for these years.
- Non-member price - price in USD for non-members, if the corpus is available
to non-members
- Non-member license - the license non-members must sign to acquire
this data
- Member license - the license members must sign to acquire this
data, normally not applicable.
- ReadMe File - pointer to release notes where available
- Corpus Documentation - pointer to corpus documentation where
available
- Online Retrieval - pointer to web-accessible copy of the data
where available
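The catalog number format in the Meta-Data list can be checked mechanically. The following is a minimal Python sketch, assuming that yyyy is exactly four digits and nn exactly two digits, as the pattern LDCyyyy[STL]nn suggests. The function name parse_catalog_number and the returned field names are illustrative only and are not part of any LDC tool.

```python
import re

# Pattern for LDCyyyy[STL]nn as described in the Meta-Data list.
# Assumption: yyyy is exactly four digits and nn exactly two digits.
CATALOG_RE = re.compile(r"LDC(\d{4})([STL])(\d{2})")

# Data types keyed by the single-letter code in the catalog number.
DATA_TYPES = {"S": "speech", "T": "text", "L": "lexicon"}


def parse_catalog_number(catalog_number):
    """Split a catalog number into year, data type, and release counter.

    Hypothetical helper for illustration; raises ValueError on a mismatch.
    """
    match = CATALOG_RE.fullmatch(catalog_number)
    if match is None:
        raise ValueError(f"not a valid catalog number: {catalog_number!r}")
    year, letter, counter = match.groups()
    return {
        "year": int(year),
        "data_type": DATA_TYPES[letter],
        "release": int(counter),
    }
```

Given a string of the right shape, such as LDC1999S05 (used here only as an example string), the function returns the year 1999, the data type "speech" and the release counter 5. Deriving the Data Type field from the catalog number in this way is one means of keeping the two fields consistent in a catalog entry.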
Overall nature of the corpus
- What is the primary motivation for producing the corpus? What type(s)
of research does it support?
- Was the corpus created by a specific person or organization? Who are
the authors/sponsors?
- Has the data been formatted or annotated for any specific project
or research program?
Licensing Details
- What are the terms under which one accesses the data?
- Is the corpus an update, extension or revision of an earlier corpus?
If so, do users of the earlier version also have rights to the new version?
- Is the corpus part of a sequence? If so, does it complement or replace
the earlier version?
- Is the corpus part of a set? If so, can it be acquired individually
or only along with other components?
Source Data
- What are the sources of the raw data?
- For broadcast news, provide
  - language/variety
  - geographic location
  - provider names
  - original format/encoding
  - collection dates
  - how collected
- For live recordings, provide
  - language/variety
  - geographic location
  - demographic information about subjects (age, gender, region)
  - types of recording devices used
  - collection protocol
- For lexicons, provide
  - sources of individual entries (broadcast transcripts, conversation
    transcripts, newswire, etc.)
  - number of entries
  - definition of fields in each entry
    - orthographic form
    - romanization where appropriate
    - pronunciation
    - part of speech or morphological analysis
    - gloss
    - frequency information
- For speech data, provide audio format, sampling rate and quantization.
- For text data, provide character encoding and markup specification.
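For speech data stored as WAV files, the sampling rate and quantization can be read directly from the file header. The following is a minimal sketch using Python's standard wave module. It assumes uncompressed PCM WAV input; corpora distributed in other audio formats would need a different reader. The function name describe_wav and the dictionary keys are hypothetical.

```python
import wave


def describe_wav(path):
    """Return channel count, sampling rate, and quantization for a PCM WAV file.

    Hypothetical helper; only handles files the standard wave module can open.
    """
    with wave.open(path, "rb") as wav_file:
        return {
            "channels": wav_file.getnchannels(),
            "sampling_rate_hz": wav_file.getframerate(),
            # getsampwidth() reports bytes per sample; convert to bits.
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "duration_s": wav_file.getnframes() / wav_file.getframerate(),
        }
```

Running such a check over every audio file before release is one way to confirm that the values stated in the documentation match the data actually distributed.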
Annotation
- What was the goal of the annotation?
- Who performed the annotation? Were they native speakers? What skill level
or training was required?
- What were the annotators instructed to do?
- What software was available to support the annotation? Can one acquire
this software?
- What was the level of inter-annotator agreement?
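This guide does not prescribe a particular agreement measure. One widely used option for two annotators assigning categorical labels to the same items is Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The following is a minimal Python sketch of that measure; the function name cohen_kappa is illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

For example, if one annotator labels four items N, N, V, V and the other labels them N, V, V, V, the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5. When reporting a figure of this kind, it helps to state which measure was used and how many items were doubly annotated.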
Distribution
- What is the size of the corpus in bytes, characters, words or pieces
of media (CD) as appropriate?
- How is the distribution media/archive organized (directory/file structure)?
- How are the directories and files named?
- What unit is contained in an individual file?
- How is content organized and formatted within a file?
- Is there an index file that indicates how the data is partitioned?
- What kind of software is or was used to access or process the data?
How does one acquire that software?
- Is there documentation available separately?
- Who is the contact for the corpus?
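The size figures and a simple file index asked for in the Distribution list can be generated from the distribution directory itself. The following is a minimal Python sketch under two assumptions that will not hold for every corpus: all files are text in a single known encoding (UTF-8 by default here), and words are whitespace-delimited tokens. The function name corpus_size and the shape of its return value are hypothetical.

```python
import os


def corpus_size(root, encoding="utf-8"):
    """Walk a corpus directory and total its files, bytes, characters, and words.

    Assumptions: every file is text in the given encoding, and words are
    whitespace-delimited tokens. Undecodable bytes are replaced, not rejected.
    """
    totals = {"files": 0, "bytes": 0, "characters": 0, "words": 0}
    index = []  # one (relative path, size in bytes) pair per file
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                raw = handle.read()
            text = raw.decode(encoding, errors="replace")
            totals["files"] += 1
            totals["bytes"] += len(raw)
            totals["characters"] += len(text)
            totals["words"] += len(text.split())
            index.append((os.path.relpath(path, root), len(raw)))
    return totals, index
```

The returned index lists each file with its size relative to the corpus root, which can be written out as the index file mentioned above. For languages written without spaces between words, the word count would need a proper tokenizer in place of the whitespace split.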