

Corpus Documentation

This short guide identifies the documentation and meta-data typically provided with corpora. It has been used internally at LDC to help staff prepare complete and consistent catalog entries and corpus documentation. A corpus journal is an invaluable tool for monitoring and reviewing the development of a corpus. Kept current, it serves as a reference both for the corpus creators and for those who may wish to develop similar corpora.

Meta-Data

  • Title - unique name
  • LDC Catalog Number - unique identifier of form LDCyyyy[STL]nn where:
    • yyyy is the year of release
    • S=speech, T=text, L=lexicon and
    • nn iterates with each release
  • ISBN - unique ISBN for each corpus
  • Data Type - { speech | text | lexicon }
  • Data Sources - broadcast, conversation, microphone, mobile-radio, newswire, parallel, pronunciation, telephone, varied
  • Projects - zero or more of the 11 projects that have, to date, sponsored data distributed through LDC
  • Recommended Applications - one or more of the 18 applications for which LDC corpora are recommended: discourse analysis, information retrieval, language identification, language modeling, machine translation, message understanding, natural language processing, parsing, pronunciation modeling, prosody, speaker identification, speaker verification, speech recognition, speech synthesis, spoken dialogue systems, tagging, text retrieval, topic detection & tracking
  • Languages - one of the 36 languages currently represented in LDC corpora: Albanian, Bulgarian, Canadian French, Chinese, Czech, Danish, Dutch, Egyptian Arabic, English, Estonian, Farsi, French, Gaelic, German, Greek, Hindi, Italian, Japanese, Korean, Latin, Lithuanian, Malay, Mandarin, Mandarin Chinese, Norwegian, Portuguese, Russian, Serbian, Spanish, Swedish, Taiwanese Putonghua, Tamil, Tibetan, Turkish, Uzbek, Vietnamese
  • Distribution - either the number of CDs on which the data is distributed, or FTP
  • Membership year(s) - membership years in which the corpus was released. The data is free to those who join LDC for these years.
  • Non-member price - price in USD for non-members, if the corpus is available to non-members
  • Non-member license - the license non-members must sign to acquire this data
  • Member license - the license members must sign to acquire this data, normally not applicable.
  • ReadMe File - pointer to release notes where available
  • Corpus Documentation - pointer to corpus documentation where available
  • Online Retrieval - pointer to web-accessible copy of the data where available
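
The catalog number form given above, LDCyyyy[STL]nn, can be checked mechanically when preparing catalog entries. The following is a minimal sketch in Python, not an LDC tool: the function and class names are hypothetical, and it assumes that nn is exactly two digits, which the guide implies but does not state.

```python
import re
from typing import NamedTuple

# Pattern for the form LDCyyyy[STL]nn described in the list above.
# Assumption: "nn" means exactly two digits.
CATALOG_RE = re.compile(r"LDC(\d{4})([STL])(\d{2})")

# Letter codes as defined in the guide.
DATA_TYPES = {"S": "speech", "T": "text", "L": "lexicon"}


class CatalogNumber(NamedTuple):
    year: int        # year of release
    data_type: str   # speech, text or lexicon
    sequence: int    # counter that iterates with each release


def parse_catalog_number(value: str) -> CatalogNumber:
    """Split a catalog number into release year, data type and sequence."""
    match = CATALOG_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid catalog number: {value!r}")
    year, letter, seq = match.groups()
    return CatalogNumber(int(year), DATA_TYPES[letter], int(seq))
```

For instance, calling `parse_catalog_number("LDC2002T07")` on this made-up identifier yields the year 2002, the data type "text" and the sequence number 7, while a string that does not follow the form raises `ValueError`.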

 

Overall nature of the corpus

  • What is the primary motivation for producing the corpus? What type(s) of research does it support?
  • Was the corpus created by a specific person or organization? Who are the authors/sponsors?
  • Has the data been formatted or annotated for any specific project or research program?

Licensing Details

  • What are the terms under which one accesses the data?
  • Is the corpus an update, extension or revision of an earlier corpus? If so, do users of the earlier version also have rights to the new version?
  • Is the corpus part of a sequence? If so, does it complement or replace the earlier version?
  • Is the corpus part of a set? If so, can it be acquired individually or only along with other components?

Source Data

  • What are the sources of the raw data?
  • For broadcast news, provide
    • language/variety
    • geographic location
    • provider names
    • original format/encoding
    • collection dates
    • how collected
  • For live recordings, provide
    • language/variety
    • geographic location
    • demographic information about subjects (age, gender, region)
    • types of recording devices used
    • collection protocol
  • For lexicons, provide
    • sources of individual entries (broadcast transcripts, conversation transcripts, newswire, etc.)
    • number of entries
    • definition of fields in each entry
      • orthographic form
      • romanization where appropriate
      • pronunciation
      • part of speech or morphological analysis
      • gloss
      • frequency information
  • For speech data, provide audio format, sampling rate and quantization.
  • For text data, provide character encoding and markup specification.
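
The audio properties requested above for speech data can often be read directly from the files rather than recorded by hand. As a rough sketch, assuming the audio is stored as uncompressed PCM in WAV files (other formats would need a different reader), Python's standard `wave` module reports the sampling rate and quantization. The function name `describe_wav` is hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return the basic audio properties of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sampling_rate_hz": frame_rate,
            # getsampwidth() is in bytes; quantization is usually quoted in bits.
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / frame_rate,
        }
```

Running this over every audio file in a release is one way to confirm that the documented sampling rate and quantization hold for the whole corpus.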

Annotation

  • What was the goal of the annotation?
  • Who performed the annotation? Were they native speakers? What skill level or training was required?
  • What were the annotators instructed to do?
  • What software was available to support the annotation? Can one acquire this software?
  • What was the level of inter-annotator agreement?
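
The guide does not say how inter-annotator agreement should be measured. One widely used statistic for two annotators assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration under that assumption, not a method prescribed by LDC, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Whichever statistic is used, the documentation should name it along with the number of items that were annotated by more than one person.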

Distribution

  • What is the size of the corpus in bytes, characters, words or pieces of media (CD) as appropriate?
  • How is the distribution media/archive organized (directory/file structure)?
  • How are the directories and files named?
  • What unit is contained in an individual file?
  • How is content organized and formatted within a file?
  • Is there an index file that indicates how the data is partitioned?
  • What kind of software is or was used to access or process the data? How does one acquire that software?
  • Is there documentation available separately?
  • Who is the contact for the corpus?
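
Several of the questions above (corpus size in bytes, directory and file structure, an index file) can be answered by walking the release directory. The following is a minimal sketch assuming the corpus is available as an ordinary directory tree. The function names and the choice of a (path, size) listing are illustrative, not an LDC convention.

```python
import os


def build_index(root: str) -> list[tuple[str, int]]:
    """List every file under root as (relative path, size in bytes), sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so the index reads the same on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, os.path.getsize(full)))
    return sorted(entries)


def total_bytes(index: list[tuple[str, int]]) -> int:
    """Sum the file sizes recorded in an index."""
    return sum(size for _path, size in index)
```

The sorted listing can be written out as a simple index file, and `total_bytes` gives the size figure to quote in the documentation.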


Contact ldc@ldc.upenn.edu
Last modified: Tuesday, 22-Oct-2002 17:44:54 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.