Linguistic Resources
Creating Annotated Data

Red italics: notes intended for review of this page before publication. In creating_annotated.shtml they are currently (as of October 31, 2007) set to display:none.

While it would be reasonable to call this page annotation.shtml, we already have a page by that name. Although that page is explicitly "no longer being maintained" and we have no links to it here, Google shows 33 links to it across the web.

These pages document various aspects of the creation of annotated data at the Linguistic Data Consortium. Readers should not view this as an attempt to prescribe methodology for all annotated corpus creation. It is, instead, a presentation of current and recent practice at LDC based on our experiences. Most of these documents are from specific LDC projects past or present. Some, in boldface with an asterisk, are for the general guidance of resource creators.

This page corresponds to section III of Creating Data Resources. Note that these pages are still under construction. We hope that, incomplete though they are, they will encourage feedback from communities of corpus users.

  1. * General Information
  2. Specification and Selection
    1. text and audio
      1. ACE specification (2005)
      2. ACE selection (2005)
      3. GALE specification (2007)
    2. text
      1. LCTL (2006)
      2. GALE distillation (2007)
    3. audio
      1. MIXER specification (TBD) ("to be updated 10/1")
      2. EARS broadcast and telephone (2004)
    4. * lexicon (2002)
  3. Collection:
    1. text
      1. * WWW Capture (2002)
      2. * News Feeds (2002)
          Two of the four external links to initiatives return 404:
        • http://www.stars.com/Internet/Future/xmlnews.html
        • http://www.iptc.org/iptc/
      3. GALE Web data (blogs and newsgroups) (PDF) (2005)
    2. audio
      1. * Wideband (2002)
      2. * Telephone (2002)
      3. GALE Broadcast data (news and talk shows) (PDF)
    3. video and audio
      1. TDT3 Broadcast News (2002)
  4. Transcription (audio/video only):
    1. * Transcription FAQ (1999)
    2. * Philosophy of transcription (2004; written for EARS but generally applicable)
    3. * Transcription Conventions (2004). Links to many guidelines, including those for telephone and broadcast speech in several languages, and for:
      1. meetings (2007)
      2. noisy environments (2000)
    4. * Design Specifications for the Transcription of Spoken Language (2000) [fixed Table of Characters link: http://www.ldc... -> http://projects.ldc... 2007-11-01]
    5. quick transcription (broadcast news and conversation)
      1. GALE quick transcription of Arabic, Chinese, English (PDFs) (2005)
      2. GALE quick rich transcription of Arabic, Chinese, English (PDFs) (2006)
      3. EARS quick transcription (2003)
    6. EARS and GALE careful transcription (PDF, Microsoft Word document) (2004)
    7. Iraqi Arabic telephone (Microsoft Word document) (2004)
  5. Translation:
    1. Multiple Translation (PDF) (2005) Although the examples are for Chinese-English translation, the guidelines were written to be general. [File path on http://projects.ldc.upenn.edu/LCTL/Specifications/ changed to URL, 2007-11-01]
    2. LCTL (PDF) (2006)
    3. GALE Arabic (PDF) (2006)
    4. GALE Chinese (PDF) (2006)
    5. TIDES Arabic (PDF) (2004)
    6. TIDES Chinese (PDF) (2005)
  6. Annotation:
    1. entity
      1. person, organization, location, facility, weapon, vehicle, geo-political entity; event; value; relation
        1. ACE English, Chinese, Arabic, and Spanish (PDFs and Microsoft Word documents)
          toolkit (2005-2006)
      2. personal names and titles, organizations, locations:
        1. LCTL (PDF) (2006)
        2. LCTL Thai (PDF) (2006)
        3. LCTL Tamil (PDF) (2006)
      3. time
        1. LCTL (PDF) (2006)
        2. LCTL Thai (PDF) (2006)
      4. biomedical entities
        1. PennBioIE (2006)
    2. part of speech and syntax
      1. Arabic (2007, in progress) [The links from the GALE Arabic Treebank page are 404.]
      2. GALE English
        1. part of speech guidelines (1995) and addendum (2005)
        2. syntax guidelines (1995) and addendum (2004)
      3. PennBioIE (biomedical)
        1. part of speech (2006)
        2. syntax (2006)
    3. topic detection
      1. TDT-5 (PDF, Microsoft Word document) (2004)
      2. HARD (PDF) (2004)
    4. discourse and syntax
      1. EARS Metadata Extraction (PDF) (2004)
        toolkit
    5. distillation and summarization
      1. GALE distillation (2007)
      2. TIDES summarization (2005)
    6. sociolinguistic variables in speech (audio)
      1. DASL (2001)
          404:
        • coding scheme: http://projects.ldc.upenn.edu/DASL/socannspec.html
        • progress: http://www.ldc.upenn.edu/Projects/DASL/socannprogress.html
        Tools, data and best practices (2003)
    7. gesture
      1. FORM (2003)
          404:
        • inverse kinematics: http://hms.upenn.edu/software/ik/ik.html
        tools (2003)
    8. general
      1. AGTK toolbase [http://projects.ldc.upenn.edu/gale/Tools/ says "available for free download by external users here" but there is no link.]
      2. AGTK: Annotation Graph Toolkit (2007)
  7. Quality Control:
    1. TDT3 text and audio (2002)
    2. * lexicon (2002) [There is an input field that shouldn't be there, from the string "<INPUT>" in the HTML file.]
    3. GALE Broadcast auditing (PDF) (2006)
    4. TIDES translation (2005)
  8. Format Specifications:
    1. Multiple Translation (PDF) (2005) Although the examples are for Chinese-English translation, the guidelines were written to be general. [File path on http://projects.ldc.upenn.edu/LCTL/Specifications/ changed to URL, 2007-11-01]
    2. LCTL (PDF) (2006)
    3. TIDES Arabic -> English translation (PDF, Microsoft Word document) (2005)
    4. TIDES Chinese -> English translation (PDF, Microsoft Word document) (2002)
  9. DTDs:
    1. text
      1. LCTL (2006)
    2. annotation
      1. ACE (2005)
      2. LCTL (2006)
    3. lexicon
      1. LCTL (2006)
  10. Permissions:
    1. American National Corpus (2002)
  11. * Documentation (2002)

Project pages

These are only the LDC projects referenced from this page. For a full list, see the Projects page.
  • ACE (Automatic Content Extraction)
  • ANC (American National Corpus)
  • DASL (Data and Annotations for Sociolinguistics)
  • EARS (Effective, Affordable, Reusable Speech-to-Text)
  • FORM (gesture modeling) [contact info way out of date, tools still there]
  • GALE (Global Autonomous Language Exploitation)
  • HARD (High Accuracy Retrieval from Documents)
  • LCTL (Less Commonly Taught Languages)
  • Mixer (widely varying telephone speech)
  • PennBioIE
  • TDT (Topic Detection and Tracking)
  • TIDES (Translingual Information Detection Extraction and Summarization)

Contact ldc@ldc.upenn.edu
Last modified: Thursday, 01-Nov-2007 10:03:23 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.