Linguistic Resources
Creating Data Resources
These pages document various aspects of corpus creation at the Linguistic Data Consortium. Readers should not view this as an attempt to prescribe methodology for all corpus creation. It is, instead, a presentation of current practice at LDC based on our experiences.
Note that this page is still under construction. We have provided an outline of the sections we plan to add and will fill it in as we are able. We hope these pages -- even the outline -- will encourage feedback from communities of corpus users.
- Background (Dave Graff, Shudong Huang, ed.)
- Audio, video data formats (Dave Graff)
- sampling format (sample size, rate, channels)
- fidelity
- existing formats (.sph, .au, .wav, .riff, .aiff, raw, .ra(m)), needs, platforms
- tools
- compression
- Text Data Formats (see also annotation format specifications and DTDs)
- intro (Steven Bird)
- human-readable vs. machine-readable text (Dave Graff)
- markup languages (SGML, HTML, XML, specialized) (Steven Bird)
- tokenization (Dave Graff)
- character encoding (Kevin Walker)
- character set vs. font
- national standards vs. Unicode vs. proprietary/non-standard
- tools
- compression (Dave Graff)
- tools (Dave Graff)
- Data Collection
- Text (each section includes tools)
- WWW (Kevin Walker)
- News Feeds (Kevin Walker)
- scan (Suzanne D., Christopher Cieri)
- keyboard (Christopher Cieri)
- closed-captioning (Dave Graff)
- QC (Jon Wright)
- Audio (each section includes tools)
- Wideband (Masato Kobayashi)
- Telephone (David Miller)
- Broadcast News (Dave Graff)
- QC (Dave Graff)
- Creating Annotated Data Resources
- corpus specification and selection
- collection
- transcription
- translation
- annotation
- quality control
- format specifications
- DTDs
- permissions
- Building Paradigmatic Data: Lexicons
- corpus specification/structure of entries (Masato Kobayashi, Shudong Huang)
- QC (Masato Kobayashi, Shudong Huang)
- tools (Christopher Cieri, Dave Graff, Steven Bird)
- Management of corpus building effort (Stephanie Strassel)
- Annotation Training
- Consistency
- Multi-annotator projects
- skill set for annotation team
- managing team
- general QC
- Data Dissemination (Masato Kobayashi, Christopher Cieri, ed.)
- Documentation for corpus
- IPR/informed consent (Andy Cole, Shannon Sears)
- Directory/file structure, data schema (Jon Wright, Masato Kobayashi)
- file/directory naming limitations/best practices
- cross platform issues