* Note that the directory naming conventions are different for the CD-ROM distribution. However, the file naming conventions have not changed.
Directory and Filename Structures
All MADCOW data should be organized into the prescribed directory and filename structures as follows:
/<CORPUS>/doc/<DOCFILES>
where,
DOCFILES ::= readme.doc | (optional general information file)
spkrinfo.log | (mandatory speaker information formatted
according to "atis-spkr-info.log")
- OR -
/<CORPUS>/<SPEAKING-MODE>/<SPEAKER>/<SESSION>/<DATA-FILES>
where,
CORPUS ::= atis3
SPEAKING-MODE ::= spon | vspn | read
SPEAKER ::= 001 | ... | zzz (3-character base-36 speaker ID)
SESSION ::= 1 | ... | z (1-character base-36 scenario session ID)
DATA-FILES ::= <XXX><UU><S><M><P>.<TYPE>
where,
XXX ::= 001 | ... | zzz (3-character base-36 speaker ID)
UU ::= 01 | ... | zz (2-char. base-36 within-scenario-session
query ID)
S ::= 1 | ... | z (1-char. base-36 scenario-session ID)
M ::= s | r | c | v (speaking mode:
"s" - spontaneous or
"r" - read version of spontaneous or
"c" - read common or
"v" - voice-only spontaneous)
P ::= s | c | x (microphone:
"s" - Sennheiser,
"c" - Crown,
"x" - pertains to all microphones recorded)
and,
TYPE ::= log | (session log file - special within-scenario-
session query ID of "00" is used in all log
files)
wav | (SPHERE-headered speech waveform file)
sro | ("speech recognizer output" transcription)
lsn | (lexical SNOR transcription derived from .sro)
cat | (query categorization)
win | (wizard input to NLParse)
sql | (SQL query from NLParse to create min (.ref)
answer)
sq2 | (SQL query from NLParse to create max (.rf2)
answer)
ref | (min reference answer from (.sql) SQL query)
rf2 | (max reference answer from (.sq2) SQL query)
squ | (subject questionnaire)
com | (session comment file - special within-scenario-
session query ID of "00" is used in all comment
files)
Note: Although other ATIS file types do exist, only three of the
file types listed above (.log, .wav, .sro) are required as input
from sites contributing initial (unannotated) data. Also note that
some of the file types above (.cat, .win, .sql, .sq2, .ref, and
.rf2) are added by the annotation process. The .lsn files are added
at NIST and are used as input to NL-only systems and for scoring SPREC
results.
example.
b000e1ss.wav
(speaker b00, query 0e, scenario-session 1, spontaneous speaking mode, Sennheiser mic., waveform file)
Note: The MADCOW ATIS3 corpus will be identified by the database ID (corpus ID) "atis3". This ID should appear in the directory structure and in the waveform file headers.
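The filename grammar above can be checked mechanically. The following is a minimal sketch, not part of the MADCOW tooling: the names AtisName, parse_atis_name, corpus_path, and MODE_DIRS are hypothetical, base-36 digits are assumed to be lowercase, and the mapping from the one-letter speaking mode to the SPEAKING-MODE directory (s to spon, v to vspn, r and c to read) is an assumption that the specification does not state explicitly.

```python
import re
from typing import NamedTuple

# File types listed in the specification.
FILE_TYPES = {"log", "wav", "sro", "lsn", "cat", "win",
              "sql", "sq2", "ref", "rf2", "squ", "com"}

# ASSUMPTION: mode letter -> SPEAKING-MODE directory; not stated in the spec.
MODE_DIRS = {"s": "spon", "v": "vspn", "r": "read", "c": "read"}

# <XXX><UU><S><M><P>.<TYPE>; base-36 digits assumed lowercase (0-9, a-z).
_NAME_RE = re.compile(
    r"^([0-9a-z]{3})([0-9a-z]{2})([0-9a-z])([srcv])([scx])\.([0-9a-z]{3})$"
)


class AtisName(NamedTuple):
    speaker: str     # XXX, 3-character base-36 speaker ID
    query: str       # UU, 2-character within-scenario-session query ID
    session: str     # S, 1-character scenario-session ID
    mode: str        # M, speaking mode (s, r, c, or v)
    microphone: str  # P, microphone (s, c, or x)
    file_type: str   # TYPE, file extension


def parse_atis_name(filename: str) -> AtisName:
    """Split a data filename into its fields, or raise ValueError."""
    match = _NAME_RE.match(filename)
    if match is None or match.group(6) not in FILE_TYPES:
        raise ValueError(f"not a valid ATIS3 data filename: {filename!r}")
    name = AtisName(*match.groups())
    # Log and comment files always carry the special query ID "00".
    if name.file_type in ("log", "com") and name.query != "00":
        raise ValueError(f"{name.file_type} file must use query ID '00'")
    return name


def corpus_path(filename: str, corpus: str = "atis3") -> str:
    """Build /<CORPUS>/<SPEAKING-MODE>/<SPEAKER>/<SESSION>/<DATA-FILE>."""
    name = parse_atis_name(filename)
    mode_dir = MODE_DIRS[name.mode]  # assumed mapping, see above
    return f"/{corpus}/{mode_dir}/{name.speaker}/{name.session}/{filename}"


if __name__ == "__main__":
    parsed = parse_atis_name("b000e1ss.wav")
    print(parsed)
    print(int(parsed.query, 36))  # base-36 query ID as an integer
    print(corpus_path("b000e1ss.wav"))
```

Applied to the example name b000e1ss.wav, the parser yields speaker b00, query 0e, session 1, mode s, microphone s, and type wav. The fields are kept as strings because the IDs are identifiers rather than quantities; int(value, 36) converts one when a numeric ordering is needed.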