Human Communication Research Centre, University of Edinburgh & University of Glasgow
Copyright 1992 Human Communication Research Centre
LICENCE: The copyright holder grants to the purchaser of these CD-ROMs unrestricted licence to use all the corpus materials (speech, transcription, maps, tools, documentation) included herein, subject only to the following restrictions: 1) No onward distribution of the corpus materials is allowed -- copies may be made only for use by the purchaser and his/her research group, for ease of use by that group, etc.; 2) The contribution of HCRC is acknowledged in any public presentation or publication of any work based on the corpus.
The HCRC Map Task Corpus carries no warranty of any kind.
Since HCRC continues to use the Corpus in our own research, we welcome contact with colleagues engaged in similar projects. For this reason we ask purchasers to notify us as a matter of courtesy of the topic of their intended work with these materials.
Funding by Economic and Social Research Council, UK
Pre-mastering by the Linguistic Data Consortium, USA
This is CD-ROM 1 of a set of eight. Taken together the full set contains:
All eight CD-ROMs have a common structure.
The top-level directory contains the following files on all eight:
The 0readme file in the src directory contains a number of examples showing how to use the distributed tools to obtain different kinds of information from the corpus.
In addition to the common directories, this CD also contains
q1/
This contains all sampled audio and transcripts for one eighth of the design (see doc/design.sgm for a description of the design), in a directory structure which reflects certain key aspects of the design, as follows:
*For reasons of space, the wordlist recitation sampled audio files for quad 1 are not all on this CD, but are instead found on CD 2, where there is sufficient space.
The displaced wordlists are in their proper place in the directory trees, so that if it were possible to mount all 8 CDs with the same root, the complete directory structure, with q1--q8 all in place, would result.
Note that the top-level file 0dir gives for each file in the corpus the number(s) of the CD(s) on which it appears. In the case of files present on all eight CDs, an asterisk (*) is used.
II. File naming conventions
File names for files associated with talkers (diagnostics, wordlists, information) are constructed to the following model, where [1-8] means a (quad) number between 1 and 8, [en] means e or n for eye-contact or no-eye-contact, [ab] means a or b, [12] means 1 or 2 and [dw]? means d (for diagnostic), w (for wordlist) or nothing (for information):
q[1-8][en]t[ab][12][dw]?
For example, q2etb2d.nst is the NIST header for the diagnostic reading by the 2nd talker of pair b of quad 2, eye-contact condition, and would be found on CD 2 in q2/e/diagnost.
In the cases where a wordlist recitation is split across two files, suffixes 'p' and 'q' are used, e.g. q8nta1wp.ses, q8nta1wq.ses.
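As an illustration only, the talker file naming model above can be expressed as a regular expression. The sketch below is not part of the distributed tools; the names TALKER_RE, TalkerFile and parse_talker_name are hypothetical, and it assumes lower-case file names with an optional extension, as in the examples given here.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for q[1-8][en]t[ab][12][dw]? with an optional
# 'p'/'q' split-wordlist suffix and an optional extension.
# Assumption: names are lower case, as in the examples in this file.
TALKER_RE = re.compile(
    r"^q(?P<quad>[1-8])(?P<eye>[en])t(?P<pair>[ab])(?P<talker>[12])"
    r"(?P<kind>[dw])?(?P<part>[pq])?(?:\.(?P<ext>[a-z0-9]+))?$"
)

KINDS = {"d": "diagnostic", "w": "wordlist", None: "information"}


class TalkerFile(NamedTuple):
    quad: int               # quad number, 1 to 8
    eye_contact: bool       # True for 'e', False for 'n'
    pair: str               # 'a' or 'b'
    talker: int             # 1 or 2
    kind: str               # "diagnostic", "wordlist" or "information"
    part: Optional[str]     # 'p' or 'q' for a split wordlist, else None
    ext: Optional[str]      # file extension without the dot, if any


def parse_talker_name(name: str) -> Optional[TalkerFile]:
    """Return the fields encoded in a talker file name, or None."""
    m = TALKER_RE.match(name)
    if m is None:
        return None
    # The 'p'/'q' suffix is only described for wordlist recitations.
    if m["part"] and m["kind"] != "w":
        return None
    return TalkerFile(
        quad=int(m["quad"]),
        eye_contact=(m["eye"] == "e"),
        pair=m["pair"],
        talker=int(m["talker"]),
        kind=KINDS[m["kind"]],
        part=m["part"],
        ext=m["ext"],
    )
```

Called on q2etb2d.nst, this yields quad 2, eye-contact, pair b, talker 2, kind "diagnostic"; called on q8nta1wp.ses, the part field is 'p'.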
File names for files associated with conversations are constructed to the following model, where [1-8] means a (quad or conversation) number between 1 and 8 and [en] means e or n for eye-contact or no-eye-contact:
q[1-8][en]c[1-8]
For example, q4nc3.ses is the sampled speech for conversation 3 of quad 4, no-eye-contact condition, and would be found on CD 4 in q4/n/c3.
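The conversation file model can be handled in the same way. The following is a minimal sketch, not one of the distributed tools: CONV_RE and conversation_dir are hypothetical names, and the directory layout q<quad>/<e|n>/c<conversation> is inferred from the single example above rather than from a full listing.

```python
import re
from typing import Optional

# Hypothetical pattern for q[1-8][en]c[1-8] with an optional extension.
# Assumption: names are lower case, as in the example in this file.
CONV_RE = re.compile(
    r"^q(?P<quad>[1-8])(?P<eye>[en])c(?P<conv>[1-8])"
    r"(?:\.(?P<ext>[a-z0-9]+))?$"
)


def conversation_dir(name: str) -> Optional[str]:
    """Return the directory expected to hold a conversation file.

    The layout is an assumption based on the example q4nc3.ses,
    which is stated to live in q4/n/c3. Returns None for names
    that do not follow the conversation model.
    """
    m = CONV_RE.match(name)
    if m is None:
        return None
    return "q{}/{}/c{}".format(m["quad"], m["eye"], m["conv"])


if __name__ == "__main__":
    print(conversation_dir("q4nc3.ses"))
```

Since each quad is on the CD of the same number in the examples given, the quad field also indicates which CD to look on; the top-level file 0dir remains the authoritative index.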
Note that as each conversation has an id, and each turn has a number,
to refer to an individual turn in a standard way, use
MTC1:
III. Use of SGML (Standard Generalized Markup Language)
The transcripts, documentation and some of the associated materials
included in this corpus are marked up using SGML, following the draft
guidelines of the Text Encoding Initiative (TEI). We have been quite
scrupulous in observing the guidelines for document headers, where we
have changed very little of what has been distributed by the TEI. In
the body of the transcripts, mindful of the needs of those who will
read them as they stand and/or process them with tools which are not
sensitive to SGML markup, we have had to deviate rather more from TEI
norms. All the files anywhere in the corpus with extension ".sgm" are
SGML-conformant, as validated by version 1.0 of the public domain
UNIX(TM) tool sgmls, which is included herewith in the src directory.
Two different ways of accessing the transcripts as conformant SGML/TEI
documents are provided:
The file doc/editorl.sgm provides detailed information about the
editorial conventions and markup used in the transcripts.
Public entity references are used throughout for external references,
and the script in src/mtei documents the search path which is required
for those references to succeed.
For further information about these issues, see lib/tei/0readme and
the DTD files in the same directory.
Of course, non-SGML-based tools can access the .trn files directly,
either in the top-level trans directory, or at the leaves of the
directory tree. The files contained in each place are identical, and
are provided in duplicate purely for convenience in accessing them in
different ways.
IV. Contacts
The production work on these CDs, as opposed to the corpus itself, was
done by Henry S. Thompson and Miles Bader, HCRC, University of Edinburgh.
Pre-mastering was done by David Graff, LDC, University of Pennsylvania.
The CDs were pressed by Discovery Systems, Dublin, Ohio.
For further information and for notification of use of the corpus as
per the request above, please send electronic mail to
maptask@uk.ac.edinburgh (JANET)
maptask@edinburgh.ac.uk (INTERNET)
or surface mail to
Map Task
Human Communication Research Centre
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW
SCOTLAND
UNIX is a trademark of AT&T Bell Laboratories.
PostScript is a trademark of Adobe Systems Incorporated.