The HCRC Map Task Corpus

Version 1.0

Human Communication Research Centre, University of Edinburgh & University of Glasgow

Copyright 1992 Human Communication Research Centre

LICENCE: The copyright holder grants to the purchaser of these CD-ROMs unrestricted licence to use all the corpus materials (speech, transcription, maps, tools, documentation) included herein, subject only to the following restrictions: 1) No onward distribution of the corpus materials is allowed -- copies may be made only for use by the purchaser and his/her research group, for ease of use by that group, etc.; 2) The contribution of HCRC is acknowledged in any public presentation or publication of any work based on the corpus.

The HCRC Map Task Corpus carries no warranty of any kind.

Since HCRC continues to use the Corpus in our own research, we welcome contact with colleagues engaged in similar projects. For this reason we ask purchasers to notify us as a matter of courtesy of the topic of their intended work with these materials.

Funding by Economic and Social Research Council, UK
Pre-mastering by the Linguistic Data Consortium, USA

This is CD-ROM 1 of a set of eight. Taken together the full set contains:

I. Directory structure and file contents

All eight CD-ROMs have a common structure.

The top-level directory contains the following files on all eight:

0dir
A complete listing of all files, giving the CD on which each can be found
1readme
This file, with the CD number changing from one CD to the next.
maptask.sgm
A TEI Corpus file of the complete set of transcripts. Contains basic documentation about the corpus.

The top-level of each CD contains the following directories in all cases:

doc/
ASCII and/or PostScript(TM) versions of various papers about the corpus: START HERE
etc/
Miscellaneous useful bits and pieces
lib/
Resources for included tools
src/
UNIX(TM) scripts and C sources for useful tools
trans/
A complete set of transcripts
All of these, as well as most of the others listed below, contain further `0readme' files with more detailed descriptions of their contents.

The 0readme in the src directory contains a number of examples of use of the distributed tools to obtain different kinds of information from the corpus.

In addition to the common directories, this CD also contains

q1/

This contains all sampled audio and transcripts for one eighth of the design (see doc/design.sgm for a description of the design), in a directory structure which reflects certain key aspects of the design, as follows:

wordlst1.sgm
The script for the wordlist recitations (see below)
e/
The eye-contact condition
n/
The no-eye-contact condition
maps/
Bitmaps and other information for the maps used here

The e/ and n/ directories have the same structure:

0readme
Includes a brief description of the transcript files
diagnost/
Dialect diagnosis materials: NIST header (.nst), SAM header (.seo), sampled speech (.ses)
talkers/
Information about the talkers
wordlist/
Sampled audio of the wordlist recitations*: NIST header (.nst), SAM header (.seo), sampled speech (.ses)
c1/ ... c8/
Conversations
Each conversation directory has the following files

Some of the wordlist recordings are split across two files, as there were discontinuities at recording time. The recording for one subject, q8nta2, is missing.

*For reasons of space, the wordlist recitation sampled audio files for quad 1 are not all on this CD, but are instead found on CD 2, where there is sufficient space.

The displaced wordlists are in their proper place in the directory trees, so that if it were possible to mount all 8 CDs with the same root, the complete directory structure, with q1--q8 all in place, would result.

Note that the top-level file 0dir gives for each file in the corpus the number(s) of the CD(s) on which it appears. In the case of files present on all eight CDs, an asterisk (*) is used.

II. File naming conventions

File names for files associated with talkers (diagnostics, wordlists, information) are constructed to the following model, where [1-8] means a (quad) number between 1 and 8, [en] means e or n for eye-contact or no-eye-contact, [ab] means a or b, [12] means 1 or 2 and [dw]? means d (for diagnostic), w (for wordlist) or nothing (for information):

q[1-8][en]t[ab][12][dw]?

For example, q2etb2d.nst is the NIST header for the diagnostic reading by the 2nd talker of pair b of quad 2, eye-contact condition, and would be found on CD 2 in q2/e/diagnost.

In the cases where a wordlist recitation is split across two files, suffixes 'p' and 'q' are used, e.g. q8nta1wp.ses, q8nta1wq.ses.
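
As an illustration of the talker file-name model above, the following Python sketch (not one of the distributed tools) parses a name and derives the directory in which the file would be expected. The names parse_talker_name, TalkerFile and KIND_DIR are hypothetical. The mapping of "no letter" to the talkers/ directory is an assumption based on the directory listing in Section I, and the sketch accepts the 'p'/'q' split suffix only after 'w', since only wordlist recitations are described as split.

```python
import re
from typing import NamedTuple, Optional

# Built from the model q[1-8][en]t[ab][12][dw]?, plus the optional 'p'/'q'
# split suffix (accepted only directly after 'w') and an optional extension.
TALKER_RE = re.compile(
    r"q(?P<quad>[1-8])(?P<cond>[en])t(?P<pair>[ab])(?P<talker>[12])"
    r"(?P<kind>[dw])?(?P<part>(?<=w)[pq])?(?:\.(?P<ext>[a-z0-9]+))?"
)

# Assumed mapping from the kind letter to the subdirectory holding the file.
KIND_DIR = {"d": "diagnost", "w": "wordlist", "": "talkers"}


class TalkerFile(NamedTuple):
    quad: int            # 1-8
    condition: str       # 'e' (eye-contact) or 'n' (no-eye-contact)
    pair: str            # 'a' or 'b'
    talker: int          # 1 or 2
    kind: str            # 'd', 'w' or '' (information)
    part: str            # 'p', 'q' or '' (unsplit)
    ext: Optional[str]   # e.g. 'nst', 'seo', 'ses', or None

    def directory(self) -> str:
        """Return the expected directory relative to the common root."""
        return f"q{self.quad}/{self.condition}/{KIND_DIR[self.kind]}"


def parse_talker_name(name: str) -> TalkerFile:
    """Parse a talker-related file name; raise ValueError if it does not fit."""
    m = TALKER_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a talker file name: {name!r}")
    return TalkerFile(
        quad=int(m["quad"]),
        condition=m["cond"],
        pair=m["pair"],
        talker=int(m["talker"]),
        kind=m["kind"] or "",
        part=m["part"] or "",
        ext=m["ext"],
    )
```

For instance, parse_talker_name("q2etb2d.nst").directory() yields q2/e/diagnost, matching the example above.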

File names for files associated with conversations are constructed to the following model, where [1-8] means a (quad or conversation) number between 1 and 8 and [en] means e or n for eye-contact or no-eye-contact:

q[1-8][en]c[1-8]

For example, q4nc3.ses is the sampled speech for conversation 3 of quad 4, no-eye-contact condition, and would be found on CD 4 in q4/n/c3.
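
The conversation file-name model can be checked mechanically in the same way. The Python sketch below is illustrative only; the function name conversation_dir is hypothetical, and the directory layout it produces is inferred from the example path given above.

```python
import re

# Built from the model q[1-8][en]c[1-8], with an optional file extension.
CONV_RE = re.compile(r"q([1-8])([en])c([1-8])(?:\.([a-z0-9]+))?")


def conversation_dir(name: str) -> str:
    """Return the directory (relative to the common root) for a conversation file."""
    m = CONV_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a conversation file name: {name!r}")
    quad, cond, conv, _ext = m.groups()
    return f"q{quad}/{cond}/c{conv}"
```

Calling conversation_dir("q4nc3.ses") gives q4/n/c3.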

Note that, since each conversation has an id and each turn has a number, an individual turn can be referred to in a standard way as MTC1:<conversation id>:<turn number>, e.g. MTC1:q4nc3:32 is the Instruction Follower saying "No. Sorry."
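
A turn reference of this kind is straightforward to split into its parts. The following Python sketch assumes the three colon-separated fields seen in the example (the prefix MTC1, a conversation id following the model above, and a decimal turn number); the names TurnRef, parse_turn_ref and format_turn_ref are hypothetical and not part of the corpus tools.

```python
import re
from typing import NamedTuple


class TurnRef(NamedTuple):
    corpus: str        # always 'MTC1' in this sketch
    conversation: str  # e.g. 'q4nc3'
    turn: int          # turn number within the conversation


# Prefix, conversation id (q[1-8][en]c[1-8]), turn number.
TURN_RE = re.compile(r"(MTC1):(q[1-8][en]c[1-8]):([0-9]+)")


def parse_turn_ref(ref: str) -> TurnRef:
    """Split a reference such as 'MTC1:q4nc3:32' into its components."""
    m = TURN_RE.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a turn reference: {ref!r}")
    return TurnRef(m.group(1), m.group(2), int(m.group(3)))


def format_turn_ref(conversation: str, turn: int) -> str:
    """Build a reference string from a conversation id and a turn number."""
    return f"MTC1:{conversation}:{turn}"
```

Parsing and formatting are inverses here, so format_turn_ref("q4nc3", 32) reproduces MTC1:q4nc3:32.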

III. Use of SGML (Standard Generalized Markup Language)

The transcripts, documentation and some of the associated materials included in this corpus are marked up using SGML, following the draft guidelines of the Text Encoding Initiative (TEI). We have been quite scrupulous in observing the guidelines for document headers, where we have changed very little of what has been distributed by the TEI. In the body of the transcripts, mindful of the needs of those who will read them as they stand and/or process them with tools which are not sensitive to SGML markup, we have had to deviate rather more from TEI norms. All the files anywhere in the corpus with extension ".sgm" are SGML-conformant, as validated by version 1.0 of the public domain UNIX(TM) tool sgmls, which is included herewith in the src directory.

Two different ways of accessing the transcripts as conformant SGML/TEI documents are provided:

  1. Via the top-level corpus file maptask.sgm, which encompasses all 128 transcripts;
  2. Via the individual .sgm files at the leaves of the directory tree, which each embody exactly one transcript.

Of course, non-SGML-based tools can access the .trn files directly, either in the top-level trans directory, or at the leaves of the directory tree. The files contained in each place are identical, and are provided in duplicate purely for convenience in accessing them in different ways.

The file doc/editorl.sgm provides detailed information about the editorial conventions and markup used in the transcripts.

Public entity references are used throughout for external references, and the script in src/mtei documents the search path which is required for those references to succeed.

For further information about these issues, see lib/tei/0readme and the DTD files in the same directory.

IV. Contacts

The production work on these CDs, as opposed to the corpus itself, was done by Henry S. Thompson and Miles Bader, HCRC, University of Edinburgh.

Pre-mastering was done by David Graff, LDC, University of Pennsylvania.

The CDs were pressed by Discovery Systems, Dublin, Ohio.

For further information and for notification of use of the corpus as per the request above, please send electronic mail to

maptask@uk.ac.edinburgh (JANET)
maptask@edinburgh.ac.uk (INTERNET)

or surface mail to

	Map Task
	Human Communication Research Centre
	University of Edinburgh
	2 Buccleuch Place
	Edinburgh EH8 9LW
	SCOTLAND

UNIX is a trademark of AT&T Bell Laboratories.
PostScript is a trademark of Adobe Systems Incorporated.