Gulf Arabic Conversational Telephone Speech, Transcripts is a database containing transcripts of 975 Gulf Arabic speakers taking part in
spontaneous telephone conversations in Colloquial Gulf Arabic. A
total of 976 conversation sides are provided (one speaker appears on
two distinct calls). The average duration per side is about 5.7 minutes.
The data was collected and transcribed in 2004 by Appen Pty Ltd., Sydney, Australia.
Each transcript file is a tab-delimited flat table, where each line
contains information and text for a single contiguous utterance,
presented via the following fields:
- beginning time stamp in seconds, in square brackets ("[5.7189]")
- ending time stamp in seconds, in square brackets
- channel/speaker-ID ("A:" or "B:")
- "consonant skeleton" orthography for the utterance, in UTF-8
- "diacritized" orthography for the utterance, in ASCII
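As a minimal sketch of how one line in this layout might be read, the Python below splits a line on tabs and unpacks the five fields listed above. The names `Utterance` and `parse_line` are hypothetical and not part of the publication; the sketch also assumes every line has exactly five fields and that the channel field is "A:" or "B:", as the field list suggests.

```python
import re
from typing import NamedTuple


class Utterance(NamedTuple):
    """One transcript line (hypothetical container, not part of the corpus)."""
    start: float     # beginning time stamp, in seconds
    end: float       # ending time stamp, in seconds
    channel: str     # "A" or "B", with the trailing colon removed
    skeleton: str    # field 4: "consonant skeleton" orthography (UTF-8)
    vowelized: str   # field 5: "diacritized" orthography (ASCII)


# A time stamp is a decimal number in square brackets, e.g. "[5.7189]".
_TIME = re.compile(r"^\[(\d+(?:\.\d+)?)\]$")


def parse_line(line: str) -> Utterance:
    """Parse one tab-delimited transcript line into an Utterance."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")

    start_m = _TIME.match(fields[0].strip())
    end_m = _TIME.match(fields[1].strip())
    if start_m is None or end_m is None:
        raise ValueError("time stamps must look like [5.7189]")

    channel = fields[2].strip().rstrip(":")
    if channel not in ("A", "B"):
        raise ValueError(f"unexpected channel field: {fields[2]!r}")

    return Utterance(
        start=float(start_m.group(1)),
        end=float(end_m.group(1)),
        channel=channel,
        skeleton=fields[3],
        vowelized=fields[4],
    )
```

When reading a whole file, opening it with `encoding="utf-8"` keeps the Arabic-script field intact. Raising on malformed lines is a design choice for this sketch; a more tolerant reader could log and skip them.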
The ASCII field is the Buckwalter transliteration of the fully
"vowelized" (pronunciation) form of the utterance. Within fields 4
and 5, word boundaries are marked by space characters in the normal
way, following common Arabic orthographic conventions
(e.g. all definite articles and many conjunctions and prepositions are
attached as prefixes to the following word).
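For readers who want to view the ASCII field in Arabic script, the sketch below maps the standard Buckwalter symbols to Unicode code points. It covers only the standard table; whether this corpus uses any additional symbols for dialect-specific sounds is not stated here, so characters outside the table are passed through unchanged. The function name `buckwalter_to_arabic` is hypothetical.

```python
def _build_table() -> dict:
    """Build the standard Buckwalter-to-Unicode mapping."""
    table = {}
    # U+0621..U+063A: the hamza forms, then alef through ghain.
    for offset, ch in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[ch] = chr(0x0621 + offset)
    # U+0640..U+0652: tatweel, feh through yeh, the three tanween marks,
    # the three short vowels, shadda, and sukun.
    for offset, ch in enumerate("_fqklmnhwYyFNKaui~o"):
        table[ch] = chr(0x0640 + offset)
    table["`"] = chr(0x0670)  # superscript (dagger) alef
    table["{"] = chr(0x0671)  # alef wasla
    return table


BUCKWALTER_TO_ARABIC = _build_table()


def buckwalter_to_arabic(text: str) -> str:
    """Convert Buckwalter ASCII to Arabic script, character by character.

    Spaces and any character not in the table are left as they are.
    """
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)
```

Parenthesized material should probably be set aside before conversion, since the letters inside the parentheses would otherwise be mapped as well.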
Transcript tokens enclosed in single parentheses -- e.g. "(DHk)" --
represent annotation marks for non-speech events or conditions, such
as laughter, noise, etc. Multi-token strings within single
parentheses involve words in some other language (typically English)
or some other Arabic dialect.
Double parentheses, either with or without tokens enclosed within them
-- e.g. "(())", "((word))" or "((word1 word2))" -- represent regions
where the transcriber was unable to tell for sure what was said.
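The parenthesis conventions above can be separated mechanically. The following is a minimal sketch, assuming parentheses are never nested beyond the double form and that a single-parenthesis group with more than one token is other-language or other-dialect material, as described above. The function name `classify` and the category labels are hypothetical.

```python
import re

# Try double parentheses first, then single parentheses, then plain words.
_TOKEN = re.compile(
    r"\(\((?P<unsure>[^()]*)\)\)"   # (()), ((word)), ((word1 word2))
    r"|\((?P<paren>[^()]*)\)"       # (DHk) or a multi-token group
    r"|(?P<word>[^\s()]+)"          # ordinary transcript word
)


def classify(text: str) -> list:
    """Split an utterance into (kind, tokens) pairs.

    kind is one of "word", "annotation", "other_language", "uncertain".
    Stray, unbalanced parentheses are silently skipped in this sketch.
    """
    result = []
    for m in _TOKEN.finditer(text):
        if m.group("unsure") is not None:
            # Region the transcriber could not make out; may be empty.
            result.append(("uncertain", m.group("unsure").split()))
        elif m.group("paren") is not None:
            tokens = m.group("paren").split()
            kind = "annotation" if len(tokens) <= 1 else "other_language"
            result.append((kind, tokens))
        else:
            result.append(("word", [m.group("word")]))
    return result
```

Keeping the enclosed tokens, instead of discarding them, lets a downstream user decide whether uncertain regions count as usable text.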
The "consonant skeleton" orthography is intended to reflect common orthographic
practice in written Arabic (i.e. Modern Standard Arabic (MSA)), but without
being bound strictly by the specific spellings of MSA words. That is, there
may be novel (dialect-specific) words and changes of consonant quality (hence
altered spelling) in words that are cognate between MSA and Gulf Arabic.
The "vowelized" orthography is restricted to a character set that
allows words to be rendered coherently in Arabic script (with all
diacritics present as needed to represent short vowels, etc), but is
intended to reflect the perceived pronunciation of each token. As a
result, a given word (type), having multiple occurrences in the text
with identical "skeletal" spellings, may have multiple distinct
"vowelized" spellings. In some cases, these different spellings
simply reflect pronunciation variants, while in other cases, they
represent distinct morphological forms (with distinct contextual
meanings) where the semantic differences are conveyed solely by the
short vowels (i.e. the diacritics).
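One way to inspect this one-to-many relationship is to collect, for each skeletal spelling, the set of vowelized spellings that occur with it. The sketch below assumes that the two orthography fields of an utterance contain the same number of space-separated tokens, in the same order; the description does not guarantee this, so utterances where the counts differ are skipped and counted. The function name `collect_variants` is hypothetical.

```python
from collections import defaultdict


def collect_variants(pairs):
    """Map each skeletal spelling to the set of its vowelized spellings.

    pairs: iterable of (skeleton_text, vowelized_text), one per utterance.
    Returns (variants, skipped), where skipped counts the utterances whose
    two fields could not be aligned token by token.
    """
    variants = defaultdict(set)
    skipped = 0
    for skeleton, vowelized in pairs:
        s_tokens = skeleton.split()
        v_tokens = vowelized.split()
        if len(s_tokens) != len(v_tokens):
            # Assumed alignment does not hold for this utterance.
            skipped += 1
            continue
        for s, v in zip(s_tokens, v_tokens):
            variants[s].add(v)
    return dict(variants), skipped
```

Skeletal spellings that map to more than one vowelized form are then the candidates for either pronunciation variation or distinct morphological forms; telling the two apart still requires looking at context.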
Portions © 2006 Trustees of the University of Pennsylvania