Kilany, Hanaa, H. Gadalla, H. Arram, A. Yacoub, A. El-Habashi,
A. Shalaby, K. Karins, E. Rowson, R. MacIntyre, P. Kingsbury, and
C. McLemore. LDC Egyptian Colloquial Arabic Lexicon. Philadelphia:
Linguistic Data Consortium, University of Pennsylvania.

-----------------------------------------------------------
Description of the LDC Egyptian Colloquial Arabic lexicon
-----------------------------------------------------------

CONTENTS

 1. Summary abstract
 2. Lexicon information fields
 3. Orthographic convention (romanization)
 4. Orthographic convention (Arabic script)
 5. Character/letter correspondence table
 6. Phonology table
 7. Stress information
 8. Morphological tags
 9. Word source and frequency
10. Arabic script/romanization correspondence table

-----------------------------------------------------------------------

1. Summary abstract

The LDC Arabic lexicon was compiled primarily to support the project on
Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by
the U.S. Department of Defense. This lexicon represents the first
electronic pronunciation dictionary of Egyptian Colloquial Arabic (ECA),
the spoken variety of Arabic found in Egypt. The dialect of ECA that
this dictionary represents is Cairene Arabic. This lexicon consists of
16,441 words.

The LDC Arabic lexicon contains tab-separated information fields. For
each word these give the orthographic representation in both the LDC
romanization and Arabic script, together with morphological,
phonological, stress, source, and frequency information.

The lexical entries found in this lexicon come from three sources:
(1) the 80 LVCSR CallHome training transcripts,
(2) the 20 LVCSR CallHome development test (devtest) transcripts, and
(3) entries from the Badawi & Hinds print dictionary of Egyptian
    Colloquial Arabic [Badawi, El-Said and Hinds, Martin (1986)
    "A Dictionary of Egyptian Arabic: Arabic-English". Librairie du
    Liban.]

-----------------------------------------------------------------------

2. Lexicon information fields

The LDC Arabic lexicon contains eight tab-separated information fields:

Field 1: orthographic form (headword) in LDC romanized script
Field 2: orthographic form of the headword in Arabic script
Field 3: pronunciation of the headword
Field 4: primary stress information of the headword
Field 5: morphological analysis of the headword
Field 6: word frequency in training transcripts
Field 7: word frequency in devtest transcripts
Field 8: word occurrence in the Badawi & Hinds print dictionary

In the fields containing pronunciation, stress and morphological
information, alternate forms or analyses are separated by two slashes
"//". Each of these fields is described further in sections 3 - 9
below.

-----------------------------------------------------------------------

3. Orthographic convention (LDC romanization)

The first field in the Arabic lexicon contains the romanized
orthographic representation of the Arabic word. The bulk of the words
found in this lexicon come from the transcripts of the 80 LVCSR Arabic
training and 20 LVCSR Arabic devtest conversations collected and
transcribed at the LDC. The original transcription of the recorded
conversations was done in the romanized version of ECA developed at the
LDC. The romanized orthography of ECA (using ASCII characters) is
phonemically based, and attempts to preserve both word identity and word
pronunciation while limiting ambiguity. More documentation on this can
be found with the released LVCSR CallHome Arabic transcripts.

-----------------------------------------------------------------------

4. Orthographic convention (Arabic script)

The second field in the Arabic lexicon contains the Arabic script
equivalent of the romanized headword from which it is derived. In turn,
the LVCSR Arabic transcripts were converted from the original romanized
script to Arabic script by replacing each word with the orthographic
form found in this lexicon.
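
The replacement step described above can be pictured with a short
sketch. This is only an illustration under stated assumptions, not the
procedure actually used at the LDC: it assumes whitespace-separated
transcript tokens, leaves unknown tokens unchanged, and uses two
made-up entries (kitAb, bEt) that are not taken from the lexicon. The
function names build_script_map and to_arabic_script are hypothetical.

```python
from typing import Dict, Iterable


def build_script_map(lexicon_lines: Iterable[str]) -> Dict[str, str]:
    """Map field 1 (romanized headword) to field 2 (Arabic script form)."""
    script_map: Dict[str, str] = {}
    for line in lexicon_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Keep the first form seen if a headword appears more than once.
        script_map.setdefault(fields[0], fields[1])
    return script_map


def to_arabic_script(text: str, script_map: Dict[str, str]) -> str:
    """Replace each whitespace-separated token with its Arabic script form.

    Assumption: tokens missing from the lexicon are passed through as-is.
    """
    return " ".join(script_map.get(tok, tok) for tok in text.split())


# Hypothetical two-field lines; real entries carry eight fields.
SAMPLE_LINES = ["kitAb\tكتاب", "bEt\tبيت"]

if __name__ == "__main__":
    mapping = build_script_map(SAMPLE_LINES)
    print(to_arabic_script("kitAb bEt", mapping))
```

A dictionary lookup keeps the conversion deterministic: every romanized
headword maps to exactly one Arabic script form, which matches the
one-form-per-entry layout of field 2.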

The Arabic script representations of words in this lexicon were created
using the Arabic character set available in MULE (Multi-Lingual Emacs).
The character correspondences are one-to-one where this is possible
(see the correspondence table in section 10). There are a number of
general instances where the romanized character sequence differs from
the Arabic script character sequence:

1. In verbal forms, the romanized script indicates stem-vowel length
   distinctions which are not found in the Arabic script.

2. Where the romanized script writes (historical) /th/ as the spoken
   /s/ or /t/, and (historical) /dh/ as /z/ or /d/, the Arabic script
   version writes both the /th/ and /dh/ where these are pronounced as
   /s/ and /z/ respectively. This is schematized below:

   MSA:                  s  th    t  th    z  dh    d  dh
                          \ /      \ /      \ /      \ /
   LDC romanization:       s        t        z        d
                          / \       |       / \       |
   LDC ECA script:       s  th      t      z  dh      d

3. The LDC romanized script indicates "doubled" consonants in ECA with
   two orthographic letters. The Arabic script version would be
   expected to indicate consonant quantity or duration with a "shadda".
   Since the "shadda" is unfortunately not currently available in the
   MULE Arabic character set, consonant duration is not indicated in
   the Arabic script.

4. Initial vowel correspondences between the romanized and Arabic
   script versions are as follows:

   a      Alif
   A      Alif with madda
   E/I    Alif and ya
   i      Alif
   O/U    Alif and waw
   u      Alif

Since the glottal stop or "hamza" is usually not pronounced in ECA, we
do not regularly include this character either in the romanization or
in the Arabic script.

-----------------------------------------------------------------------

5. Character/letter correspondence table

Refer to the file "scr2rom.tbl".

-----------------------------------------------------------------------

6. Phonology table

The third field in the lexicon contains pronunciation information for
each headword. The phonetic symbols used are adapted from the
romanization of ECA described in section 3 above.
The symbols used, their phonetic descriptions, and an example word from
Arabic are provided in the table below. This lexicon contains some
alternate pronunciations of words, including the variants of the words
with the morphophonemic marker "tEh marbUta" /B/.

In most words, orthographic /q/ is pronounced as a voiceless glottal
stop in ECA. However, in those somewhat rare instances where it is
pronounced as a voiceless pharyngeal stop, its pronunciation is given
as [Q]. In other cases, the pronunciation is left as [q]. This gives
rise to two phonetic symbols used for the glottal stop: /C/ and /q/.
However, retaining these two symbols in the pronunciation field allows
one to trace the origin of the glottal stop: either a hamza or qAf.

If there is more than one pronunciation of a headword, the alternate
pronunciations are separated by a "//".

Phonology table of the LDC Arabic lexicon

LDC symbol   Phonetic description                      Sample word

C            voiceless glottal stop
b            voiced bilabial stop
t            voiceless dental stop
g            voiced velar stop
H            voiceless pharyngeal fricative
x            voiceless velar fricative
d            voiced dental stop
r            voiced alveolar flap
z            voiced alveolar fricative
s            voiceless alveolar fricative
$            voiceless alveopalatal fricative
S            voiceless alveolar velarized fricative
D            voiced dental velarized stop
T            voiceless dental velarized stop
Z            voiced velarized interdental fricative
c            voiced pharyngeal fricative
G            voiced uvular fricative
f            voiceless labio-dental fricative
q            voiceless glottal stop
Q            voiceless pharyngeal stop
k            voiceless velar stop
l            voiced alveolar lateral
m            voiced bilabial nasal
n            voiced alveolar nasal
h            voiceless glottal fricative
w            voiced bilabial continuant
y            voiced palatal continuant
v            voiced labio-dental fricative
j            voiced alveopalatal affricate
@            low front unrounded vowel
a            low back unrounded vowel
i            high front unrounded vowel
u            high back rounded vowel
%            long @
A            long a
I            long i
O            long back mid rounded vowel
U            long u
E            long front mid unrounded vowel
ay           front upgliding diphthong
aw           back upgliding diphthong

-----------------------------------------------------------------------

7. Stress information

The fourth information field in the lexicon contains information about
the primary word stress in the language. Each syllable of the word is
indicated by a number, with unstressed syllables indicated by "0" and
the stressed syllable indicated by "1". Only one stress per word is
indicated. If there is more than one pronunciation provided for a
headword, there is corresponding stress information, also separated by
"//".

-----------------------------------------------------------------------

8. Morphological tags

The fifth information field of the Arabic lexicon contains
morphological information about the headword. The abbreviations used
are explained below. Different bits of information in this field are
separated by a plus "+". For possessive suffixal, direct and indirect
object endings, parts of the morpheme are separated by a "-" to
distinguish this information from the core word. If there is more than
one possible morphological parse for a given word, the different parses
are separated by two slashes "//". The first entry for any
morphological tag is the base (or traditional "look-up" form) of the
headword.

Part of speech tags:

:adj             adjective
:adv             adverb
:article         definite article
:conj            conjunction
:dem             demonstrative pronoun
:interj          interjection
:interjportion   part of a multi-word interjection
:modal           modal verb
:noun            noun
:num             numeral
:part            particle
:part-itr        interrogative particle
:part-neg        negative particle
:part-voc        vocative particle
:part-int        introductory particle
:pple-act        active participle
:pple-pass       passive participle
:prep            preposition
:pro             pronoun
:prorel          relative pronoun
:vbn             verbal noun
:verb            verb

Morphological attributes:

+1st             first person
+2nd             second person
+3rd             third person
+amb             ambiguous
+article         definite article
+coll            collective
+conj(_prefix)   conjunction prefix (e.g. /fa/)
+DO              direct object
+dual            dual
+elative         elative
+fem             feminine
+fut             future tense
+gen             genitive suffix
+imp             imperfect tense
+inan            inanimate
+inv             invariant
+IO              indirect object
+masc            masculine
+neg             negative marker
+NEG             negative markers on verbs
+nom             nominative suffix
+part            particle not as a separate part of speech
+past            past tense
+plural          plural
+preppre         prepositional prefix (e.g. /li/)
+pres            present tense
+prop            proper name
+sg              singular
+subj            subjunctive mood
+sufxprep        suffixal preposition /l/ (for indirect object)

The morphosyntactic information found in the Arabic lexicon does not
distinguish morphosyntactic analyses for verbs that are "ambiguous" in
their romanization. For example, the following verbal pair (with
disambiguated "=a" and "=h" for the Arabic script equivalent) has the
following morphosyntactic information:

$Ufu=a   $Af:verb+imp+2nd-masc-sg+DO-3rd-masc-sg//$Af:verb+imp+2nd-plural
$Ufu=h   $Af:verb+imp+2nd-masc-sg+DO-3rd-masc-sg//$Af:verb+imp+2nd-plural

These verbs are actually analyzed as:

$Ufu=a   $Af:verb+imp+2nd-plural
$Ufu=h   $Af:verb+imp+2nd-masc-sg+DO-3rd-masc-sg

What the morphosyntactic information refers to is the "ambiguous"
romanized form (without the "=" information):

$Ufu     $Af:verb+imp+2nd-masc-sg+DO-3rd-masc-sg//$Af:verb+imp+2nd-plural

This situation arose because the verbal transducer (and hence the
morphosyntactic information) was created prior to final
"disambiguation" of the verbal forms in romanized script. An ultimate
solution to this problem would be to change the orthographic
conventions developed at the LDC to always account for these
distinctions (even though the forms are homophones), and to make
changes in the verbal transducer accordingly. This is one small example
that illustrates how difficult it is to devise an acceptable
standardized orthography for spoken language.

-----------------------------------------------------------------------

9. Word source and frequency

All word frequency information is based upon the romanized headword
found in the first column of the dictionary.

Training words (field 6): The sixth tab-separated field in the lexicon
contains the frequency of the word in the training transcripts.

Devtest words (field 7): The seventh tab-separated field in the lexicon
contains the frequency of the word in the development test (devtest)
transcripts.

Badawi-Hinds (field 8): The eighth tab-separated field in the lexicon
contains a "1" if the word comes from the Badawi & Hinds dictionary and
not from the training or devtest transcripts. Otherwise, this field
contains a "0".

-----------------------------------------------------------------------

10. Arabic script/romanization correspondence table

The character correspondences between Arabic script and the LDC
romanization of ECA are provided in the table below, along with a
phonetic description of each symbol used. (You will need to use MULE to
view the Arabic script characters in this table.) This table is also
stored in the file "scr2rom.tbl".

LDC correspondence table for Egyptian Colloquial Arabic

Arabic  LDC  Arabic name   Phonetic description

ء       C    hamza         voiceless glottal stop
                           (frequently combined with an adjacent alif,
                           yA, or wAw "chair" or realized as "madda")
ب       b    bA            voiced or voiceless bilabial stop
ت       t    tA            voiceless dental stop
ج       g    gIm           voiced velar stop
ج       j    jIm           voiced alveopalatal affricate
ح       H    HA            voiceless pharyngeal fricative
خ       x    xA            voiceless velar fricative
د       d    dAl           voiced dental stop
ر       r    rA            voiced alveolar flap
ز       z    zEn           voiced alveolar fricative
ذ       z    dhAl          voiced alveolar fricative
س       s    sIn           voiceless alveolar fricative
ث       s    thA           voiceless alveolar fricative
ش       $    $In           voiceless alveopalatal fricative
ص       S    SAD           voiceless alveolar velarized fricative
ض       D    DAD           voiced dental velarized stop
ط       T    Tah           voiceless dental velarized stop
ظ       Z    Zah           voiced velarized interdental fricative
ع       c    cEn           voiced pharyngeal fricative
غ       G    GEn           voiced uvular fricative
ف       f    fA            voiceless labio-dental fricative
ف       v    vi            voiced labio-dental fricative
ق       q    qAf           voiceless pharyngeal stop
ك       k    kAf           voiceless velar stop
ل       l    lAm           voiced alveolar lateral
م       m    mIm           voiced bilabial nasal
ن       n    nUn           voiced alveolar nasal
ه       h    hA            voiceless glottal fricative
و       w    wAw           voiced bilabial continuant
ى/ي     y    yA            voiced palatal continuant
                           (ى connected only on right or unconnected)
                           (ي connected on both sides or left only)
ة       B    tEh marbuta   morphophonemic feminine marker
        a    fatHa         low front unrounded vowel
        i    kasra         high front unrounded vowel
        u    Damma         high back rounded vowel
ا       A    alif          long a
ى/ي     I    yA            long i
و       O    wAw           long back mid rounded vowel
و       U    wAw           long u
ى/ي     E    yA            long front mid unrounded vowel
        ay                 front upgliding diphthong
        aw                 back upgliding diphthong
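
For readers who want to load the lexicon programmatically, the field
layout from sections 2, 7, 8 and 9 can be sketched in Python as follows.
This is a minimal illustration, not an official LDC tool. The names
Entry, parse_entry, parse_morph and parse_stress are hypothetical, and
the sample line is invented: only its morphological analysis is taken
from section 8, while the Arabic script placeholder, the pronunciation,
the stress pattern "10" and the frequencies are assumptions. The sketch
also assumes that the stress field holds one digit per syllable and that
the frequency fields are integers.

```python
from dataclasses import dataclass
from typing import Dict, List

ALT_SEP = "//"  # separator for alternate forms or analyses (section 2)


@dataclass
class Entry:
    romanized: str                    # field 1
    arabic: str                       # field 2
    pronunciations: List[str]         # field 3, split on "//"
    stress: List[List[int]]           # field 4, one pattern per pronunciation
    morphology: List[Dict[str, object]]  # field 5, one dict per parse
    train_freq: int                   # field 6
    devtest_freq: int                 # field 7
    badawi_hinds_only: bool           # field 8 ("1" or "0")


def split_alternates(field: str) -> List[str]:
    """Split a field on the "//" alternates separator."""
    return [alt.strip() for alt in field.split(ALT_SEP)]


def parse_stress(pattern: str) -> List[int]:
    """Assumption: one 0/1 digit per syllable; other characters are ignored."""
    return [int(ch) for ch in pattern if ch in "01"]


def parse_morph(parse: str) -> Dict[str, object]:
    """Split 'base:pos+attr+attr' into base form, part of speech, attributes."""
    base, _, tags = parse.partition(":")
    pos, *attrs = tags.split("+")
    return {"base": base, "pos": pos, "attrs": attrs}


def parse_entry(line: str) -> Entry:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    rom, ara, pron, stress, morph, train, dev, bh = fields
    return Entry(
        romanized=rom,
        arabic=ara,
        pronunciations=split_alternates(pron),
        stress=[parse_stress(s) for s in split_alternates(stress)],
        morphology=[parse_morph(m) for m in split_alternates(morph)],
        train_freq=int(train),
        devtest_freq=int(dev),
        badawi_hinds_only=(bh.strip() == "1"),
    )


# Invented sample line; only the morphology string comes from section 8.
SAMPLE = "\t".join([
    "$Ufu", "ARABIC-FORM", "$Ufu", "10",
    "$Af:verb+imp+2nd-masc-sg+DO-3rd-masc-sg//$Af:verb+imp+2nd-plural",
    "3", "1", "0",
])

if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    for parse in entry.morphology:
        print(parse["base"], parse["pos"], parse["attrs"])
```

Splitting on "//" first and on "+" second follows the order in which the
separators are introduced above. The "-" inside attributes such as
"DO-3rd-masc-sg" is deliberately left unsplit, since several part of
speech tags (for example "pple-act") also contain a hyphen.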