-*- mode: outline -*-

The corpus for Project Santiago, Croatian in two alphabets

* Introduction

The following is a design analysis of the corpus made for Project Santiago, for use with Croatian speakers reading one text in two alphabets, Latin and Cyrillic. Although reference is made to a single language, an enormous set of natural variants occurs within that language.

** Dialect

While there is no single dialect of the spoken language that could safely be considered the most prevalent, the ijekavski dialect has been chosen as representative of a large number of Croatian speakers, with significant overlap into Bosnian and Serbian dialects. The term -ije refers to the manifestation of the -e vowel (formerly jat, in Old Church Slavic). The ekavski dialect is rare within Croatia, but another dialect, the ikavski, prevails, especially along the Dalmatian coast. Speakers of these other dialects will likewise replace the sto of the stokavian dialect with sta; ca or kaj are also possible, especially within the ijekavski variant.

Certain changes in the dialect extend to the consonants, causing palatalization. Pesma in the ekavski dialect would be pjesma in the ijekavski variant. However, except for nj and lj, no palatalized consonant is phonemically recognized. Other palatalizations, such as in pesma/pjesma, have been avoided for the sake of dialectal unity insofar as that is possible.

Tonal registers in the accented vowel do not technically attend the Croatian ijekavski dialect, and vocalic length is not considered phonemic within the same speech group. However, many ijekavski speakers do use tones and length within their vowels. The following corpus is too short to give proper representation of all 36 accented and long vowel combinations in conjunction with all 24 consonants, a total of 864 CV possibilities, exclusive of word-initial vowels.
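
The CV count quoted above is simple arithmetic and can be checked with a short sketch. The Python below is only an illustration: the variable names are assumptions introduced here, the six base vowels and the six realizations per vowel are those given under Croatian phones, and the 24 consonants are represented by their count alone.

```python
from itertools import product

# Six base vowels, as listed under "Croatian phones".
BASE_VOWELS = ["a", "e", "i", "o", "r", "u"]

# Six realizations per vowel: two unstressed, plus short and long
# under each of the two tonal stresses (falling, rising).
REALIZATIONS = [
    ("unstressed", "short"), ("unstressed", "long"),
    ("falling", "short"), ("falling", "long"),
    ("rising", "short"), ("rising", "long"),
]

# Consonant count as quoted in the text; only the number is used here.
N_CONSONANTS = 24

# Every (vowel, realization) combination.
vowel_variants = list(product(BASE_VOWELS, REALIZATIONS))

# Each consonant can precede each vowel variant in a CV syllable.
cv_total = N_CONSONANTS * len(vowel_variants)

print(len(vowel_variants), cv_total)
```

The two printed numbers correspond to the vowel combinations and the CV possibilities mentioned in the text.
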
Since Croatian has retained the open-syllable principle of early Slavic, it was possible to use open syllables for the accented vowel occurrences, which was judged optimal for modelling. Resonants, including liquids, were also avoided at syllable boundaries where possible, to avoid coarticulation.

The Serbian graphic system has a one-to-one correspondence with the Croatian, or Latin, alphabet. The digraphs dzh, lj, and nj are all single elements in Cyrillic. The Latin script can be exchanged for Cyrillic without great modification.

* Croatian phones

** 1.0 Vowels

*** 1.1

There are six base vowels: /a/, /e/, /i/, /o/, /r/, /u/. The /e/ in some dialects alternates with /i/ (ikavski dialect) or /ije/ (ijekavski dialect). Some /e/ sounds remain intact within those dialects, depending on the history of the word.

*** 1.2

This corpus represents the ijekavski dialect. Technically, /ije/ is one long vowel. However, this corpus treats it as two syllables with two vowels, using the standard assignment of rising or falling tone across both vowels. Falling tone is marked on /i/; rising tone is marked on /e/. Both these syllables are invariably short.

*** 1.3

All vowels retain the phonetic quality typical of accented vowels. There are no diphthongs. In the enclitics je and se, the vowels are so short (as little as 50 ms) that it is often difficult to assign qualitative measures to them at all; they sometimes barely exceed VOT (voice onset time), for example. However, purists of the language would not accept that there is qualitative deterioration. In careful speech or emphatic situations, hypercorrection or lengthening occurs in these words, drawing them out to 80-90 ms, the length more typical of an ordinary syllable with a short unstressed vowel.
*** 1.4

All vowels have six possible realizations: two without stress, and two for each of the two tonal stresses.

**** Unstressed
v - short
V - long
**** Stressed
***** Falling Rising
****** Short v V v V
****** Long v V v V

** 2.0 Consonants

*** 2.1

There are eight pairs of consonants by voice:

/ca/ /dj/
/ch/ /dzh/
/f/ /v/
/k/ /g/
/p/ /b/
/s/ /z/
/sh/ /zh/
/t/ /d/

*** 2.2

There are seven unpaired resonants, including two semi-vowels (marked *):

/j/ /l/* /lj/ /m/ /n/ /nj/ /r/*

*** 2.3

There are two unpaired, unvoiced consonants:

/c/ /h/

** 3.0 Spelling conventions and phonetics

Croatian is spelled phonetically, despite the loss of morphemic distinctions:

On je tezhak.
Ona je teshka.

*** 3.1

Assimilation across word boundaries can still take place when words occur together in speech.
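
The tezhak/teshka alternation and the assimilation mentioned in 3.1 can be illustrated with a minimal sketch of regressive voicing assimilation, in which a paired consonant takes on the voicing of the consonant that follows it. The Python below is an assumption-laden illustration, not a description of the project's tooling: the function name, the pre-segmented list-of-phones input, and the "s bratom" example are all introduced here, and the rule is deliberately simplified (every pair listed in 2.1 is treated as taking part, which real speech does not always bear out, notably for /v/).

```python
# Voicing pairs as listed in section 2.1, written (voiceless, voiced)
# in the document's ASCII transcription.
PAIRS = [("ca", "dj"), ("ch", "dzh"), ("f", "v"), ("k", "g"),
         ("p", "b"), ("s", "z"), ("sh", "zh"), ("t", "d")]

# Unpaired unvoiced consonants from section 2.3: in this sketch they
# can trigger devoicing but never change themselves.
UNPAIRED_VOICELESS = {"c", "h"}

TO_VOICED = dict(PAIRS)
TO_VOICELESS = {voiced: voiceless for voiceless, voiced in PAIRS}
VOICELESS = set(TO_VOICED) | UNPAIRED_VOICELESS
VOICED = set(TO_VOICELESS)


def assimilate(phones):
    """Return a copy of `phones` with simplified regressive assimilation.

    `phones` is an already segmented list such as ["t", "e", "zh", "k", "a"].
    The list is scanned right to left so that a change can propagate
    through a longer consonant cluster.
    """
    out = list(phones)
    for i in range(len(out) - 2, -1, -1):
        cur, nxt = out[i], out[i + 1]
        if nxt in VOICELESS and cur in TO_VOICELESS:
            out[i] = TO_VOICELESS[cur]   # devoice before a voiceless consonant
        elif nxt in VOICED and cur in TO_VOICED:
            out[i] = TO_VOICED[cur]      # voice before a voiced consonant
    return out


# Within a word: underlying tezh-ka is spelled as it is pronounced.
feminine = assimilate(["t", "e", "zh", "k", "a"])

# Across a word boundary (hypothetical example): "s" + "bratom".
phrase = assimilate(["s"] + ["b", "r", "a", "t", "o", "m"])
```

Taking pre-segmented phones rather than raw strings is a deliberate choice: it avoids having to decide how digraphs such as sh, zh, and dzh are tokenized, which the text does not specify.
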