Slide 11 of 26
Notes:
In the archive, the speech signal and its transcription are aligned automatically. The precision of the alignment depends on the quality of both the recording and the transcription. The alignment is based on forced speech recognition.
Our system was developed using version 2.0 of Entropic's Hidden Markov Toolkit (HTK) and a grapheme-to-phoneme converter. The input, consisting of orthographic text and sampled speech data, is preprocessed and then passed to the HTK Viterbi decoder, which performs the actual alignment. The grapheme-to-phoneme converter can handle some of the morphological peculiarities of German that influence pronunciation, and it can easily be adapted to other natural languages.
The grapheme-to-phoneme converter consists of 740 context-sensitive pronunciation rules and about 500 rules, formulated as a right-linear regular grammar, that describe morphology. Further information is given in the next section.
HTK requires a regular grammar so that the Viterbi decoder can prune the search space of possible word sequences. For alignment, this grammar is simply the linear concatenation of the phonemes or words of the transcription. As already mentioned above, the only variability comes from pauses between words. The grammar is generated in such a way that it represents these pauses adequately. A language model does not have to be considered.
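To make the idea of a linear grammar with optional pauses concrete, here is a minimal sketch. It is not the grammar generator of our system and does not emit HTK's own grammar file format; the function names, the pause symbol `sil`, and the bracket notation for optional symbols are assumptions chosen for illustration.

```python
def linear_alignment_grammar(words, pause="sil"):
    """Return (symbol, optional) pairs: the words in their fixed order,
    with one optional pause slot between each pair of adjacent words."""
    grammar = []
    for index, word in enumerate(words):
        if index > 0:
            grammar.append((pause, True))   # the pause may or may not occur
        grammar.append((word, False))       # the word itself is mandatory
    return grammar


def to_ebnf(grammar):
    """Render the grammar in an EBNF-like notation; [x] marks x as optional."""
    return " ".join(f"[{sym}]" if optional else sym for sym, optional in grammar)


def expand(grammar):
    """Enumerate every symbol sequence the grammar accepts."""
    sequences = [[]]
    for symbol, optional in grammar:
        with_symbol = [seq + [symbol] for seq in sequences]
        sequences = with_symbol + sequences if optional else with_symbol
    return sequences


grammar = linear_alignment_grammar(["guten", "tag"])
print(to_ebnf(grammar))  # guten [sil] tag
for sequence in expand(grammar):
    print(sequence)
```

Because the word order is fixed, the decoder only has to decide where each symbol starts and ends and whether each pause slot is used, which is why no language model is required.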
Furthermore, Viterbi decoding needs feature vectors that describe the sampled speech data. Our system uses spectral information represented by 12 mel-frequency cepstral coefficients plus overall energy, together with their first and second derivatives, giving a total of 39 parameters per vector. The frames are spaced 10 ms apart and are computed using a Hamming window of 25.6 ms length.
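The arithmetic behind these figures can be checked with a short sketch. It only reproduces the bookkeeping (vector size, number of frames, window shape) and does not compute cepstral coefficients; the function names and the choice to count only complete windows are assumptions.

```python
import math

FRAME_SHIFT_US = 10_000   # 10 ms between successive frames, in microseconds
WINDOW_US = 25_600        # 25.6 ms analysis window, in microseconds
NUM_CEPSTRA = 12          # mel-frequency cepstral coefficients per frame


def vector_dimension(num_cepstra=NUM_CEPSTRA):
    """Static features (cepstra plus energy) with first and second derivatives."""
    static = num_cepstra + 1
    return 3 * static


def frame_count(duration_us, shift_us=FRAME_SHIFT_US, window_us=WINDOW_US):
    """Number of complete windows that fit into a signal of the given length."""
    if duration_us < window_us:
        return 0
    return (duration_us - window_us) // shift_us + 1


def hamming_window(length):
    """Hamming weights for a window of `length` samples (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


print(vector_dimension())      # 39
print(frame_count(1_000_000))  # 98 frames for one second of speech
```

Since the window is longer than the frame shift, consecutive frames overlap by 15.6 ms.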
The HMMs were trained on the ERBA (Erlanger Bahnauskunft) material, which comprises 40 speakers, both male and female, with 100 sentences per speaker.
The training material amounts to six hours of speech, sampled at 16 kHz with a resolution of 16 bits per sample. The phonemes are modeled by context-independent left-to-right HMMs with 3 emitting states, single-mixture Gaussian output probability density functions, and diagonal covariance matrices. The models have no skip transitions, and the model for speech pauses has the same topology.
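The topology described here can be written down as a transition matrix. The sketch below assumes the common convention of adding a non-emitting entry and exit state around the 3 emitting states, and it uses an arbitrary initial self-loop probability of 0.6; both are assumptions for illustration, not values taken from our trained models.

```python
def left_to_right_transitions(num_emitting=3, self_loop=0.6):
    """Transition matrix for a left-to-right HMM without skip transitions.

    State 0 is a non-emitting entry state and the last state is a
    non-emitting exit state (an assumed convention). The self-loop
    probability is a placeholder that training would re-estimate.
    """
    size = num_emitting + 2
    matrix = [[0.0] * size for _ in range(size)]
    matrix[0][1] = 1.0                      # entry always goes to the first emitting state
    for state in range(1, num_emitting + 1):
        matrix[state][state] = self_loop            # stay in the same state
        matrix[state][state + 1] = 1.0 - self_loop  # advance by exactly one state
    return matrix                            # the exit state row stays all zero


def has_skip(matrix):
    """True if any transition jumps over at least one state."""
    return any(prob > 0.0
               for i, row in enumerate(matrix)
               for j, prob in enumerate(row)
               if j > i + 1)


def min_duration_ms(num_emitting=3, frame_shift_ms=10):
    """Without skips, every emitting state consumes at least one frame."""
    return num_emitting * frame_shift_ms


for row in left_to_right_transitions():
    print(row)
print(has_skip(left_to_right_transitions()))  # False
print(min_duration_ms())                      # 30
```

One consequence of forbidding skips is that every aligned phoneme or pause segment spans at least three frames, that is, 30 ms at a 10 ms frame shift.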
Because of the great variety of dialect forms, there is little chance of covering them in a pronunciation dictionary. Furthermore, there is no convention for transcribing them "correctly", and there is a need to transcribe special kinds of utterances without using IPA transcription. Thus, a rule-based phonetization tool has to be part of the aligner in order to be flexible enough to meet these requirements. The tool we developed transcribes words according to context-sensitive pronunciation rules. Most of these rules follow the standard pronunciation of German and can be applied either with or without a morphological segmentation of the words. In general, dialect forms are not morphologically segmented, but so far the approach has worked well enough for the aligner to run on audio files of 20 minutes in length that are spoken and transcribed entirely in dialect. The results of these runs can be integrated into the database without any further correction.
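How context-sensitive rules and an optional morphological segmentation interact can be illustrated with a toy example. The five rules, the phoneme symbols, and the `+` morpheme boundary marker below are all hypothetical and were invented for this sketch; they are not taken from the 740 rules of our converter.

```python
import re

# (grapheme, left-context regex, right-context regex, phoneme symbol).
# Hypothetical toy rules; the first matching rule at a position wins.
RULES = [
    ("sch", "", "", "S"),
    ("ch", "[aou]$", "", "x"),  # "ach" sound directly after a back vowel
    ("ch", "", "", "C"),        # "ich" sound in all other contexts
    ("ei", "", "", "aI"),
    ("ie", "", "", "i:"),
]


def phonetize(word, rules=RULES, boundary="+"):
    """Transcribe a word; `boundary` marks an optional morpheme boundary."""
    word = word.lower()
    phonemes, pos = [], 0
    while pos < len(word):
        if word[pos] == boundary:       # boundaries produce no phoneme
            pos += 1
            continue
        for grapheme, left, right, phoneme in rules:
            if not word.startswith(grapheme, pos):
                continue
            end = pos + len(grapheme)
            if re.search(left, word[:pos]) and re.match(right, word[end:]):
                phonemes.append(phoneme)
                pos = end
                break
        else:
            phonemes.append(word[pos])  # fallback: the letter stands for itself
            pos += 1
    return phonemes


print(phonetize("bach"))      # ['b', 'a', 'x']
print(phonetize("ich"))       # ['i', 'C']
print(phonetize("häschen"))   # ['h', 'ä', 'S', 'e', 'n']
print(phonetize("häs+chen"))  # ['h', 'ä', 's', 'C', 'e', 'n']
```

The last two calls show why segmentation matters: without the boundary, the letters "sch" in "Häschen" are read as one grapheme, whereas the segmented form keeps "s" and "ch" apart. The same rule set runs on unsegmented input, which is the mode used for dialect forms.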
R. Schmidt, R. Neumann: Automatic Text-Speech Alignment: Aspects of Robustification. In: V. Matousek, P. Mautner, J. Ocelíková, P. Sojka (Eds.): Text, Speech and Dialogue, Second International Workshop, TSD '99, Proceedings, Plzen 1999.