CSLU: Spoltech Brazilian Portuguese Version 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006S16 and ISBN 1-58563-383-6, contains microphone speech from a variety of regions in Brazil with phonetic and
orthographic transcriptions. The utterances consist of both
read speech (for phonetic coverage) and responses to questions
(for spontaneous speech). The corpus contains 8,080 separate
utterances from 477 speakers. A total of 2,540 utterances have been
transcribed at the word level (without time alignments), and
5,479 utterances have been transcribed at the phoneme level (with
time alignments). Protocol design, recording and transcription
were performed by the Universidade Federal do Rio Grande do Sul
and the Universidade de Caxias do Sul.
The data were recorded at 44.1 kHz
(mono, 16-bit) and stored in RIFF format. The
microphone was connected directly to a
SoundBlaster-compatible sound card.
For the prompted sentences, the sentence
was hidden from view when recording began, so
that the speaker might utter the sentence
more naturally. Recording quality was verified
immediately after each utterance; the
data-collection software allowed the speaker
to re-record an utterance if its quality
was insufficient. The acoustic environment
was not controlled, in order to allow for
background conditions that would occur in
For an example of the data in this corpus, please listen to this audio sample and examine its transcript.
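The recording format stated above (44.1 kHz, mono, 16-bit, RIFF) can be checked for a given file with a short script. The following is a minimal sketch, not part of the corpus distribution. It assumes the audio files are standard RIFF/WAVE PCM files readable by Python's built-in wave module. The helper names read_format and matches_description are hypothetical, as is any file name passed to them.

```python
import wave

# Expected values taken from the corpus description:
# mono, 16-bit (2 bytes per sample), 44.1 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 44100}


def read_format(path):
    """Return channel count, sample width, sampling rate and duration of a WAVE file."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": frame_rate,
            "duration_s": wav.getnframes() / frame_rate,
        }


def matches_description(info):
    """Check the header values against the format stated in the description."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

For a file from the corpus, matches_description(read_format(path)) would be expected to return True if the assumption about the container format holds. A file that the wave module cannot parse raises wave.Error, which would indicate that the assumption does not hold for that file.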
Portions © 1994-2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2006 Trustees of the University of Pennsylvania