Developers: Bill Grundy, Jared Bernstein, Elizabeth Rosenfeld, Amir Najmi, Psi Mankoski.
Author: Jared Bernstein
Entropic Research Laboratory, Inc. 600 Pennsylvania Ave. SE Washington, DC 20003
Entropic Research Laboratory designed the Latino-40 database to provide a set of recordings for training speaker-independent systems that recognize Latin-American Spanish. The resulting database, called Entropic Latino-40, was recorded in the period from 11 July through 9 September 1994, in Palo Alto, California.
The database comprises about 5000 utterance files. These files include about 125 utterances from each of 40 different speakers, 20 male and 20 female. The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.
The Linguistic Data Consortium provided 13,000 sentences that had been selected (apparently from Latin American newspaper text) by people working at Texas Instruments. No documentation was available on the sentence set, and the sentences include a number of anomalous or ambiguous forms. The sentences are all shorter than 80 characters, and are not grouped into larger constituents like paragraphs or stories.
Each of the 13,000 sentences is identified by its own number, sss1 through sss13000. The set of sentences was divided into 13 distinct subsets of 1000 sentences each, and each successive speaker read from the next subset of 1000 sentences, rotating through the 13 subsets. For each speaker, the first 125 acceptable sentences are included in the Latino-40 database. It was necessary to reject 10 or 15 sentences for many speakers, and as many as 150 for one speaker, in order to find 125 acceptable ones. The following is a sample of 21 sentences from subset 3 that includes the longest sentence (sss3816) among the entire 13,000:
sss3800  Hay problemas muy serios en estos momentos.
sss3801  Se habrán ilustrado varias características.
sss3802  A Venezuela y a Guyana les irá particularmente bien.
sss3803  A su juicio, tal hipótesis era técnicamente inadecuada.
sss3804  No hacemos gestos porque no somos actores, aqadis.
sss3805  Queremos un poco de seguridad para regresar.
sss3806  Este límite máximo se aplica a tres escalas.
sss3807  Si el Consejo acepta eso, no tengo nada que decir.
sss3808  Esto puede lograrse a través de una fórmula legal.
sss3809  No vamos a defraudar al pueblo, dijo Reina.
sss3810  El economista de la firma, no lo cree.
sss3811  Postergan juicios políticos en contra de los ministros.
sss3812  Esa es la parte fácil, dijo el portavoz.
sss3813  También se debe tener en cuenta otro elemento.
sss3814  En poco tiempo se iban a instalar diez más.
sss3815  ¿Es esto un símbolo del dominio japonés de la electrónica?
sss3816  En mil novecientos ochenta y nueve no se delegó ese tipo de autorización.
sss3817  Procura evitar accidentes y vertimientos.
sss3818  Además, se promoverán actividades industriales.
sss3819  Los hombres utilizan ese tiempo para esconder sus armas.
sss3820  Ha abierto los ojos y mueve las manos.
Speakers were all paid volunteers who had been informally solicited in the Palo Alto area. All speakers were adults, ranging in age from 18 to 59 years. All claimed to be native speakers of Latin American Spanish, although one speaker was completely rejected because his accent sounded Brazilian to the person verifying the recordings. Seven speakers were from Peru; five each from Argentina, Colombia, Guatemala, and Nicaragua; three from Venezuela; and two each from Chile, Costa Rica, Cuba, El Salvador, and Mexico.
The speakers are identified with four characters: two letters and two digits. The first letter identifies the country of origin: e.g. 'a' for Argentina, 'p' for Peru, etc., but 'c' for Colombia, 'b' for Cuba, 'h' for Chile, and 'r' for Costa Rica. The second letter identifies the speaker's gender: 'm' for male or 'f' for female. The two-digit number is an arbitrary identifier in the range 01 to 40.
The forty speakers are identified as follows:
id    age  subset  origin                      verifier note
af01  30   5       Santa Cruz, Argentina
af13  27   12      Buenos Aires, Argentina
af14  43   5       Buenos Aires, Argentina
am19  41   10      Buenos Aires, Argentina
am26  28   4       Buenos Aires, Argentina
bm21  30   2       Havana, Cuba
bm22  55   4       Havana, Cuba
cf11  52   2       Cali, Colombia
cf30  34   5       Bogota, Colombia
cm02  23   11      Bogota, Colombia
cm05  40   7       Bogota, Colombia
cm07  37   8       Bogota, Colombia
gf10  30   13      Quetzaltenango, Guatemala
gf18  27   1       Guatemala City, Guatemala   poor reading
gf20  34   12      San Marcos, Guatemala
gf38  30   8       Guatemala, Guatemala
gm06  29   10      San Marcos, Guatemala       119 sentences; poor reading
gm17  18   12      Guatemala City, Guatemala
hf28  43   9       Valparaiso, Chile
hf39  39   11      Vina del Mar, Chile
hm12  59   9       Santiago, Chile
mf27  28   9       D. F., Mexico
mm32  32   9       Durango, Mexico
nf34  23   6       Granada, Nicaragua          slow reading
nf35  29   10      Managua, Nicaragua
nm15  54   6       Managua, Nicaragua
nm23  44   7       Managua, Nicaragua
pf31  39   13      Lima, Peru
pf33  37   2       Lima, Peru                  slow reading
pf37  23   10      Cusco, Peru
pf40  40   3       Lima, Peru                  uvular /rr/
pm03  36   3       Lima, Peru
pm16  57   3       Lima, Peru
pm24  31   4       Lima, Peru                  poor reading
rf29  59   7       San Jose, Costa Rica
rf36  35   11      San Jose, Costa Rica        poor reading
sf09  46   8       San Salvador, El Salvador
sm04  24   6       San Salvador, El Salvador   poor reading
vf08  28   11      Valencia, Venezuela
vm25  33   5       Caracas, Venezuela
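The speaker naming scheme can be decoded mechanically. The following Python sketch is an illustration only and is not part of the distributed database; the function name parse_speaker_id and the returned dictionary layout are hypothetical. The country letters not spelled out in the text ('g', 'm', 'n', 's', 'v') are inferred from the speaker table.

```python
import re

# Country letters: some are stated in the text, the rest are inferred
# from the IDs and origins listed in the speaker table.
COUNTRY_CODES = {
    "a": "Argentina",
    "b": "Cuba",
    "c": "Colombia",
    "g": "Guatemala",
    "h": "Chile",
    "m": "Mexico",
    "n": "Nicaragua",
    "p": "Peru",
    "r": "Costa Rica",
    "s": "El Salvador",
    "v": "Venezuela",
}
GENDER_CODES = {"m": "male", "f": "female"}

# One country letter, one gender letter, two digits.
_SPEAKER_ID = re.compile(r"([a-z])([mf])(\d{2})")


def parse_speaker_id(speaker_id):
    """Split a four-character speaker ID into country, gender, and number."""
    match = _SPEAKER_ID.fullmatch(speaker_id)
    if match is None:
        raise ValueError(f"not a valid speaker ID: {speaker_id!r}")
    country_code, gender_code, digits = match.groups()
    if country_code not in COUNTRY_CODES:
        raise ValueError(f"unknown country letter: {country_code!r}")
    number = int(digits)
    if not 1 <= number <= 40:
        raise ValueError(f"speaker number out of range 01-40: {digits}")
    return {
        "country": COUNTRY_CODES[country_code],
        "gender": GENDER_CODES[gender_code],
        "number": number,
    }
```

For example, parse_speaker_id("rf36") yields Costa Rica, female, speaker number 36.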
Speakers were seated in an upholstered chair facing the console of a Silicon Graphics (SGI) Indy workstation. Each speaker was introduced to the procedure to be followed and signed a consent to participate in the data collection. Speakers were instructed how to control the recording software, how to wear the microphone properly, and how to judge whether or not a particular read rendition would be acceptable.
The speakers wore a Shure SM10A unidirectional head-worn dynamic microphone, and controlled the recording session at their own pace using a recording program designed for the purpose. Control of the recordings was principally accomplished through a "record" button that displayed the text of the Spanish sentence and initiated recording. The recording of a sentence was typically ended by pushing a "record next" button that terminated the recording of the current sentence and then initiated the recording and display of the next sentence. Speakers had access to a full set of other controls that permitted them to play and re-record earlier sentences if they wished, and to move about in the database they were constructing.
After an initial period during which an Entropic supervisor monitored the speaker's reading and recording control, speakers were left to monitor their own reading and recordings.
Speech signals went from the Shure microphone through a Rane MS-1 preamplifier into the 'line input' jack on the SGI Indy workstation. The gains of the Rane preamplifier and the SGI system were set and checked once toward the beginning of the recording session and were left fixed at that level.
The room was a small carpeted office with a floor area of approximately 3.9 m by 2.9 m and a ceiling height of 2.7 m. The room was heated and cooled by forced air that entered via a vent high on the wall above a large cabinet and about 3 m from the subject's head. The room had two doors that were usually left open, and sometimes exposed the microphone to passing conversation or incoming call signals from various nearby telephones.
Except for the carpeted floor, most surfaces in the room were hard and smooth. For example, subjects sat at a table with a plastified hardwood surface; there was a large white board immediately to the subject's right, and the wall behind the computer console was entirely glass.
Physical dimensions: 3.9 m x 2.9 m (floor to ceiling 2.7 m)
[Floor-plan diagram of the recording room, showing the two doors, the glass wall, the cabinet, the table with the SGI console, the seated subject, a file cabinet, and a bookshelf.]
The recordings were verified to be fluent and to correspond to the presented text. Verification was performed by an educated Argentine, primarily by rejecting spoken renditions that did not correspond to the original text. In general, text was not altered to correspond to an acceptable, but variant, spoken token. Some sentences were excluded because of anomalies in the text as presented. A sentence token was considered a fluent reading if it contained all and only the printed words in the correct order (no false starts or repeats) and the words were pronounced in accordance with any accepted Spanish letter-sound values. This leaves some inconsistencies due to dialect differences, but, more importantly, it leaves some foreign words (especially proper names) pronounced with pseudo-English or pseudo-French values.
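The "all and only the printed words in the correct order" part of the fluency criterion can be illustrated in code. The Python sketch below is not how the recordings were verified (that was done by a human listener); it assumes a hypothetical word-level transcript of what the speaker said, and the function names are invented for the example. It cannot check the pronunciation part of the criterion.

```python
import re
import unicodedata


def words(text):
    """Lower-case word tokens, ignoring punctuation (accents preserved)."""
    # NFC normalization keeps accented letters as single characters
    # so that \w matches them as part of the word.
    normalized = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", normalized)


def matches_prompt(prompt_text, transcript):
    """True if the transcript has exactly the prompt's words, in order."""
    return words(prompt_text) == words(transcript)


prompt = "Queremos un poco de seguridad para regresar."
print(matches_prompt(prompt, "queremos un poco de seguridad para regresar"))
print(matches_prompt(prompt, "queremos un un poco de seguridad para regresar"))
```

The first call accepts a clean reading; the second rejects a reading with a repeated word, which the criterion counts as a non-fluent token.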
The raw speech files were processed to delete excessive initial and final silences, using a modified version of the find_ep endpointing program that is part of Entropic's ESPS package. The files are distributed in NIST SPHERE compressed format.
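To give a rough idea of what endpointing does, the Python sketch below trims leading and trailing low-amplitude frames from a list of 16-bit sample values. It is a generic peak-amplitude threshold method, not the find_ep algorithm or Entropic's modified version of it; the function name, the threshold, and the padding are arbitrary assumptions. The frame length of 160 samples corresponds to 10 ms at the 16 kHz sampling rate.

```python
def trim_silence(samples, threshold=500, frame_len=160, pad_frames=5):
    """Drop leading and trailing frames whose peak amplitude is below threshold.

    samples    -- sequence of integer sample values
    threshold  -- peak amplitude at or above which a frame counts as speech
                  (arbitrary value, chosen for illustration)
    frame_len  -- samples per frame (160 samples = 10 ms at 16 kHz)
    pad_frames -- frames of context kept on each side of the speech
    A trailing partial frame is ignored.
    """
    n_frames = len(samples) // frame_len

    def is_loud(index):
        frame = samples[index * frame_len:(index + 1) * frame_len]
        return max(abs(s) for s in frame) >= threshold

    loud = [i for i in range(n_frames) if is_loud(i)]
    if not loud:
        return samples[:0]  # nothing above threshold: return an empty signal
    first = max(loud[0] - pad_frames, 0)
    last = min(loud[-1] + pad_frames, n_frames - 1)
    return samples[first * frame_len:(last + 1) * frame_len]
```

A production endpointer would typically use short-time energy and zero-crossing measures with hysteresis rather than a single fixed threshold.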
File headers are formatted as in the following example: [as printed by SPHERE "h_read"]
database_id latino_40
database_version 1.0
sample_rate 16000
sample_n_bytes 2
sample_sig_bits 16
sample_coding pcm,embedded-shorten-v1.09
channel_count 1
microphone Shure SM-10a
prompt_type printed
recording_site ERL Palo Alto
native_language spanish
geographic_origin Santa Cruz, Argentina
age 30
gender Female
sample_count 76801
prompt_text No habiendo objeciones, así quedó acordado.
sample_max 14030
sample_min -13585
sample_byte_format 10
sample_checksum 64953
speaker_Id af01
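The header fields are enough to derive the duration and uncompressed size of a recording. The Python sketch below assumes the header listing is available as text with one field per line (field name, a space, then the value); it does not read SPHERE files directly, and the function names are hypothetical. The sample values are copied from the example header.

```python
# A few fields copied from the example header, one per line.
HEADER_TEXT = """\
database_id latino_40
sample_rate 16000
sample_n_bytes 2
channel_count 1
geographic_origin Santa Cruz, Argentina
sample_count 76801
speaker_Id af01
"""


def parse_header_listing(text):
    """Map each field name to its value (the rest of the line, as a string)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(" ")
        fields[name] = value.strip()
    return fields


def duration_seconds(fields):
    """Recording length: number of samples divided by samples per second."""
    return int(fields["sample_count"]) / int(fields["sample_rate"])


def uncompressed_bytes(fields):
    """Size of the waveform data once decompressed to plain PCM."""
    return (int(fields["sample_count"])
            * int(fields["sample_n_bytes"])
            * int(fields["channel_count"]))


fields = parse_header_listing(HEADER_TEXT)
print(fields["speaker_Id"], duration_seconds(fields), uncompressed_bytes(fields))
```

For the example header this gives a duration of about 4.8 seconds (76801 samples at 16000 samples per second) and 153602 bytes of uncompressed sample data.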