USC-SFI MALACH Interviews and Transcripts English, LDC Catalog Number LDC2012S05
and ISBN 1-58563-602-9, was developed by The University of Southern California's
Shoah Foundation Institute (USC-SFI), the University of Maryland, IBM and Johns
Hopkins University as part of the MALACH
(Multilingual Access to Large Spoken ArCHives) Project. It contains approximately
375 hours of interviews from 784 interviewees along with transcripts and other
Inspired by his experience making Schindler's List, Steven Spielberg
established the Survivors of the Shoah Visual History Foundation in 1994 to
gather video testimonies from survivors and other witnesses of the Holocaust.
While most of those who gave testimony were Jewish survivors, the Foundation
also interviewed homosexual survivors, Jehovah's Witness survivors, liberators
and liberation witnesses, political prisoners, rescuers and aid providers, Roma
and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes
trials participants. Within several years, the Foundation's Visual History Archive
held nearly 52,000 video testimonies in 32 languages, representing 56 countries;
it is the largest archive of its kind in the world. In 2006, the Foundation
became part of the Dana and David Dornsife College of Letters, Arts and Sciences
at the University of Southern California in Los Angeles and was renamed the
USC Shoah Foundation Institute for Visual History and Education.
The goal of the MALACH project was to develop methods for improved access to
large multinational spoken archives; the focus was advancing the state of the
art of automatic speech recognition (ASR) and information retrieval. The characteristics
of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies,
heavy accents, age-related coarticulations, un-cued speaker and language switching,
and emotional speech -- were considered well-suited for that task. The work
centered on five languages: English, Czech, Russian, Polish and Slovak. USC-SFI
MALACH Interviews and Transcripts English was developed for the English speech
The speech data in this release was collected beginning in 1994 under a wide
variety of conditions ranging from quiet to noisy (e.g., airplane overflights,
wind noise, background conversations and highway noise). Original interviews
were recorded on Sony Beta SP tapes, then digitized into a 3 MB/s MPEG-1 stream
with 128 kb/s (44 kHz) stereo audio. The sound files in this release are compressed
in MP3 format at a sampling frequency of 44.1 kHz.
Approximately 25,000 of all USC-SFI collected interviews are in English and
average approximately 2.5 hours each. The 784 interviews included in this release
are each a 30-minute section of the corresponding larger interview. Due to the
way the original interviews were arranged on the tapes, some interviews are
clipped and have a duration of less than 30 minutes. Certain interviews include
speech from family members in addition to that of the subject and the interviewer.
Accordingly, the corpus contains speech from more than 784 speakers, who are
more or less equally distributed between males and females. The interviews also
include accented speech over a wide range (e.g., Hungarian, Italian, Yiddish,
German and Polish).
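
As a rough consistency check on the figures quoted in this description, the
sketch below compares the nominal audio total (784 sections of 30 minutes)
with the stated total of approximately 375 hours. The variable names are
illustrative, and the 375-hour figure is itself approximate, so the implied
average section length is only an estimate.

```python
# Figures taken from the description; 375 hours is an approximate total.
NUM_INTERVIEWS = 784
NOMINAL_SECTION_MINUTES = 30
STATED_TOTAL_HOURS = 375

# Upper bound on audio if every section ran the full 30 minutes.
max_hours = NUM_INTERVIEWS * NOMINAL_SECTION_MINUTES / 60

# Average section length implied by the stated total.
avg_minutes = STATED_TOTAL_HOURS * 60 / NUM_INTERVIEWS

print(f"nominal maximum: {max_hours:.0f} hours")
print(f"implied average section: {avg_minutes:.1f} minutes")
```

The gap between the nominal maximum and the stated total is consistent with
some sections being clipped to less than 30 minutes.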
This release includes transcripts in trs format of the first 15 minutes of each
interview. The transcripts were created using Transcriber 1.5.1 and later
modified.
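
The trs files written by Transcriber are XML, so speaker turns can be read
with a standard XML parser. The sketch below is a minimal illustration, not a
description of the files in this release: it assumes the usual Transcriber
layout (Speaker elements with id and name, Turn elements with speaker,
startTime and endTime, and Sync markers inside turns), and the sample
document, speaker names and the turns_from_trs function are hypothetical.
Because the transcripts were modified after creation, the actual files may
differ in encoding or structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general Transcriber trs layout (not corpus data).
SAMPLE_TRS = """<Trans version="1" audio_filename="hypothetical_interview">
  <Speakers>
    <Speaker id="spk1" name="interviewer"/>
    <Speaker id="spk2" name="subject"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="7.5">
      <Turn speaker="spk1" startTime="0" endTime="3.2">
        <Sync time="0"/>
        placeholder question text
      </Turn>
      <Turn speaker="spk2" startTime="3.2" endTime="7.5">
        <Sync time="3.2"/>
        placeholder answer text
        <Sync time="5.0"/>
        more placeholder text
      </Turn>
    </Section>
  </Episode>
</Trans>
"""


def turns_from_trs(xml_text):
    """Return one dict per Turn: speaker name, start, end, and joined text."""
    root = ET.fromstring(xml_text)
    # Map speaker ids to display names; fall back to the raw id if unlisted.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    turns = []
    for turn in root.iter("Turn"):
        # itertext() also yields the text that follows each <Sync/> marker.
        text = " ".join(" ".join(turn.itertext()).split())
        speaker_id = turn.get("speaker")
        turns.append({
            "speaker": names.get(speaker_id, speaker_id),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "text": text,
        })
    return turns


for t in turns_from_trs(SAMPLE_TRS):
    print(f'{t["start"]:6.1f}-{t["end"]:6.1f} {t["speaker"]}: {t["text"]}')
```

To read a file from disk instead of the inline sample, ET.parse(path).getroot()
can replace ET.fromstring; a file declaring a non-UTF-8 encoding is best
passed to the parser as bytes rather than as decoded text.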
Portions © 2012 USC Shoah Foundation Institute, © 2012 Trustees
of the University of Pennsylvania