The National Science Foundation funded the Multilingual Access to Large Spoken Archives (MALACH) Project to support the advancement of automatic speech recognition and information retrieval. The collection of spoken material used for this research included 116,000 hours of emotional and highly accented speech from interviews gathered by the Survivors of the Shoah Visual History Foundation (SSVHF).
Steven Spielberg established the SSVHF in 1994 following his work on Schindler’s List. His main goal was to videotape as many first-person accounts of the Holocaust as possible before the stories were lost forever. The collection, now at the University of Southern California under the renamed USC Shoah Foundation Institute (USC-SFI), is the largest archive of its kind in the world and includes over 50,000 interviews of Holocaust survivors and witnesses. USC-SFI’s primary mission is to use the visual history testimonies in educational settings to prevent prejudice, intolerance and bigotry.
To unlock the full potential of the collection, the MALACH Project supported the development of an indexing system to facilitate searching the archive. Speech recognition experiments exploited the material’s disfluencies, heavy accents, age-related coarticulations, language switching and emotional speech.
LDC has two sets of MALACH data in the catalog. In 2012 LDC released USC-SFI MALACH Interviews and Transcripts English. Developed by USC-SFI, the University of Maryland, IBM and Johns Hopkins, this corpus contains approximately 375 hours of speech from 784 interviewees along with transcripts and other documentation.
USC-SFI MALACH Interviews and Transcripts Czech is the most recent MALACH publication in LDC’s Catalog. Developed by USC-SFI and the University of West Bohemia, this release is divided into training and test data and includes 229 hours of speech, with 143 hours transcribed, from 420 interviewees.
The Australian news channel ABC recently aired a short documentary about Language Preservation 2.0: Crowdsourcing Oral Language Documentation using Mobile Devices, a collaboration between LDC and the University of Melbourne to collect stories and oral histories from speakers of endangered languages in Brazil and Papua New Guinea.
With just the click of a button, Steven Bird and his team of researchers are able to record languages in remote locations using a hand-held mobile device running the newly developed Android application Aikuma. Named after the Usarufa word for “meeting”, the app is easy to use for individuals unfamiliar with technology and allows for the recording of the language, translations and speaker metadata.
Bird and his research team are finding that the local speakers enjoy the process of recording and translating their language and agree with the importance of preserving their linguistic and cultural heritage for generations to come.
The ease of use of these recording devices is resulting in a database that Bird hopes will someday serve as an audio Rosetta Stone. LDC provides key technical support through transcription and archiving in order to make the recordings maximally usable and accessible. Eventually the collection will be housed in the Language Commons.
LDC is pleased to announce the student recipients of the Spring 2014 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:
- Skye Anderson ~ Tulane University (USA), BA candidate, Linguistics. Skye has been awarded a copy of LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 for her work in author profiling.
- Hao Liu ~ University College London (UK), PhD candidate, Speech, Hearing and Phonetic Sciences. Hao has been awarded copies of Switchboard-1 Release 2 and NXT Switchboard Annotations for his work in prosody modeling.
LDC's Catalog received another 5-star rating from the Open Language Archives Community (OLAC) for 2013. OLAC determines star ratings based on the number of archives with fresh catalogs (that is, updated within the last 12 months) and the number of archives with five-star metadata (that is, fully conforming to best practices as agreed upon by the community without known data integrity problems). LDC routinely receives 5-star ratings for its Catalog metadata.
Podcasts from the complete set of staff interviews conducted as part of LDC's 20th Anniversary can be accessed from the LDC Blog. Hear what long-time staffers had to say about their experiences at LDC.
Christopher Cieri, Executive Director -- Chris reflects on the road that brought him to LDC, some of his early responsibilities and Consortium activities.
David Graff, Lead Programmer -- Dave was one of LDC's first staff members and offers some insights on LDC's early days.
Yiwola Awoyale, Moussa Bamba, Researchers -- Yiwola and Moussa discuss how they came to LDC, their work on West African languages and how it benefits multiple communities.
Natalia Bragilveskaya, Business Manager; Ilya Ahtaridis, Membership Coordinator; Marian Reed, Marketing Coordinator -- Natalia, Ilya and Marian recall the early days of LDC and the development of its interactions with the University of Pennsylvania, sponsors, members, licensees and collaborators.