Introduction
Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi
University, Istanbul, Turkey and contains approximately 130 hours of Voice
of America (VOA) Turkish radio broadcasts and corresponding transcripts. This
is part of a larger corpus of Turkish broadcast news data collected and transcribed
with the goal of facilitating research in Turkish automatic speech recognition
and its applications, such as speech retrieval.
The VOA material was collected between December 2006 and June 2009 using a
PC and TV/radio card setup. The data collected during the period 2006-2008 was
recorded from analog FM radio; the 2009 broadcasts were recorded from digital
satellite transmissions. A quick manual segmentation and transcription approach
was followed.
Speech recognition and retrieval experiments using the larger corpus are described
in the following journal article: Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak,
and Murat Saraclar, "Turkish Broadcast News Transcription and Retrieval,"
IEEE Transactions on Audio, Speech and Language Processing, 17(5):874-883, July 2009.
For more information please visit
http://busim.ee.boun.edu.tr/~speech
or contact the principal investigator, Murat Saraçlar.
Data
The data was recorded at 32 kHz and resampled to 16 kHz. After screening for
recording quality, the files were segmented, transcribed, and verified.
Segmentation occurred in two steps: an initial automatic segmentation, followed
by manual correction and annotation, which included information such as background
conditions and speaker boundaries.
The transcription guidelines were adapted from the LDC HUB4 and
quick transcription guidelines. An English version of the adapted
guidelines is provided with the data.
The manual segmentations and transcripts were created by native Turkish speakers
at Boğaziçi University using Transcriber.
The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
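Because the transcripts are not UTF-8, tools that assume UTF-8 may garble Turkish letters such as ğ, ş, and ı. The following is a minimal sketch of decoding a transcript and re-saving it as UTF-8, using Python's standard "iso8859_9" codec. The function names and file paths are hypothetical and are not part of the corpus distribution.

```python
from pathlib import Path


def read_latin5_transcript(path):
    """Read a transcript stored as ISO-8859-9 (Latin5) and return a str."""
    # "iso8859_9" is Python's standard codec name for ISO-8859-9.
    return Path(path).read_text(encoding="iso8859_9")


def convert_to_utf8(src, dst):
    """Decode a Latin5 transcript and write it back out as UTF-8.

    Both paths are supplied by the caller; nothing here assumes a
    particular corpus directory layout.
    """
    text = read_latin5_transcript(src)
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

Converting into a separate output file leaves the distributed transcripts untouched in their original encoding.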
Samples
Please follow the links below for samples:
Sponsorship
Funding for this corpus collection effort came from TUBITAK Project 105E102 and Boğaziçi University
Research Fund Project 05HA202.
Updates
None at this time.
Content Copyright
Portions © 2012 Murat Saraçlar, Trustees of the University of Pennsylvania