This data set consists of eight text files containing transcripts for
Voice of America satellite radio news broadcasts in Arabic. The broadcasts were
recorded by the Linguistic Data Consortium at transmission time between June 2000 and
January 2001.
Six broadcasts are 60 minutes long, and two broadcasts are 120
minutes long. The file names indicate the date (YYYYMMDD) and the
begin and end times (HHMM EST) of the original transmission.
This work was sponsored in part by National Science Foundation Grant No. IIS-9982201.
Data
The files are encoded entirely in ASCII; the Arabic text content is
rendered in Buckwalter transliteration. Time
alignment and structural markup are rendered via "pseudo-SGML" tags,
which are presented one tag per line, with the first character of the
line being an open angle bracket.
The lines of transcription text (i.e. the speech and annotation
content between the time-stamp tags) all begin with a single space
character, and present exactly one token per line. A "token" may be
a spoken Arabic word, a punctuation mark, or a single Arabic word
enclosed by "(%" and ")", which represents an annotation of a
non-speech condition or event (e.g. "music", "noise", "laugh", etc.).
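The line layout described above can be read with a very small parser. The following Python sketch is only an illustration, not a tool distributed with the corpus: the function names, the tag name in the sample, and the sample tokens are all hypothetical. It relies only on the stated conventions (tag lines start with "<", token lines start with one space, non-speech annotations are wrapped in "(%" and ")").

```python
import re

# A non-speech annotation token: one word wrapped in "(%" and ")".
NON_SPEECH = re.compile(r"^\(%(.+)\)$")


def classify_line(line):
    """Return a (kind, value) pair for one line of a transcript file."""
    line = line.rstrip("\r\n")
    # Only the FIRST character decides whether a line is a tag: Buckwalter
    # transliteration itself uses "<" and ">" as letters inside tokens.
    if line.startswith("<"):
        return ("tag", line)
    if line.startswith(" "):
        token = line[1:]  # drop the single leading space
        match = NON_SPEECH.match(token)
        if match:
            return ("non_speech", match.group(1))
        return ("token", token)
    # Anything else is not covered by the description; keep it visible.
    return ("other", line)


def parse_transcript(lines):
    """Classify every non-blank line of a transcript."""
    return [classify_line(line) for line in lines if line.strip()]


# Made-up sample lines; the tag name and tokens are NOT taken from the corpus.
sample = [
    "<hypothetical_tag>\n",
    " mrHbA\n",
    " .\n",
    " (%mwsyqY)\n",
]

for kind, value in parse_transcript(sample):
    print(kind, value)
```

Because each token sits on its own line, word and punctuation tokens can be counted or filtered without any further tokenization; the "other" category is there only to surface lines that do not fit the description.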
Samples
For an example of the data contained in this corpus, please examine this screenshot of the transcription.
Content Copyright
Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania