


Arabic Broadcast News Transcripts

Item Name: Arabic Broadcast News Transcripts
Authors: Mohamed Maamouri, David Graff, Christopher Cieri
LDC Catalog No.: LDC2006T20
ISBN: 1-58563-420-4
Release Date: Dec 19, 2006
Data Type: text
Data Source(s): broadcast news
Application(s): machine learning, machine translation
Language(s): Modern Standard Arabic
Language ID(s): arb
Distribution: Web Download
Member fee: $0 for 2006 members
Non-member Fee: US $400.00
Reduced-License Fee: US $200.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Mohamed Maamouri, David Graff, Christopher Cieri. 2006. Arabic Broadcast News Transcripts. Linguistic Data Consortium, Philadelphia.

This data set consists of eight text files containing transcripts for Voice of America satellite radio news broadcasts in Arabic. The broadcasts were recorded by the Linguistic Data Consortium at transmission time between June 2000 and January 2001.

Six broadcasts are 60 minutes long, and two broadcasts are 120 minutes long. The file names indicate the date (YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201.
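As a rough illustration of the naming convention described above, the sketch below pulls the date and the begin and end times out of a file name. The exact layout of the names is not given here, so the separator pattern and the example name `20000615_1800_1900.txt` are assumptions made for illustration only; the function name `parse_broadcast_name` is likewise hypothetical. EST is treated as a fixed UTC-5 offset.

```python
import re
from datetime import datetime, timedelta, timezone

# EST modeled as a fixed UTC-5 offset (no daylight-saving handling).
EST = timezone(timedelta(hours=-5), "EST")

# ASSUMPTION: the name contains an 8-digit date followed by two 4-digit
# times, separated by non-digit characters. The real layout may differ.
NAME_RE = re.compile(r"(\d{8})\D+(\d{4})\D+(\d{4})")


def parse_broadcast_name(name):
    """Return (start, stop) datetimes parsed from a transcript file name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"no date/time fields found in {name!r}")
    date, begin, end = match.groups()
    start = datetime.strptime(date + begin, "%Y%m%d%H%M").replace(tzinfo=EST)
    stop = datetime.strptime(date + end, "%Y%m%d%H%M").replace(tzinfo=EST)
    # A broadcast that runs past midnight ends on the following day.
    if stop <= start:
        stop += timedelta(days=1)
    return start, stop


if __name__ == "__main__":
    # Hypothetical file name, not taken from the corpus.
    start, stop = parse_broadcast_name("20000615_1800_1900.txt")
    print(start.isoformat(), stop - start)
```

The computed duration can serve as a quick sanity check, since every broadcast in the set is described as either 60 or 120 minutes long.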

Data

The files are encoded entirely in ASCII: the Arabic text content is rendered in Buckwalter transliteration. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with an open angle bracket as the first character of the line.
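For readers who want to view the text in Arabic script, the following is a minimal sketch of a converter from Buckwalter transliteration to Unicode. It assumes the commonly published Buckwalter table; whether this release follows that table exactly or uses a variant is not stated here, so the online documentation should be checked first. The names `BW_TO_ARABIC` and `buckwalter_to_arabic` are illustrative.

```python
# Standard Buckwalter symbols, listed in Unicode code point order.
# ASSUMPTION: the corpus uses this common table without modification.
_BW_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A
_BW_LETTERS_AND_MARKS = "fqklmnhwYyFNKaui~o"  # U+0641 .. U+0652

BW_TO_ARABIC = {}
for offset, symbol in enumerate(_BW_LETTERS):
    BW_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(_BW_LETTERS_AND_MARKS):
    BW_TO_ARABIC[symbol] = chr(0x0641 + offset)
BW_TO_ARABIC["_"] = "\u0640"  # tatweel
BW_TO_ARABIC["`"] = "\u0670"  # superscript (dagger) alef
BW_TO_ARABIC["{"] = "\u0671"  # alef wasla


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic script; pass others through."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(buckwalter_to_arabic("ktAb"))
```

Because "<" is itself a Buckwalter symbol, a converter like this should be applied only to the transcription tokens, never to the tag lines.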

The lines of transcription text (i.e. the speech and annotation content between the time-stamp tags) all begin with a single space character and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh").
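The line layout described above can be read with a small classifier such as the sketch below. It relies only on the stated conventions (tag lines start with "<", text lines start with one space, annotations are wrapped in "(%" and ")"). The tag shown in the sample is a placeholder, since the actual tag names are not listed here, and `classify_line` is a hypothetical helper name.

```python
import re

# An annotation token is a single word wrapped in "(%" and ")".
ANNOTATION_RE = re.compile(r"^\(%(.+)\)$")


def classify_line(line):
    """Return (kind, value) for one line of a transcript file."""
    line = line.rstrip("\r\n")
    if line.startswith("<"):
        # Markup: one pseudo-SGML tag per line.
        return ("tag", line)
    if line.startswith(" "):
        token = line[1:]  # drop the single leading space
        match = ANNOTATION_RE.match(token)
        if match:
            return ("annotation", match.group(1))
        return ("token", token)
    return ("other", line)


if __name__ == "__main__":
    # Placeholder tag name; the Arabic tokens are illustrative only.
    sample = ["<placeholder_tag>", " (%music)", " ktAb", " <lY", " ."]
    for raw in sample:
        print(classify_line(raw))
```

Checking the first character before anything else matters because a word can begin with the Buckwalter symbol "<"; the leading space on text lines is what keeps such a token from being read as a tag.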

Samples

For an example of the data contained in this corpus, please examine this screenshot of the transcription.

Content Copyright

Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.