June 1997

This release of CallHome Arabic data consists of both the romanized and
Arabic script versions of the 80 training, 20 devtest, and 20 evaltest
transcripts.

Three subdirectories can be found here:

    doc/
    tools/
    transcrp/

Each of these contains the following:

doc/ contains ten files that describe the transcript corpus:

    ar-trans.doc  - describes Arabic transcripts
    callinfo.doc  - describes callinfo.tbl
    callinfo.tbl  - provides audit information for each channel
    devtest.ids   - provides a list of callids for the devtest
    evaltest.ids  - provides a list of callids for the evaltest
    spkrinfo.doc  - describes spkrinfo.tbl
    spkrinfo.tbl  - provides demographic information on the
                    telephone call originators
    iso-spec.doc  - describes ISO encoding used for Arabic script
    scr2rom.tbl   - describes LDC romanization to Arabic script
                    correspondence
    train.ids     - provides a list of callids for training set

tools/ contains two sets of tools:

    mule/      package for viewing and editing ISO 8859-6 Arabic
               text (useful for the ".scr" files in this release)
    scr_conv/  tools used to create the Arabic script (.scr) files.
               Includes a simple parser for the romanized
               transcripts that may be useful for other purposes.

transcrp/ contains three subdirectories:

    train/ contains two subdirectories:
        roman/   contains the 80 training transcripts in romanized
                 (original) form (.txt files)
        script/  contains the 80 training transcripts in Arabic
                 script form (ISO 8859-6 Arabic) (.scr files)

    devtest/ contains two subdirectories:
        roman/   contains the 20 devtest transcripts in romanized
                 (original) form (.txt files)
        script/  contains the 20 devtest transcripts in Arabic
                 script form (ISO 8859-6 Arabic) (.scr files)

    evaltest/ contains two subdirectories:
        roman/   contains the 20 previously-released evaltest
                 transcripts in romanized (original) form
                 (.txt files)
        script/  contains the 20 previously-released evaltest
                 transcripts in Arabic script form
                 (ISO 8859-6 Arabic) (.scr files)
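
For readers who want to work with the files outside of mule, the layout
described above can be walked with a short script. The following Python
sketch is not part of this release and is only an illustration: the
function names (list_transcripts, read_script_transcript) are
hypothetical, and it assumes only the transcrp/ directory structure and
the ISO 8859-6 encoding of the .scr files stated above.

```python
from pathlib import Path

# The three subdirectories of transcrp/ named in this README.
SPLITS = ("train", "devtest", "evaltest")


def read_script_transcript(path):
    """Decode one Arabic script (.scr) file from ISO 8859-6 to a str."""
    return Path(path).read_bytes().decode("iso8859_6")


def list_transcripts(root):
    """Return {split: {"roman": [...], "script": [...]}} for a release root.

    `root` is the directory that holds doc/, tools/ and transcrp/.
    Each list holds sorted Path objects for the .txt or .scr files.
    """
    root = Path(root)
    layout = {}
    for split in SPLITS:
        base = root / "transcrp" / split
        layout[split] = {
            "roman": sorted((base / "roman").glob("*.txt")),
            "script": sorted((base / "script").glob("*.scr")),
        }
    return layout
```

On an intact copy of the release, the counts per split should match the
80 training, 20 devtest, and 20 evaltest transcripts listed above, so
comparing len(layout[split]["roman"]) against those numbers is a quick
sanity check. The romanized .txt files are not decoded here because
their conventions are described in ar-trans.doc.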