

TDT Data QC

Due to its many interrelated components, the TDT3 Text Corpus requires additional quality checking to verify that the dependencies between the components hold. TDT3 has two types of sources: broadcast (audio) and newswire (text). The audio files created from the broadcast sources make up the TDT3 Audio Corpus, but text information is also derived from this audio data, so the TDT3 Text Corpus is derived from both broadcast and newswire. Newswire data, along with transcripts of the audio data, are tagged with SGML. Each of these SGML files is used to create a token file, which is still tagged with SGML but holds essentially one word per line and has no sentence or paragraph structure. Similar text token files are created from the audio files using speech recognition. TDT3 contains both English and Mandarin, and the Mandarin token files, whether created from the transcripts or from the audio, are translated using machine translation software. Finally, because each source file may contain several stories, every token file has a corresponding boundary table file, which indicates which tokens make up each story.

Regardless of the specific content of these files, the issue here is that there are a variety of dependencies within the corpus. The presence of a particular file implies the presence of corresponding files of other types. In putting together the corpus, files can be left out by accident, or files can be included that shouldn't be, perhaps because their original source file was removed from the corpus. So an extra measure of quality checking was done, which amounted to verifying the overall contents of the corpus. Given each file, the files of other types that go along with it should also be present, and given a corresponding set of files, each file should contain the same set of stories. These were the main points of the quality checking. Other aspects were also checked, some involving dependencies in the corpus and others involving file content.
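The two main checks described above can be illustrated with a minimal Python sketch. This is not the procedure actually used for TDT3. The file suffixes (`.sgm`, `.tkn`, `.bnd`), the function names, and the assumption that corresponding files share a base name are all hypothetical, chosen only to show the shape of the checks.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical component suffixes; the real TDT3 naming scheme may differ.
REQUIRED_SUFFIXES = {".sgm", ".tkn", ".bnd"}


def find_missing_components(filenames, required=REQUIRED_SUFFIXES):
    """Map each base name to the required suffixes that are absent.

    Assumes (hypothetically) that corresponding files share a base name
    and differ only in suffix. A base name appears in the result if any
    required companion file is missing, which also flags orphans whose
    source file was removed.
    """
    present = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        present[path.stem].add(path.suffix)
    return {
        base: sorted(required - suffixes)
        for base, suffixes in present.items()
        if required - suffixes
    }


def find_story_mismatches(stories_by_file):
    """Report files whose story set differs from the rest of their group.

    `stories_by_file` maps each file in one corresponding set to the set
    of story IDs it contains. The result maps each deviating file to the
    story IDs it lacks relative to the union over the whole group.
    """
    all_stories = set().union(*stories_by_file.values())
    return {
        name: sorted(all_stories - stories)
        for name, stories in stories_by_file.items()
        if stories != all_stories
    }


if __name__ == "__main__":
    files = ["src_a.sgm", "src_a.tkn", "src_a.bnd", "src_b.sgm", "src_b.bnd"]
    print(find_missing_components(files))
    print(find_story_mismatches({"src_a.tkn": {"s1", "s2"}, "src_a.bnd": {"s1"}}))
```

In this sketch, the first function covers the "corresponding files should also be present" check and the second covers the "same set of stories" check. How story IDs are read from the token and boundary table files is left out, since it depends on the actual file formats.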



Contact ldc@ldc.upenn.edu
Last modified: Tuesday, 22-Oct-2002 17:44:56 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.