TDT3 English Video Sources Overview

CNN Headline News is a 24-hour/day cable-TV broadcast, which presents top news stories continuously throughout the day. Some portion of the daily broadcast schedule includes closed-caption transcription. (No other form of transcription exists for this program in the normal course of events.) Typically, 20 or more distinct news stories are covered in each 30-minute portion of programming. When closed captions are provided, they include markers that flag changes of topic and changes of speaker. The closed-caption text stream is padded with null bytes so that the text content is reasonably well aligned in time with the audio content. The accuracy of the caption text is reasonably good, but within any given 30-minute broadcast, one should expect to find numerous cases (dozens?) where whole words or phrases in the audio are missing from the captions, as well as a few cases (less than a dozen?) where the caption text is clearly wrong (i.e., the captioner has misspelled or misunderstood what was said).

ABC World News Tonight and NBC Nightly News are daily 30-minute news broadcasts that typically cover about a dozen different news items. Closed captioning, provided for every broadcast, is presumably of better quality than the CNN Headline News captions, because the program content is more carefully prepared (there tend to be fewer mistakes in the caption content); still, owing to differences between the rate of speech and the rate of text display, it is not uncommon to find words and phrases that were spoken but omitted from the captions. Story and speaker-turn boundaries are marked in roughly the same way as in the CNN captions.

The MSNBC News with Brian Williams is a one-hour broadcast that airs only on weekdays. Each program tends to have a "focus" topic that occupies at least 15 minutes of the hour, and this often includes one or more interview segments.
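The marker and padding conventions described above suggest a simple first pass over a captured caption stream. The following is a minimal sketch, assuming NUL-byte padding and ">>" as the topic-boundary marker; the actual marker conventions vary by captioner, and the function name is ours:

```python
def parse_caption_stream(raw: bytes) -> list[str]:
    """Strip NUL-byte padding and split the caption text at topic markers.

    Assumes ">>" flags a topic boundary, as in the caption excerpts
    discussed in this document; this is an illustration, not the LDC's
    actual capture code.
    """
    text = raw.replace(b"\x00", b"").decode("ascii", errors="replace")
    return [seg.strip() for seg in text.split(">>") if seg.strip()]

# Example: three caption segments separated by padding and topic markers.
sample = (b"Good evening.\x00\x00"
          b">> In this hour, a report from the Gulf.\x00"
          b">> Weather is next.")
print(parse_caption_stream(sample))
```

Real caption streams also carry speaker-change markers and commercial-break cues; a production pass would treat those separately rather than discarding them.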
There are also regular "feature" pieces on a daily or weekly basis, in which the anchor person reviews the covers of prominent daily newspapers or weekly news magazines. Each broadcast comes with closed captioning.

The insertion points for topic boundaries are determined by the people who create the closed-caption transcription, and it is not clear to us exactly how this is done. We have observed that when an announcer quickly lists a series of stories that are coming up in the broadcast, the captioner typically marks a boundary at each mention of a story. LDC annotators have also seen cases where a topic boundary was apparently inserted by mistake in the middle of a story. Another characteristic of topic-boundary marking by captioners is that they do not mark the end of a topic, but only the beginning point of the next topic. If a news story is followed directly by a commercial break (without the announcer listing the stories to follow), there is no boundary marked at the beginning of the commercial. (However, it is possible to detect other cues in the closed-caption stream that flag the onset of commercial breaks, and we make use of these cues when capturing the stream to a text file.)

Sampling and capture

Each daily ABC and NBC broadcast, along with up to four 30-minute sections of CNN, is recorded each day. The CNN segments are drawn from the portion of the daily schedule that happens to include closed captioning; CNN provides captioning over a 16-hour period each weekday and an 8-hour period each weekend day, so we typically collect four half-hour samples per day on weekdays and three per day on weekends. These broadcasts are captured directly from a cable TV connection. The signal first goes to a VCR, which is programmed to put the broadcast on VHS tape.
The audio line output from the VCR is connected to a DAT recording deck, and this in turn has its digital audio output connected to a Townshend DATLink unit, which in turn has a SCSI connection to a Sun SPARC workstation. In addition, the VCR's video output is passed through a closed-caption decoder, which converts the closed-caption signal into ASCII text and sends this data over a serial connection to the same SPARC workstation. So, at broadcast time, the VCR comes on under its own control to tape the broadcast; the DAT recorder is activated by a remote-control emitter that is connected to the DATLink unit and controlled by the workstation. A control process is scheduled on the workstation to execute at broadcast time to (a) start the DAT deck in "Record" mode, (b) start sampling on the DATLink device, and (c) start receiving closed-caption text data on the serial port. The DAT deck samples the signal at 32 kHz for storage on the DAT cartridge, and the DATLink downsamples the digital audio to 16 kHz for storage into a disk file.

When the DATLink recording process is done, the VCR shuts itself off, a remote-control signal is sent to the DAT deck to stop its recording, the serial-port connection is closed, and the control process on the workstation runs a quality-check program on the resulting data files. The results of that quality check (waveform and text file sizes, min and max waveform sample values, number of occurrences of apparent peak clipping in the waveform) are placed into an Oracle database table together with the file-id for the broadcast. If the quality check indicates any problem with either the closed-caption text file or the waveform file, the video tape is checked, and if it was recorded correctly, it is used as the signal source (the particular broadcast is played back in its entirety) to redo the digital audio capture (to DAT and disk) and the closed-caption capture.
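The quality-check step can be sketched roughly as follows. This is a hypothetical reconstruction, not the LDC's actual program: it assumes a headerless 16-bit linear PCM waveform file, and it treats any sample at or near full scale as apparent peak clipping; the function name and threshold are our own.

```python
import array

def quality_check(waveform_path: str, caption_path: str,
                  clip_threshold: int = 32760) -> dict:
    """Report the statistics described in the text: file sizes, min and
    max waveform sample values, and a count of apparent peak clipping.

    Assumes headerless 16-bit signed PCM; clip_threshold is an assumed
    near-full-scale cutoff, not a documented LDC value.
    """
    with open(waveform_path, "rb") as f:
        samples = array.array("h", f.read())   # 16-bit signed samples
    clipped = sum(1 for s in samples
                  if s >= clip_threshold or s <= -clip_threshold)
    with open(caption_path, "rb") as f:
        caption_bytes = len(f.read())
    return {
        "wave_bytes": 2 * len(samples),
        "caption_bytes": caption_bytes,
        "min_sample": min(samples),
        "max_sample": max(samples),
        "clipped": clipped,
    }
```

In the workflow above, the resulting dictionary would be written to the Oracle table keyed by the broadcast's file-id, and out-of-range values would trigger the tape-playback recapture path.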
Due to cable service limitations at the LDC, the MSNBC broadcasts must be captured on video tape at another location, and the capture of digital audio and closed captioning is done via VCR playback of the tapes. The same workstation and tape drives are used as for the live capture of the other video sources; except for the MSNBC source being a video tape, all other aspects of digital capture are identical.

Issues and problems

The distribution of topic-boundary marks in the closed-caption text
shows some clear shortcomings (not marking the ends of stories in some
conditions), and some practices that may pose problems for annotators
(marking a boundary at every short phrase during a listing of upcoming
stories). Annotators must be careful to make sure that story boundaries
are marked where they need to be, adding marks where necessary. They must
also decide when the amount of text between two story boundaries is too
short to be called a story; in this regard, a boundary would be marked
as the start of a story if the following content consists of two or more
independent clauses of informative text about a news item. This will mean
that a listing of upcoming news like the following will not be classed
as three independent stories, but rather as a single region of non-news
material (where ">>" indicates topic boundaries present in the original
closed caption text):

>> In this hour, the latest report from the Persian Gulf,
>> more leaks from Kenneth Starr's special investigation,
>> and another tornado hits central Florida.

Another property of closed-caption text that we observe on occasion is that a portion of the broadcast may contain a spoken report (comprising an independent news story) for which little or no closed-caption text is provided. For example, an announcer may utter two or three independent, informative clauses reporting a brief news item, but the closed-caption text will contain none or only one of these clauses. In this case, the person doing the segmentation will classify the segment boundary as an "untranscribed news segment", meaning that there is not enough text provided to pass the "two-clause" rule, even though the audio signal does contain enough information to constitute a story. Segments marked in this way will not be treated during the topic-labeling stage of annotation.

Audio segmentation

When the closed-caption and waveform data have been confirmed to be okay, the broadcast episode is made available for manual verification of the time stamps associated with the topic (story) boundaries. The task of the annotator is three-fold:
This annotation is done using a modified version of the closed-caption text file in a hybrid emacs/xwaves user interface developed by the LDC. When the audio segmentation has been completed for a given episode, the modified text file is filtered to produce the common TDT2 SGML format. Every file is gone over twice, with the second pass being done by a different annotator.
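The "two-clause" rule that annotators apply when deciding whether a boundary starts a story can be illustrated with a deliberately crude heuristic. Real judgments are made by a human reading the text; the clause-splitting below (on sentence-final punctuation) is only a stand-in, and the function name is ours:

```python
import re

def is_story(segment: str, min_clauses: int = 2) -> bool:
    """Approximate the two-clause rule: a segment counts as a story only
    if it contains at least two independent clauses of informative text.

    Here clauses are approximated by splitting on sentence-final
    punctuation, purely for illustration.
    """
    clauses = [c for c in re.split(r"[.!?;]+", segment) if c.strip()]
    return len(clauses) >= min_clauses

# A headline-style mention from an upcoming-stories listing fails the rule;
# a brief two-sentence report passes it.
print(is_story("In this hour, the latest report from the Persian Gulf,"))      # False
print(is_story("A tornado struck central Florida today. Two homes were lost.")) # True
```

Under this rule, a segment with spoken content but too little caption text would fall below the threshold and be flagged as an "untranscribed news segment" rather than a story.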