Introduction
Broadcast News Lattices, Linguistic Data Consortium (LDC) catalog number LDC2011T06
and ISBN 1-58563-578-2, was developed by researchers at Microsoft and Johns
Hopkins University (JHU) for the Johns Hopkins 2010 Summer Workshop on Speech
Recognition with Conditional Random Fields.
The lattices were generated using the IBM Attila speech recognition toolkit
and were derived from transcripts of approximately 400 hours of English broadcast
news recordings. They are intended to be used for training and decoding with
Microsoft's segmental CRF toolkit for speech recognition, SCARF.
The goal of the JHU 2010 workshop was to advance the state of the art in core
speech recognition by developing new kinds of features for use in a Segmental
Conditional Random Field (SCRF). The SCRF approach generalizes Conditional Random
Fields to operate at the segment level, rather than at the traditional frame
level. Every segment is labeled directly with a word. Features are then extracted,
each of which measures some form of consistency between the underlying audio and
the word hypothesis for a segment. These are combined in a log-linear model
to produce the posterior probability of a word sequence given the audio.
Data
Broadcast News Lattices consists of training and test material, the source
data for which was taken from various corpora distributed by LDC.
Training Data
The training lattices total 152251 and were derived from the following data
sets:
1996 English Broadcast News Speech LDC97S44 and 1996 English Broadcast News Transcripts (HUB4) LDC97T22 (104 hours)
1997 English Broadcast News Speech (HUB4) LDC98S71 and 1997 English Broadcast News Transcripts (HUB4) LDC98T28 (97 hours)
TDT4 Multilingual Broadcast News Speech Corpus LDC2005S11 and TDT4 Multilingual Text and Annotations LDC2005T16 (300 hours)
The lattices can be related to the original audio files via the file train.db.gz,
which lists for each segment a tag-name, segment number, the original audio
file, channel (always 0), start time, and end time (in seconds). A sample line
is as follows:
19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404
This sample line corresponds to the release lattice labeled:
19960510_NPR_ATC#Ailene_Leblanc@0001.dc
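The mapping from a db line to a lattice label can be sketched in Python as follows. This is a minimal illustration, not part of the release: the names DbEntry, parse_db_line and lattice_label are hypothetical, and the sketch assumes that fields are separated by whitespace and that the label is formed as tag-name, "@", segment number, and a file suffix, as the sample above suggests.

```python
from typing import NamedTuple


class DbEntry(NamedTuple):
    tag: str          # tag-name, e.g. "19960510_NPR_ATC#Ailene_Leblanc"
    segment: str      # segment number, kept as text to preserve zero padding
    audio_file: str   # original audio file
    channel: int      # always 0 in this corpus
    start: float      # start time in seconds
    end: float        # end time in seconds


def parse_db_line(line: str) -> DbEntry:
    """Split one whitespace-separated db line into its six fields."""
    tag, segment, audio_file, channel, start, end = line.split()
    return DbEntry(tag, segment, audio_file, int(channel), float(start), float(end))


def lattice_label(entry: DbEntry, suffix: str = "dc") -> str:
    """Build "<tag>@<segment>.<suffix>" (pattern inferred from the sample)."""
    return f"{entry.tag}@{entry.segment}.{suffix}"


sample = "19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404"
entry = parse_db_line(sample)
print(lattice_label(entry))                # 19960510_NPR_ATC#Ailene_Leblanc@0001.dc
print(round(entry.end - entry.start, 3))   # segment duration in seconds
```

To process the whole file, the lines of train.db.gz could be read with Python's standard gzip module in text mode and passed to parse_db_line one at a time.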
The file train.Bdc contains denominator lattices. The file train.Bnc has the
numerator lattices containing the subset of paths consistent with the training
transcriptions. The file train.Btr consists of the transcriptions. The file
train.Bbase contains the baseline (one-best) word detections from the Attila system.
The lattices were generated from an acoustic model that included LDA+MLLT, VTLN,
fMLLR-based SAT training, fMMI and mMMI discriminative training, and MLLR. The
lattices are annotated with a field indicating the results of a second confirmatory
decoding made with an independent speech recognizer. When there was a correspondence
between a lattice link and the 1-best secondary output, the link was annotated
with +1. Silence links are annotated with 0 and all others with -1. Correspondence
was computed by finding the midpoint of a lattice link and comparing the link
label with that of the word in the secondary decoding at that position. Thus,
there are some cases where the same word shifted slightly in time receives a
different confirmation score.
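The midpoint rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used to build the corpus: the (word, start, end) layout of the secondary 1-best output, the function names, the treatment of an empty label as silence, and the timings in the example are all hypothetical.

```python
from typing import List, Optional, Tuple

# One word of the secondary 1-best output as (word, start, end).
# This layout is a hypothetical choice for the sketch.
Word = Tuple[str, float, float]

# "~SIL" appears in the sample lattices; treating an empty label as
# silence is an assumption.
SILENCE_LABELS = {"~SIL", ""}


def word_at(position: float, one_best: List[Word]) -> Optional[str]:
    """Return the secondary-decoding word whose span covers position, if any."""
    for word, start, end in one_best:
        if start <= position < end:
            return word
    return None


def confirm(label: str, start: float, end: float, one_best: List[Word]) -> int:
    """Score a lattice link: 0 for silence, +1 on a match, -1 otherwise."""
    if label in SILENCE_LABELS:
        return 0
    midpoint = (start + end) / 2.0
    return 1 if word_at(midpoint, one_best) == label else -1


# Hypothetical secondary 1-best timings, chosen only for illustration.
one_best = [("A", 216.0, 238.0), ("KID", 238.0, 255.0)]

print(confirm("KID", 220, 255, one_best))  # midpoint 237.5 falls inside "A"
print(confirm("KID", 222, 255, one_best))  # midpoint 238.5 falls inside "KID"
```

With these made-up timings, the two KID links differ only by a small shift in start position yet receive different scores, which is the effect noted at the end of the paragraph above.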
Test Data
The test lattices are derived from the English broadcast news material in 2003
NIST Rich Transcription Evaluation Data LDC2007S10. Bbase and Bdc files are
provided, along with the db file rt03.db.gz to link the segments to times in
the original waveform.
Scoring scripts may be obtained from the NIST Rich Transcription website.
SCARF Toolkit
The SCARF toolkit is available for download from the SCARF
website.
Related Publications
A full description of the lattice generation process can be found in Zweig
et al., Speech Recognition with Segmental Conditional Random Fields: Final
Report from the 2010 JHU Summer Workshop, MSR Technical Report MSR-TR-2010-173.
Updates
Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus at LDC2011T06.
Samples
Baseline word detections:
20010206_1830_1900_ABC_WNT#aaron_brown@0001.base
# baseline
2
A 5
HALF 20
CENTURY 56
AGO 95
LORRAINE 132
WAGNER 175
WAS 207
A 219
KID 239
WITH 263
A 270
CRUSH 300
THE 376
OBJECT 416
OF 446
HER 458
AFFECTION 497
AND 565
HER 583
CONSIDERABLE 637
ATTENTION 716
WAS 817
A 826
HUNKY 847
YOUNG 880
ACTOR 909
NAMED 934
RONALD 960
REAGAN 995
1012
Denominator lattice:
20010206_1830_1900_ABC_WNT#aaron_brown@0001.dc
1 2 confirm=0
3 5 A confirm=1
6 31 HALF confirm=1
32 77 CENTURY confirm=1
78 110 AGO confirm=1
111 151 LORRAINE confirm=1
111 151 LORAINE confirm=-1
152 196 WAGNER confirm=1
197 212 WAS confirm=1
197 215 WAS confirm=1
213 221 THE confirm=-1
216 219 A confirm=-1
220 253 KIT confirm=-1
220 254 KIT confirm=-1
220 255 KID confirm=-1
222 253 KIT confirm=-1
222 254 KIT confirm=-1
222 255 KID confirm=1
254 265 WITH confirm=1
254 267 WITH confirm=1
255 265 WITH confirm=1
255 267 WITH confirm=1
256 265 WITH confirm=1
256 267 WITH confirm=1
266 272 THE confirm=-1
268 270 A confirm=-1
271 327 CRUSH confirm=-1
271 327 CRASH confirm=-1
273 327 CRUSH confirm=-1
328 360 ~SIL confirm=0
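The link lines in the denominator lattice sample can be read with a short Python sketch like the one below. The field interpretation is inferred from the sample only and is an assumption: two integer positions, an optional word label, and a trailing confirm=<score> field. The names Link and parse_link are hypothetical, and the header line carrying the lattice label is not handled.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    start: int            # first number on the line
    end: int              # second number on the line
    word: Optional[str]   # None when the line carries no label
    confirm: int          # confirmation score: 1, 0 or -1


def parse_link(line: str) -> Link:
    """Parse one link line such as "220 255 KID confirm=-1"."""
    fields = line.split()
    start, end = int(fields[0]), int(fields[1])
    key, _, value = fields[-1].partition("=")
    if key != "confirm":
        raise ValueError(f"expected confirm=<score>, got {fields[-1]!r}")
    word = fields[2] if len(fields) == 4 else None
    return Link(start, end, word, int(value))


sample = """\
1 2 confirm=0
3 5 A confirm=1
220 255 KID confirm=-1
222 255 KID confirm=1
328 360 ~SIL confirm=0
"""
links = [parse_link(line) for line in sample.splitlines() if line.strip()]
confirmed = [link.word for link in links if link.confirm == 1]
print(confirmed)  # ['A', 'KID']
```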
Content Copyright
Portions © 1996-1998, 2000-2001 American Broadcasting Company, Inc.,
© 1996-1998, 2000-2001 Cable News Network LP, LLLP, © 2000-2001 National
Broadcasting Company, © 1996-1998 National Public Radio, Inc., © 1996-1998
National Satellite Cable Corporation, © 1996-1998, 2005, 2007, 2011 Trustees
of the University of Pennsylvania
The World is a co-production of Public Radio International and the British
Broadcasting Corporation and is produced at WGBH Boston.