Introduction
ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC)
under catalog number LDC2004S02 and ISBN 1-58563-285-6.
The ICSI Meeting corpus is a collection of 75 meetings collected at the
International Computer Science Institute in Berkeley during the
years 2000-2002. The meetings included are "natural" meetings in
the sense that they would have occurred anyway: they are generally
regular weekly meetings of various ICSI working teams, including the
team working on the ICSI Meeting Project. In recording meetings of
this type, we hoped to capture meeting dynamics and speaking styles
that are as natural as possible given that speakers are wearing
close-talking microphones and are fully cognizant of the recording
process. The speech files range in length from 17 to 103
minutes, but generally run just under an hour each.
Word-level orthographic transcriptions are available as ICSI Meeting Transcripts.
Data
The collection includes 922 speech files, for a total of approximately 72 hours of meeting room speech.
The speech is organized as one subdirectory per meeting, each containing a wavefile for each channel and, where applicable, a .blp file specifying any censored intervals.
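As a rough illustration of the layout described above, the following sketch groups channel files by meeting subdirectory. The `.sph` suffix, the function name `channels_by_meeting`, and the corpus root path are assumptions made for the example; the actual file and directory names should be checked against the distribution.

```python
from collections import defaultdict
from pathlib import Path


def channels_by_meeting(root, suffix=".sph"):
    """Map each meeting subdirectory name to a sorted list of its channel files.

    The ".sph" suffix is an assumption; adjust it to match the real file names.
    Files with other suffixes (for example a .blp file) are ignored.
    """
    meetings = defaultdict(list)
    # One level of subdirectories under root, one file per channel inside each.
    for path in Path(root).glob(f"*/*{suffix}"):
        meetings[path.parent.name].append(path.name)
    return {name: sorted(files) for name, files in sorted(meetings.items())}
```

Calling `channels_by_meeting("/path/to/corpus")` (a hypothetical path) would return a dictionary keyed by meeting directory name, which makes it easy to count channels per meeting.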
The audio was collected at a 48 kHz sampling rate and downsampled on the
fly to 16 kHz. Audio files for each meeting are provided as separate
time-synchronous recordings for each channel, encoded as 16-bit linear
(big-endian) wavefiles, shorten-compressed in NIST SPHERE format.
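NIST SPHERE files begin with a plain ASCII header, so the stated format (16 kHz, 16-bit, big-endian) can be checked without decoding any audio. The sketch below parses such a header from raw bytes. The helper name `parse_sphere_header` and the sample header are made up for the example and are not copied from the corpus; the field names follow common SPHERE usage and should be confirmed against real files.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dictionary.

    Header lines have the form "name -type value", where the type is
    -i (integer), -r (real), or -sN (string of length N).
    """
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # lines[1] holds the header size in bytes; the fields start on the next line.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value
    return fields


# Synthetic header for illustration only (values assumed, not read from the corpus).
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_byte_format -s2 10\n"
    b"sample_coding -s26 pcm,embedded-shorten-v2.00\n"
    b"end_head\n"
).ljust(1024)

fields = parse_sphere_header(sample)
# In common SPHERE usage, byte format "10" denotes big-endian samples.
is_big_endian = fields.get("sample_byte_format") == "10"
```

This reads only the header. Because the sample data is shorten-compressed, it would need to be decompressed with an external tool (for example, LDC's sph2pipe) before the samples can be read.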
The meetings were simultaneously recorded using close-talking
microphones for each speaker (generally head-mounted, but early
meetings contain some lapel microphones), as well as six table-top
microphones: four high-quality omnidirectional PZM microphones arrayed
down the center of the conference table, and two inexpensive microphone
elements mounted on a mock PDA. All meetings were recorded in the same instrumented meeting room.
In addition to the meetings themselves, participants were asked to read digit strings,
similar to those found in TIDIGITS, at the start or end of the meeting. This small-vocabulary
read-speech component of the recordings -- using the same meeting
room, speakers, and microphones -- provides a valuable supplement
to the natural conversational data, allowing a factorization of the
speech challenges offered by the corpus. For all but a dozen of the meetings included in the corpus, at least some
of the participants read digit strings; for the great majority of
meetings, all participants did. The digit readings are included
as part of the wavefiles for the meeting as a whole and are fully
transcribed as part of the associated transcripts.
There are 53 unique speakers in the corpus. Meetings involved from three to ten participants, averaging six. The corpus contains a significant proportion of non-native
English speakers, varying in fluency from nearly-native to challenging-to-transcribe.
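The figures quoted in this description can be cross-checked with a little arithmetic. The sketch below assumes that the 72 hours refers to total meeting time (not the summed duration of all channel files) and that each speech file holds one channel; both are interpretations, not statements from the corpus documentation.

```python
# Figures quoted in this description.
meetings = 75
speech_files = 922
total_hours = 72
avg_participants = 6   # one close-talking microphone per speaker
tabletop_mics = 6      # four PZM microphones plus two PDA elements

# Assumption: total_hours counts meeting time, not per-channel time.
avg_minutes_per_meeting = total_hours * 60 / meetings

# Assumption: one speech file per recorded channel.
avg_files_per_meeting = speech_files / meetings
expected_channels = avg_participants + tabletop_mics

print(f"average meeting length: {avg_minutes_per_meeting:.1f} minutes")
print(f"average files per meeting: {avg_files_per_meeting:.2f}")
print(f"expected channels per meeting: {expected_channels}")
```

Under these assumptions the average meeting length comes to 57.6 minutes, in line with meetings that "generally run just under an hour," and the roughly 12.3 files per meeting is close to the 12 channels expected from six speakers and six table-top microphones.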
Sponsorship
The collection and preparation of this corpus were made possible in
large part through funding from DARPA (both through the Communicator
project and through a ROAR "seedling"), the Swiss IM2 project (National
Centre of Competence in Research, sponsored by the Swiss National
Science Foundation), and a supplementary award from IBM.
Updates
There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.
Content Copyright
Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania