September 2022 Newsletter

Thursday, September 15, 2022

New Corpora 

Xi’an Guanzhong Object Naming


Upcoming Policy Change to LDC’s Open Memberships
LDC is changing its open membership year policy beginning January 1, 2023.  Only one membership year will be open for joining – the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC’s many membership benefits will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don’t hesitate to contact our membership office.

LDC at Interspeech 2022
LDC is proud to sponsor the Workshop for Young Female Researchers in Speech (YFRSW), to be held in person as an Interspeech 2022 pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC’s Mark Liberman, “The mapping between syntactic and prosodic phrasing in English and Mandarin”, presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST.

LanguageARC: Citizen Science for Language
LanguageARC is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377). 

LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho and Swedish. Xi’an Guanzhong Object Naming LDC2022S09, released this month in LDC’s Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.

Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site’s Contact page.

Find LanguageARC on Facebook

30th Anniversary Highlight: Switchboard 
Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. 

This gold-standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information.