November 2022 Newsletter
New Corpora
BOLT English Translation Treebank – Egyptian Arabic SMS/Chat
Samrómur Children Icelandic Speech 1.0
Third DIHARD Challenge Development
Announcements
Join LDC for Membership Year 2023
It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.
In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.
Plans for 2023 publications are in progress. Among the expected releases are:
- AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
- 2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
- DEFT English ERE: English text from assorted genres annotated for entities, relations and events
- Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
- CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts and a lexicon, released in separate speech and text data sets
- REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
- LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)
For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.
Fall 2022 LDC Data Scholarship Recipients
LDC congratulates the following Fall 2022 data scholarship recipients:
- Nelson Filipe Costa: Concordia University (Canada); PhD, Machine Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 (LDC2019T05) for his work in discourse relationships and mapping.
- Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English (LDC2014T06) for his research on text classification.
- Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his research on forensic speech recognition.
- Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is awarded copies of Arabic Treebank Part 1 v. 4.1 (LDC2010T13) and Arabic Treebank Part 2 v. 3.1 (LDC2011T09) for his work on analyzing syntactic and lexical similarities across MSA genres and POS-tagging for MSA.
Students can learn more about the LDC data scholarship program on the Data Scholarships page.
Spring 2023 Data Scholarship Application Deadline
Applications are now being accepted through January 15, 2023 for the Spring 2023 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.
30th Anniversary Highlight: CALLFRIEND
The CALLFRIEND series is a multi-language collection of unscripted telephone conversations conducted by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America.
This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996 to 2007. The first editions of the CALLFRIEND series, published in LDC’s Catalog in 1996, contain 60 calls split evenly into three partitions of 20 calls each: a training partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo et al., 2004).
Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01).
In addition to work on language identification, CALLFRIEND corpora have been used in a wide variety of research, including studies of subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021).
To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list.