Classic Corpora in LDC’s Catalog: CALLFRIEND

The CALLFRIEND series is a multi-language collection of unscripted telephone conversations conducted by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. 

This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996 to 2007. The first editions of the CALLFRIEND series, published in LDC’s Catalog in 1996, each contain 60 calls split evenly into three 20-call partitions: a training partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo, et al., 2004).
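
As a rough illustration of the three-way partition described above, the following Python sketch divides 60 call identifiers into three equal, non-overlapping partitions. The call identifiers, the partition names and the `split_evenly` helper are hypothetical; they do not reflect LDC's actual file naming or the procedure used to assign calls to partitions.

```python
def split_evenly(call_ids, names=("train", "dev", "eval")):
    """Split call_ids into len(names) equal, non-overlapping partitions."""
    if len(call_ids) % len(names) != 0:
        raise ValueError("number of calls must divide evenly among partitions")
    size = len(call_ids) // len(names)
    return {
        name: call_ids[i * size:(i + 1) * size]
        for i, name in enumerate(names)
    }


# Hypothetical call identifiers; real CALLFRIEND file names differ.
calls = [f"call_{n:02d}" for n in range(1, 61)]
partitions = split_evenly(calls)
```

Keeping the evaluation partition separate from the data used for training and tuning is what allows it to serve as an unbiased test of performance.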

Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01).

In addition to language identification, CALLFRIEND corpora have supported a variety of research, including studies of subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021).

To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list.

Classic Corpora in LDC’s Catalog: ACE

The objective of the Automatic Content Extraction (ACE) program was to develop the capability to extract meaning (entities, relations and events) from multimedia sources (Doddington, et al., 2004). LDC supported ACE by creating annotation guidelines, corpora and other linguistic resources, including training and test data for the common task research evaluations (Strassel, et al., 2003; Huang, et al., 2004).

There are multiple data sets in LDC’s Catalog from the program. One that regularly makes the list of LDC’s top ten most licensed corpora is ACE 2005 Multilingual Training Corpus (LDC2006T06). This data set contains 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. 

Another popular data set, ACE 2004 Multilingual Training Corpus (LDC2005T09), consists of varied genre text in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words) annotated for entities and relations.

ACE 2007 Multilingual Training Corpus (LDC2014T18) has the complete set of Arabic and Spanish training data for the 2007 ACE technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions.

Other ACE corpora in the Catalog include ACE 2005 SpatialML Annotations in English and Mandarin (LDC2008T03, LDC2010T09, and LDC2011T02), Datasets for Generic Relation Extraction (reACE), TIDES Extraction (ACE) 2003 Multilingual Training Data, ACE-2 Version 1.0, ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (TERN), and more.

For the full list of available ACE data, visit LDC’s Catalog and select the ACE research project in the search menu. For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, visit LDC's ACE webpage.

Classic Corpora in LDC’s Catalog: Switchboard

Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. 
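
The two selection constraints above can be sketched as a simple bookkeeping check. This is a minimal illustration only, not the logic of the original robot operator; the function names, speaker labels and topic names are hypothetical.

```python
def is_allowed(caller, callee, topic, past_pairs, topics_used):
    """Return True if a proposed call satisfies both selection constraints."""
    if caller == callee:
        return False
    # (1) no two speakers converse together more than once
    if frozenset((caller, callee)) in past_pairs:
        return False
    # (2) no one speaks more than once on a given topic
    for speaker in (caller, callee):
        if topic in topics_used.get(speaker, set()):
            return False
    return True


def record_call(caller, callee, topic, past_pairs, topics_used):
    """Update the bookkeeping after a conversation has taken place."""
    past_pairs.add(frozenset((caller, callee)))
    for speaker in (caller, callee):
        topics_used.setdefault(speaker, set()).add(topic)


# Hypothetical speakers and topics, for illustration only.
past_pairs = set()
topics_used = {}
record_call("A", "B", "recycling", past_pairs, topics_used)

repeat_pair = is_allowed("B", "A", "pets", past_pairs, topics_used)
repeat_topic = is_allowed("A", "C", "recycling", past_pairs, topics_used)
fresh_call = is_allowed("A", "C", "pets", past_pairs, topics_used)
```

Storing each pair as a `frozenset` makes the check independent of who placed the call, so a conversation between A and B blocks a later one between B and A.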

This gold-standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information. 

Classic Corpora in the Catalog: The LDC Gigawords

Giga: a combining form meaning “billion,” used in the formation of compound words (Source: https://www.dictionary.com/browse/giga-)

LDC’s Gigaword corpora are a natural outgrowth of its vast, decades-long multi-language newswire collection. Newswire data was originally collected, annotated, and distributed for use in many sponsored projects and was also released through the LDC catalog in tailored data sets. Then came the idea of making LDC’s entire newswire collection available by language with a simple, minimal markup to support a broad range of NLP/HLT tasks. The first Arabic, Chinese and English gigaword editions were released in 2003; subsequent cumulative releases through fifth editions in 2011 represent LDC’s newswire collection spanning 1994-2010 in those languages. French and Spanish gigawords were first published in 2006, culminating in the release of third editions in 2011, likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous ways. Automatic text summarization is a favorite, and current work in this area applies deep learning methods (see, e.g., Gao et al. 2020, English). Gigawords are also useful for text source classification (Huang et al. 2003, Chinese), information extraction (Lan et al. 2020, Arabic), knowledge extraction and distributional semantics (Napoles et al. 2012, English) and natural language understanding (Ganitkevitch 2013, English), among other fields. Recent variations like the annotated and concretely annotated English gigawords add syntactic, semantic, and coreference annotations to this billion-word text collection.

All Gigaword corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

Classic Corpora in LDC’s Catalog: ATIS

The ATIS corpora were among the first publications to appear when LDC’s catalog launched in 1993. ATIS0 Complete (LDC93S4A) comprises spontaneous speech, read speech and other material from participants in the ATIS collection; this material is also contained in ATIS0 Pilot (LDC93S4B), ATIS0 Read (LDC93S4B-2) and ATIS0 SD-Read (LDC93S4B-3).

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology and SRI International.

The ATIS collection has been widely used to further research in spoken language understanding and slot filling (Kuo et al., 2020). Other data sets published from the collection include ATIS2 (LDC93S5), ATIS3 Training and Test Data (LDC94S19, LDC95S26) and, more recently, Multilingual ATIS (LDC2019T04) and ATIS - Seven Languages (LDC2021T04).

All ATIS corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.