October 2022 Newsletter | Linguistic Data Consortium

New Corpora

Announcements

Membership Year 2023 Publication Preview
The 2023 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
DEFT English ERE: English text from assorted genres annotated for entities, relations, and events
Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts, and a lexicon, released in separate speech and text data sets
REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC Data and Commercial Technology Development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: ACE
The objective of the Automatic Content Extraction (ACE) program was to develop the capability to extract meaning (entities, relations, and events) from multimedia sources (Doddington et al., 2004). LDC supported ACE by creating annotation guidelines, corpora, and other linguistic resources, including training and test data for the common task research evaluations (Strassel et al., 2003; Huang et al., 2004).

LDC’s Catalog contains multiple data sets from the program. One that regularly makes the list of LDC’s top ten most licensed corpora is ACE 2005 Multilingual Training Corpus (LDC2006T06). This data set contains 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. The genres include newswire, broadcast news, broadcast conversation, weblogs, discussion forums, and conversational telephone speech.

Another popular data set, ACE 2004 Multilingual Training Corpus (LDC2005T09), consists of varied genre text in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words) annotated for entities and relations.

ACE 2007 Multilingual Training Corpus (LDC2014T18) has the complete set of Arabic and Spanish training data for the 2007 ACE technology evaluation: Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions.

Other ACE corpora in the Catalog include ACE 2005 SpatialML Annotations in English and Mandarin (LDC2008T03, LDC2010T09, and LDC2011T02), Datasets for Generic Relation Extraction (reACE), TIDES Extraction (ACE) 2003 Multilingual Training Data, ACE-2 Version 1.0, ACE Time Normalization (TERN) 2004 English Training Data v 1.0, and more.

For the full list of available ACE data, visit LDC’s Catalog and select the ACE research project in the search menu. For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, and other documentation, visit LDC’s ACE webpage.