October 2018 Newsletter

Monday, October 15, 2018

New Publications

Concretely Annotated English Gigaword

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

TRAD Arabic-French Parallel Text -- Newswire


Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc, Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, supersense tagging, and semantic role labeling.

Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc, System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese.

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE2005 Multilingual Training Corpus for his research in relation extraction.

For information about the program, visit the Data Scholarship page.

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:

  • SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks; includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
  • Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University; semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR, using web source data from Chinese Treebank 8.0 (LDC2013T21)
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
  • TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
  • HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
  • BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership renewal.