New Publications
Concretely Annotated English Gigaword [1]
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 [2]
TRAD Arabic-French Parallel Text -- Newswire [3]
Announcements
Fall 2018 LDC Data Scholarship Recipients
Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:
Utkrist Adhikari: University of Bonn (Germany); M.Sc., Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, supersense tagging, and semantic role labeling.
Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc., System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.
Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese.
W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE2005 Multilingual Training Corpus for his research in relation extraction.
For information about the program, visit the Data Scholarship page [4].
Membership Year 2019 Publication Preview
The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:
- SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks; includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
- Chinese Abstract Meaning Representation (AMR): semantic representation of approximately 10,000 Chinese sentences, developed by Nanjing Normal University and Brandeis University following the basic principles of AMR and using web source data from Chinese Treebank 8.0 (LDC2013T21 [5])
- Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
- TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
- IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
- HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
- BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)
Check your inbox in the coming weeks for more information about membership renewal.