New LDC Website Coming Soon

Look for LDC's new website in the coming weeks. We've revamped the design and site plan to make it easier than ever to find what you're looking for. The features you use the most -- the catalog, new corpus releases and user login -- will be a short click away. We expect the LDC website to be occasionally unavailable for a few days at the end of September as we make the switch. Thank you in advance for your understanding.


New Corpora

Egyptian Arabic Informal Web Data: BOLT Arabic Discussion Forums: developed by LDC, 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes

Task Specifications

BOLT developed technology that enables English speakers to retrieve and understand information from informal foreign language sources including chat, text messaging and spoken conversations. The genres of interest to BOLT were characterized by inherent variation and inconsistency, motivating the development of new collection and annotation methods. 


Heterogeneous Audio Visual Internet Collection (HAVIC)

LDC built a large corpus of multi-modal data to support research in a variety of areas including spoken term detection and video event detection. The HAVIC (Heterogeneous Audio Visual Internet Collection) Corpus consists of thousands of hours of “real world” video data collected from the Internet. The corpus specifically targeted user-generated video content as opposed to professionally produced or commercial video content.

