Software Development

LDC's Software Development group has extensive experience creating and managing data collection and processing pipelines, ranging from large-scale broadcast and telephone audio recording to text chat collections. That experience extends to tools covering all aspects of text, audio, image and video scouting, indexing, search and annotation, as well as annotation workflow management and quality control, including techniques developed specifically for treebank annotation. Nearly all LDC annotation tools are now web-enabled and based on a common framework that allows reuse across projects and tasks.

LDC has developed or acquired numerous Human Language Technologies and integrates them into annotation workflows where appropriate. These include speech activity detection, decoding and segment-level classification using deep learning, language identification, forced alignment, content duplicate identification, sentence segmentation, tokenization, named entity tagging, morphological analysis for lexicon and treebank development, and syntactic parsing for treebank development. One example is the Penn Phonetics Forced Aligner (P2FA), an automatic phonetic segmentation toolkit based on HTK [1]. It can be used to align audio with the corresponding orthographic transcript for a number of languages, including American English, Mandarin Chinese, Brazilian Portuguese, Spanish, Japanese, and Korean.

LDC's software developers are equipped with desktop development workstations, computational servers, relational database servers, web servers, software development resources (e.g., various compilers, interpreters, debuggers, text editors, GUI-builders, IDEs, revision control systems), issue tracking systems, e-mail discussion lists, a wiki-based knowledge base and other documentation.