LDC's Software Development group has extensive experience creating and managing data collection and processing pipelines, ranging from large-scale broadcast and telephone audio recording to text chat collection. Its experience extends to tools covering all aspects of text, audio, image and video scouting, indexing, search and annotation, as well as annotation workflow management and quality control, including techniques developed specifically for treebank annotation. Nearly all LDC annotation tools are now web-enabled and based on a common framework that allows reuse across projects and tasks.
LDC has developed or acquired numerous Human Language Technologies and integrates them into annotation workflows where appropriate. These include speech activity detection, decoding and segment-level classification using deep learning; language identification; forced alignment (including the Penn Phonetics Forced Aligner, with models for American English, Mandarin Chinese, Brazilian Portuguese, Spanish, Japanese and Korean); content duplicate identification; sentence segmentation; tokenization; named entity tagging; morphological analysis, including for lexicon and treebank development; and syntactic parsing for treebank development.
LDC's software developers are equipped with desktop development workstations, computational servers, relational database servers, web servers and software development resources (e.g., compilers, interpreters, debuggers, text editors, GUI builders, IDEs and revision control systems), as well as issue tracking systems, e-mail discussion lists, a wiki-based knowledge base and other documentation.