
Linguistic Resources  
LDC Facilities

Introduction

The Linguistic Data Consortium has its offices on the top floor of 3600 Market Street in Philadelphia's University City Science Center. The eighth floor suite, with over 11,000 usable square feet, was configured specifically for LDC with 22 single, double and triple offices, large and small conference rooms, a recording booth, a focus group room and six laboratories, including separate labs for broadcast news collection, participant recruiting, annotation and publications, plus specially equipped telecommunications and data closets, a corpus packaging workroom/mailroom and a pantry.

The large conference room seats 30 and contains Windows and Unix workstations, power and network outlets for guests, a wireless network access point, high resolution computer projector, overhead projector and a large screen TV connected to the University's cable network. The small conference room seats four with network outlets for guests.

The consortium also has a specialized server area within the Franklin Building Annex, connected to offices with a redundant fiber optic Gigabit Ethernet link.


IT and Networking Infrastructure

LDC IT infrastructure is a comprehensive, autonomous and internally-managed system within University of Pennsylvania computing. This allows for the modularity and flexibility required to best match the needs of research projects. LDC infrastructure includes:
  • In-house Storage-as-a-Service (SaaS) system that integrates a heterogeneous set of storage solutions, ranging from DAS to NAS and Fibre Channel-based SAN. More than 150 TB are currently online, with the capability to scale up to 1 PB rapidly and transparently to services.
  • A comprehensive system of backup and disk replication.
  • Standard Internet services (e-mail, web, etc.), more than 300 switched Gigabit ports, a /22 private IP space, and a public /24 and a public /23 on two separate networks. Two autonomous WiFi networks are also provided on the private IP space.
  • A helpdesk and support service for LDC staff and researchers, with real-time monitoring systems, alarming, and trouble ticket processing tools.
The following list gives more detail about LDC servers and IT infrastructure.

LDC maintains several systems supporting the following functions:

  • Web Servers:
    • Virtualized web infrastructure consisting of 13 virtual servers
  • Virtualization Servers:
    • 2x Quad Core AMD Server, 64 GB of RAM (production ESX environment)
    • 1x Quad Core AMD Server, 24 GB of RAM (web production)
    • 2x Quad Core Xeon Server, 16 GB of RAM (production)
    • 2x Quad Core Xeon Server, 16 GB of RAM (development)
  • Database Servers:
    • MySQL - AMD Opteron Quad-core server with 8 GB RAM
    • MySQL - Virtualized slave server
  • File Servers: Multiple servers supporting over 250 TB of storage
    • iSN Cloverleaf systems, fully redundant/high availability, serving NFS, FC and SMB volumes
    • Sun 5320 with redundant Fibre Channel RAIDs serving about 50 TB
    • Fast, redundant Fibre Channel RAIDs serving about 30 TB
    • Virtualized SCP server for data distribution
  • Backup System:
    • Mirror/replication and snapshots on iSN
    • Replication system (Sun 4500) for dynamic data
    • 2 Tape robots for static data
    • VTL on site and on cloud computing services
    • 2 Backup servers
  • Administrative:
    • Mail servers:
      • AMD Quad-core server with 64 GB of RAM
      • Externally located Dual Opteron server with 2 GB of RAM
  • Admin Directory space - Restricted Access Storage Area
  • Network:
    • Public
      • 2x24 switched Gigabit Ethernet ports
      • Optic fibre to the Internet
      • 1 CIDR /23 network (equivalent to 2 class-C networks)
      • 1 CIDR /24 network (equivalent to 1 class-C network)
    • Private - 264 fully switched Gigabit Ethernet ports
    • 2 Wireless networks 802.11n
    • Optic fibre to the main server room
    • Real time monitoring server with alarming services
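
The address-space and port figures in the list above can be checked with Python's standard `ipaddress` module. This is only an arithmetic sketch: the prefixes used here (`10.0.0.0/22`, `198.51.100.0/23`, `192.0.2.0/24`) are placeholder ranges chosen for illustration, not LDC's actual addresses, and `class_c_equivalents` is a hypothetical helper name.

```python
import ipaddress

# Placeholder prefixes (assumptions for illustration; not LDC's real ranges).
networks = {
    "private /22": ipaddress.ip_network("10.0.0.0/22"),
    "public /23": ipaddress.ip_network("198.51.100.0/23"),
    "public /24": ipaddress.ip_network("192.0.2.0/24"),
}

CLASS_C_SIZE = 256  # addresses in a legacy class-C (/24) network


def class_c_equivalents(net):
    """Return how many /24-sized blocks fit in the given network."""
    return net.num_addresses // CLASS_C_SIZE


for label, net in networks.items():
    print(f"{label}: {net.num_addresses} addresses, "
          f"{class_c_equivalents(net)} class-C equivalent(s)")

# Switched Gigabit ports from the list: 2 x 24 public plus 264 private.
total_ports = 2 * 24 + 264
print("total switched ports:", total_ports)
```

The port total works out to 312, which is consistent with the "more than 300" ports mentioned in the infrastructure summary.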

    In addition we have approximately seventy (70) Annotation/Transcription workstations running various operating systems such as Solaris, Windows, FreeBSD, and Linux. Sixty of these workstations are collected in four common work areas of varying size.


    Human Subjects Data Collection Laboratories

    LDC maintains facilities for producing recordings of speech both on-site and via telephone.

    • The first is an acoustically treated soundbooth. The soundbooth has been designed to isolate the speaker from extraneous noise sources; the door has an acoustical seal and drop sweep, and the window has multiple panes of glass. In order to minimize the amount of equipment located within the soundbooth, wall-plates and in-wall microphone cables were included in the design. All recording equipment can be located outside of the booth at an operator's station with a direct cable link to up to four microphones within the soundbooth.

    • LDC has two offices which have been converted into multichannel audio recording spaces. While the dimensions of the rooms differ, the physical layout of furniture, microphones, and recording equipment is consistent across the two rooms. Because the spatial relationship between installed microphones is consistent, recordings of speakers made in the two rooms can be controlled for distance. Each room supports up to 16 separate microphones, and 16 separate audio tracks can be recorded simultaneously. Each room includes a modern digital audio workstation with 16 direct microphone inputs, a telephone digital hybrid system, an analog matrix mixer, and customized, socket-controllable recording software. The telephone digital hybrid system allows an analog telephone line to be connected directly to the digital audio workstation; this is significant because it allows us to record individuals over the installed microphones and the telephone line simultaneously. The analog matrix mixer allows routing, mixing and redistribution of speech in real time; this allows us to relay modified signals back to a subject who is being recorded, in order to evaluate the effects of masking noise on speech production. Each digital audio workstation has adequate storage to accommodate extended data collections and includes custom scripts that automatically transfer recordings to the main LDC network. In addition to this infrastructure, LDC has a set of microphones including shotgun, pressure zone, large diaphragm condenser, lavalier, head-mounted, and array microphones.

    LDC operates three computer telephony systems specifically for collecting speech from the telephone network. Each system is connected to a dedicated T-1 line, which provides 24 audio channels and has toll-free service enabled. The systems incorporate Dialogic telephony hardware; specifically, each system houses a Dialogic D/480JCT-2T1 telephony board which can perform interactive voice response (IVR) and call logging functions. In addition, one of the systems incorporates an AudioCodes DP6409 Passive-Tap call logging board. The telephony hardware provides the ability to record up to 12 two-person conversations simultaneously. Customized IVR software is installed on each system; the telephony application handles all interactions with callers, connects callers to one another, and starts and stops recordings. Each system also includes supporting software which handles automatic transfers of recordings to the main LDC network.
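
The figure of 12 simultaneous two-person conversations follows from the 24 channels of a T-1 line, since each caller occupies one channel. The sketch below shows that arithmetic together with a simple way of pairing waiting callers. The pairing logic and the function names are hypothetical and are not taken from LDC's IVR software.

```python
CHANNELS_PER_T1 = 24          # audio channels on one T-1 line (from the text)
PARTIES_PER_CONVERSATION = 2  # each caller in a two-person call uses one channel


def max_conversations(channels, parties=PARTIES_PER_CONVERSATION):
    """Number of simultaneous conversations that fit on the given channels."""
    return channels // parties


def pair_callers(waiting):
    """Pair callers in arrival order; an unpaired caller keeps waiting.

    Returns (pairs, leftover). This is an illustrative assumption about how
    callers might be connected to one another, not a description of LDC's
    telephony application.
    """
    pairs = [(waiting[i], waiting[i + 1]) for i in range(0, len(waiting) - 1, 2)]
    leftover = waiting[len(pairs) * 2:]
    return pairs, leftover


print(max_conversations(CHANNELS_PER_T1))
print(pair_callers(["caller-a", "caller-b", "caller-c"]))
```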

     


    Broadcast Data Collection Laboratories

    LDC operates an extensive collection system dedicated to the capture and processing of broadcast content from a wide range of sources. The system is able to collect audio and video from satellite, CATV, and off-the-air. The satellite reception facilities allow us to address up to three simultaneous C-Band and Ku-Band satellite downlinks, as well as Dish Network and DirecTV satellite downlinks.


    The C-Band and Ku-Band feeds are used primarily for DVB-S Free-To-Air and Conditional Access international programming from the Galaxy-19 and Galaxy IIIC satellites. The system currently includes twelve DVB-S satellite receivers, twelve Dish Network receivers, one SCOLA receiver and one DirecTV receiver. In the case of Dish Network, LDC maintains subscriptions to a wide range of international channels. The signal reception component of the collection system also incorporates two ATSC receivers and six CATV demodulators, all of which are computer controllable. These tuners are dedicated to local English programming from the Southeast Pennsylvania broadcast region.


    This wide array of signal reception equipment feeds a computer-controllable audio/video matrix switch which can route content from any receiver to video monitors, closed caption decoders, and the Recnode cluster. The Recnode cluster is a set of eight Linux computers, each of which can record two simultaneous audio/video streams. The A/V streams are captured as DV25, along with closed captions, and are then processed to extract audio and compressed to MPEG-4. In total, the Recnode cluster can log sixteen simultaneous audio/video streams, and can process up to 192 hours of content per day.


    All collection activity is driven by a supervisor computer with a customized scheduling database. The supervisor computer is responsible for controlling receivers, audio/video matrix routing, and recording job initialization. The system also incorporates 8 TB of local storage, dedicated automatic speech recognition systems, dedicated multimedia transcoding systems, a 24 TB LTO4 tape backup system, and two experimental logging systems which can be used to capture entire transponder transport streams from satellite downlinks. The Broadcast Collection system is designed to be highly modular, highly reliable, and fully automatic.
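
To make the scheduling role of the supervisor concrete, the following sketch assigns recording jobs to the sixteen recording slots implied by the text (eight Recnode computers with two streams each). The `Job` class, the `assign_jobs` function, and the greedy first-free-slot policy are all assumptions made for illustration; the source does not describe how LDC's scheduling database actually allocates jobs.

```python
from dataclasses import dataclass

NODES = 8             # Recnode computers (from the text)
STREAMS_PER_NODE = 2  # simultaneous streams per node (from the text)


@dataclass(frozen=True)
class Job:
    """A hypothetical recording job; times are hours since midnight."""
    name: str
    start: float
    end: float


def assign_jobs(jobs, nodes=NODES, streams_per_node=STREAMS_PER_NODE):
    """Greedily assign jobs to (node, stream) slots in order of start time.

    Returns (assigned, rejected): a dict mapping job name to its slot, and a
    list of job names for which no slot was free at the job's start time.
    """
    slots = [(n, s) for n in range(nodes) for s in range(streams_per_node)]
    free_at = {slot: 0.0 for slot in slots}  # time at which each slot frees up
    assigned, rejected = {}, []
    for job in sorted(jobs, key=lambda j: j.start):
        slot = next((s for s in slots if free_at[s] <= job.start), None)
        if slot is None:
            rejected.append(job.name)
        else:
            free_at[slot] = job.end
            assigned[job.name] = slot
    return assigned, rejected


# Seventeen overlapping one-hour jobs: only sixteen fit at once.
jobs = [Job(f"job{i}", 0.0, 1.0) for i in range(17)]
assigned, rejected = assign_jobs(jobs)
print(len(assigned), "assigned;", rejected, "rejected")
```

A simple first-free-slot policy like this does not try to maximize the number of accepted jobs; it only illustrates the capacity limit of sixteen simultaneous streams.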


    Facilities For Off-Site Broadcast Collection

    In addition to the primary broadcast collection system, LDC has deployed two portable broadcast collection platforms outside of the United States. Each portable platform is a TiVo-style digital video recording (DVR) system capable of recording two streams of A/V material simultaneously. The platform includes integrated analog CATV (NTSC and PAL) and digital satellite DVB-S reception components; it supports international specifications and is capable of recording programming outside of the United States. The system has a very small footprint and is suitable for transportation as a piece of carry-on luggage.

    The portable platform and the main LDC collection system share the same code base and rely on a modular, unified hardware specification. Improvements in the main collection platform therefore translate into benefits for both platforms. The portable system runs Ubuntu Linux, using a WinTV-PVR-500 for analog cable and a Technotrend Premium S-2300 PCI DVB-S receiver for DVB satellite reception. dvbstream is used for satellite recording, and ivtv is used for cable recording.
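
Cards supported by the ivtv driver, such as the WinTV-PVR-500, encode video in hardware and expose the encoded stream through a video device node, so a basic capture amounts to reading bytes from that node for a fixed duration. The sketch below illustrates the idea. It is not LDC's recording code; the `capture` function is hypothetical, and the device path in the usage comment (`/dev/video0`) is an assumption that depends on the host system.

```python
import time


def capture(device_path, output_path, seconds, chunk_size=64 * 1024):
    """Copy bytes from device_path to output_path for up to `seconds` seconds.

    Returns the number of bytes written. Stops early if the source reaches
    end of stream (for example, when a regular file is used for testing).
    """
    deadline = time.monotonic() + seconds
    written = 0
    with open(device_path, "rb") as src, open(output_path, "wb") as dst:
        while time.monotonic() < deadline:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written


# Example usage (assumed device path; requires an ivtv-supported card):
#   capture("/dev/video0", "programme.mpg", seconds=30 * 60)
```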

    The portable platform deployed in Hong Kong is currently dedicated to collecting multiple streams of CCTV programming and is maintained by local technical staff. The platform deployed in North Africa is maintained remotely by personnel at LDC. Recordings are scheduled from LDC and automatically downloaded to LDC’s collections server. In each case, LDC is able to collect high-quality broadcast data with minimal equipment and, in the case of data collected in North Africa, to receive that data immediately.


    Publications Laboratory

    The LDC Publications Group maintains a robust production capacity and can produce publications on a variety of media. The LDC Publications Laboratory is equipped with two Rimage CD/DVD duplicators and an OmniClone hard drive replicator. The CD/DVD duplicators have a capacity of two hundred discs and can print full-color labels with high-resolution graphics directly on the disc face. With the addition of a Blu-ray duplicator, LDC can now use high-capacity optical media, allowing for releases of up to 50 GB on a single disc. Large data sets may also be produced to span multiple discs, in which case each disc contains an install script to reassemble the parts.
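
The idea of spanning a large data set across discs and reassembling it on installation can be sketched as follows. This is an illustrative assumption about how such splitting might work, not LDC's actual packaging or install script; the function names and the `.partNNN` naming scheme are hypothetical. A checksum comparison confirms that the reassembled file matches the original.

```python
import hashlib
from pathlib import Path


def split_file(source, out_dir, part_size):
    """Split `source` into numbered parts of at most `part_size` bytes.

    Each part is read into memory whole, which is fine for a sketch but
    would need buffered copying for disc-sized parts.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, "rb") as src:
        index = 0
        while True:
            chunk = src.read(part_size)
            if not chunk:
                break
            part = out_dir / f"{Path(source).name}.part{index:03d}"
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def reassemble(parts, target):
    """Concatenate the parts, in name order, into `target`."""
    with open(target, "wb") as dst:
        for part in sorted(parts):
            dst.write(Path(part).read_bytes())


def sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

A typical use would be to call `split_file` with the capacity of the target medium as `part_size`, then compare `sha256` of the original and the reassembled file after running `reassemble`.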

    Publications uses a just-in-time inventory system with a web interface to provide quick and responsive fulfillment of orders. LDC can also produce very large publications on hard drives, producing up to fifteen copies at once. The hard drive replicator can also perform diagnostics, erase sensitive data from drives in bulk, and repair bad hard drives. This allows the Publications Group to maintain pools of reusable hard drives dedicated to specific projects. All systems employ hardware and software verification to ensure the reliability and quality of all releases.


    Software Development Infrastructure

     

    LDC's technical staff has developed a large amount of custom-built software for data collection, data processing, manual annotation of text, audio, image and video data (e.g., transcription, translation, named entity annotation, relation annotation), annotation workflow management, text indexing and searching, automatic annotation (e.g., language identification, content duplicate identification, segmentation, tokenization, tagging, morphological analysis), and quality control. These software resources are ready for reuse in similar future tasks. In particular, some of these resources, such as AGTK, are component-oriented and are specifically designed for reuse in various applications. LDC also has experience in using a wide range of third-party software for research, data production, and software development. All of these software resources are accessible from any of our centrally managed Linux and FreeBSD workstations via NFS file volumes. A newly developed application can be immediately deployed in data collection and annotation tasks by LDC staff members.

    LDC's software developers are equipped with desktop development workstations, computational servers, relational database servers, web servers, software development resources (e.g., various compilers, interpreters, debuggers, text editors, GUI-builders, IDEs, revision control systems), issue tracking systems, e-mail discussion lists, a wiki-based knowledge base and other documentation.




    Contact ldc@ldc.upenn.edu
    Last modified: Friday, 06-Jan-2012 16:53:27 EST
    © 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.