
Linguistic Resources  
Use of LDC Corpora by Students

Ways LDC corpora have been used for student research and for teaching purposes at university summer school programs.

Use of LDC Data by High School Students -- June 16, 2010

Last month, LDC announced our new LDC Data Scholarship program which will provide university students with no-cost access to LDC data.  Please stay tuned for further announcements and application materials for the fall 2010 program cycle.  LDC data is used for research by students at over 1000 universities and colleges across the globe. What our data users may not know is that even younger students are using LDC data in their research. Our databases have been used by high school (secondary school) students for science fair projects.  We’d like to highlight two intriguing examples. 

Carmel High School, Carmel, CA, USA.  Dylan Freedman, a student at Carmel High School in Carmel, CA, investigated the creation of an efficient English text compression algorithm based on the n-grams contained in the Web 1T 5-gram Version 1 (LDC2006T13) database. The algorithm he developed worked particularly well on small text files around 100 bytes long; for such files the average compression ratio was about 30%, whereas traditional compression algorithms were not able to compress such small files.

Dylan worked with two advisers, Craig Martell and George Dinolt, from the Naval Postgraduate School in Monterey, CA, who assisted in developing project ideas and in supplying the formal terminology for some of the natural language processing ideas he implemented. For example, although he didn’t originally know the term by name, Dylan used a Katz back-off model when querying the n-gram database to rank how often a word would occur given a few preceding context words.
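To illustrate the general idea of backing off, here is a minimal sketch in Python. It is not Dylan's implementation and it is not a full Katz model: Katz back-off also applies discounting and back-off weights to produce probabilities, while this simplified version only ranks candidate next words by raw count and falls back to a shorter context when the longer one has no matches. The table NGRAM_COUNTS and the function rank_next_words are hypothetical names with toy data, standing in for a large n-gram database.

```python
from collections import Counter

# Toy n-gram counts (hypothetical data standing in for a large n-gram table).
NGRAM_COUNTS = {
    ("the", "quick", "brown"): 3,
    ("the", "quick", "red"): 1,
    ("quick", "brown"): 5,
    ("quick", "fox"): 2,
    ("brown",): 7,
    ("fox",): 4,
    ("red",): 2,
}


def rank_next_words(context, counts, max_order=3):
    """Rank candidate next words by count, backing off to shorter contexts.

    Simplified illustration only: no discounting or back-off weights,
    so the result is a ranking, not a probability distribution.
    """
    keep = max_order - 1
    context = tuple(context)[-keep:] if keep > 0 else ()
    while True:
        n = len(context) + 1
        # Collect every n-gram of the right order whose prefix matches the context.
        candidates = Counter({
            ngram[-1]: count
            for ngram, count in counts.items()
            if len(ngram) == n and ngram[:-1] == context
        })
        # Stop when something matched or there is no context left to drop.
        if candidates or not context:
            return [word for word, _ in candidates.most_common()]
        context = context[1:]  # back off: drop the oldest context word


if __name__ == "__main__":
    print(rank_next_words(["the", "quick"], NGRAM_COUNTS))
    print(rank_next_words(["a", "quick"], NGRAM_COUNTS))
```

In this sketch, the context "the quick" is found among the trigrams, while "a quick" has no trigram match and is answered from the bigrams that begin with "quick".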

His project, A Novel Approach to Text Compression Using N-Grams, won the Grand Prize at the Monterey County Science Fair, which enabled Dylan to participate in the Intel International Science and Engineering Fair (Intel ISEF) earlier this year in San Jose, CA.  At Intel ISEF, Dylan won awards from the IEEE Computer Society and the Office of Naval Research.

Albuquerque Public Schools, Albuquerque, NM, USA.  In 2005, students enrolled in a Computer Science class in the Career Enrichment Center at Albuquerque Public Schools, Albuquerque, NM, used TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and The CMU Kids Corpus (LDC97S63) for a science fair project. A team of three students used a neural network to see if they could identify characteristics of a speaker; in particular, they wanted to see if they could differentiate speakers by age (above or below a certain age) and by gender. In order to complete the project, the students trained the neural network using the LDC speech data.  The students received recognition at both the district and state levels of the New Mexico Science and Engineering Fair.

LDC would like to thank the high school students who contributed to this article and wish them continued success in their studies!


LSA Summer Institute and LDC Corpora -- August 17, 2007

The LDC was pleased to provide access to several LDC corpora for students at the Linguistic Society of America (LSA) 2007 Summer Institute. This year's institute, entitled 'Empirical Foundations for Theories of Language', was hosted by Stanford University. The institute drew researchers and students from across the globe and included a number of courses that provided students with hands-on experience in working with linguistic data. The following examples, which demonstrate how large-scale databases can be used for teaching purposes, were submitted by course instructors at the LSA 2007 Summer Institute:

In "Information Structure and Word Order Variation", taught by Betty J. Birner and Gregory Ward, students used LDC corpora to collect tokens of various constructions displaying non-canonical word order, with the goal of discovering how various categories of information are distributed in these non-canonical constructions. Among the corpora used were the Treebank and Brown Corpus (including the Wall Street Journal), and Switchboard.

In “Pronunciation Variation and Psycholinguistics”, taught by Susanne Gahl, students examined pronunciation variants and fluctuations in speaking rate in the Switchboard corpus, with the aim of understanding the mechanisms underlying human language production and comprehension.

For "Paraphrase and Usage", taught by Annie Zaenen, Cathy O'Connor, and Tom Wasow, students were required to carry out a small corpus study in order to receive credit. The focus of the class was grammatical alternations and the factors that determine their relative frequencies. The purpose of the project requirement was to give students hands-on experience in exploring usage data. Students used a variety of corpora for their projects, including the Treebank and TIPSTER.

The LDC looks forward to collaborating with LSA for future institutes.


EMLS Summer School -- July 21, 2006


European Masters in Language and Speech (EMLS) is a network of European universities providing education in natural language processing and speech communication sciences. EMLS organizes regular summer schools, which attract considerable interest from students in both the NLP and speech processing domains.

For this year’s summer school in Utrecht (NL), members of the Speech@FIT group (Faculty of Information Technology, Brno University of Technology, Czech Republic) prepared two tutorials making use of LDC corpora:

- In “Speech recognition based on Hidden Markov Models”, given by Jan “Honza” Cernocky, students built a recognizer of connected digits using HTK tools. The recognizer is comparable to the Aurora-ETSI standard and, on clean data, achieves more than 99% word accuracy. The TIDIGITS database from LDC was used in this tutorial.

- In “Phoneme posterior estimation and acoustic keyword-spotting”, given by Igor Szoke, students got acquainted with the theory and practice of phoneme recognition and posterior estimation by neural networks, and with their use in acoustic keyword spotting. The tutorial was based on one of the LDC classics: TIMIT.

LDC supported these two tutorials by providing the data. While the use of TIDIGITS was limited to the EMLS summer school, EMLS students were offered in-kind copies of TIMIT, including the documentation, for home use.

More Information:
EMLS homepage
Utrecht summer school page
Speech@FIT home




Contact ldc@ldc.upenn.edu
Last modified: Monday, 24-Mar-2008 12:30:44 EST
© 1992-2007 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.