LDC Institute

The LDC Institute is a seminar series on issues broadly related to linguistics, computer science, natural language processing and human language technology development. Featured speakers include researchers from LDC, the Penn community and distinguished scholars from around the globe.

2023

Repetition and Information Flow in Music and Language

David Temperley, Eastman School of Music

December 1, 2023; 11:00-12:30pm

The theory of Uniform Information Density makes two predictions regarding the use of repetition: (1) when a pattern is repeated, variable aspects of the pattern will be less probable in the second instance than in the first; (2) less probable patterns will have a higher tendency to be used repetitively. In this talk, Temperley presents recent research testing these two predictions with regard to language and music. Regarding prediction (1): in syntactically matched coordinate constructions (e.g., the big dog and the small cat), the second coordinate tends to have less frequent words than the first; in music, when a melodic pattern is immediately repeated in an altered form, the alterations tend to lower the schematic probability of the pattern (e.g., by increasing the interval size). Regarding prediction (2): in music, unusual melodic devices (such as escape tones and anticipations) tend to be used repetitively; in language, rare syntactic constructions show a higher tendency than common ones to be used repetitively (in coordinate constructions and elsewhere). Intriguingly, the syntactic constructions that are most often used repetitively (such as the construction Det Adj, e.g., the rich) tend to be associated with persuasive rather than informative discourse, implying an emotional commitment on the part of the speaker; this suggests a further connection with music.

A New Database, Family Tree and Origins Hypothesis for the Indo-European Language Family

Dr. Paul Heggarty, Pontificia Universidad Católica del Perú in Lima

September 6, 2023; 10:00-11:30am

A recent article in Science presents a new language database and family tree analysis of the Indo-European languages, and a new hypothesis on their origins and expansion. Indo-European is dated to some 8100 years ago, as a central estimate of when it began to spread and diverge. This date, and the family tree structure, fit with neither the Steppe nor the farming hypothesis for Indo-European origins. Instead, separate aspects of each combine into a new ‘hybrid’ hypothesis: Indo-European did not originate on the Steppe, but in the northern arc of the Fertile Crescent, and only some of its main branches in Europe came through the Steppe, as a secondary staging-post.

This talk sets out all aspects of this wide-ranging, cross-disciplinary research. It covers key issues in Indo-European studies; in cognacy databases; in methodology for Bayesian phylogenetic analysis of language families; and in how the ancient DNA record fits with a hybrid hypothesis of Indo-European origins.

A Conversation with Roberto Pieraccini and Mark Liberman

March 3, 2023; 11:30-1:00pm

Both Roberto Pieraccini and Mark Liberman started their careers more than 30 years ago at AT&T Bell Laboratories, one of the most respected US research institutions, and have continued to this day to pursue the technology and science of human language. This is an opportunity for them to share ideas about their work and experiences and to offer a critical view of the evolution of their respective fields and of how that evolution helped shape the current technological and scientific outlook. This session will be structured as a fireside chat in which the speakers elaborate on a number of questions asked by a moderator and conclude with open questions from the audience.

Roberto’s career highlights include positions in research (CSELT, IBM Research) and industry (SpeechWorks International, SpeechCycle, JIBO, Google) focusing on statistical natural language understanding and reinforcement learning for automated dialogue systems. Mark left Bell Labs as the Head of the Linguistics Research Department to join the University of Pennsylvania with appointments as Professor in the Department of Linguistics and the Department of Computer and Information Science. He is also LDC’s founder and Director. His research interests include, among others, corpus-based phonetics; speech and language technology; clinical applications of linguistic analysis; and the phonology and phonetics of lexical tone and its relationship to intonation.

2022

What is a linguistic variety? Linguistic coherence, variation and nominalisation processes in Cockney

Amanda Cole, University of Essex

November 30, 2022; 12:00-1:30pm

This talk approaches the deceptively complex question: what is a linguistic variety? In research on human language there is often much variability in how a given linguistic variety is defined or identified in terms of speakers’ demographic and social characteristics, and/or the constituent linguistic features. This creates potential problems in terms of research reproducibility and verifiability. In this talk, Amanda Cole considers a linguistic variety as being, firstly, spoken by a group of speakers with some shared characteristics and, secondly, linguistically coherent. This latter point means that a linguistic variety is not defined by any single linguistic feature, but instead, it includes many linguistic features which co-vary. Linguistic coherence does not require all speakers of the same variety to speak identically. Instead, a linguistic variety occurs on a continuum which includes internal variation from a centre of gravity but is sufficiently different to demarcate it from other varieties. Cole takes the case study of Cockney, an urban variety of southern British English. She presents data and results on linguistic coherence, linguistic variation and change, and linguistic boundary-marking and nominalisation processes in Cockney and related varieties to probe the linguistic and social nature of linguistic varieties.

LDC Institute Archive

2023

Repetition and Information Flow in Music and Language

A New Database, Family Tree and Origins Hypothesis for the Indo-European Language Family

A Conversation with Roberto Pieraccini and Mark Liberman

2022

What is a linguistic variety? Linguistic coherence, variation and nominalisation processes in Cockney

French CrowS-Paris: Extending a challenge dataset for measuring social bias in masked language models to a language other than English

2020

Describing typical language development in early childhood in South Africa: Harnessing local knowledge through online technologies

2019

Construction and Analysis of the Chinese Abstract Meaning Representation Corpus

A Tutorial on Finite-State Text Processing

2018

Transcribing and Sorting Cairo Geniza Fragments in Partnership with Citizen Humanists: Scribes of the Cairo Geniza

Boundary-Based MWE Segmentation and Applications

2017

Introduction to Beijing Advanced Innovation Center for Language Resources (ACLR): Objective, Mission, and Projects

Corpus of Political Speeches in Greater China

Language Resources from the LLT Group at the Hong Kong Polytechnic University: From phonological neighbourhood to semantic relata, from grammar to emotion

A Framework for Conducting Non-Expert Translations and Summarizations

2016

The Growth in Grammar Corpus: On Working with Children (But not Animals)

The Language Grid: Multi-Language Service Platform for Intercultural Collaboration

Social Data Research at a National Laboratory

2015

Multimodal Interaction Standards at the World Wide Web Consortium

2014

Comparing Dialect and Accented Pronunciations on the Basis of Transcriptions and Articulography

Aikuma: A Mobile App for Collaborative Language Documentation

2013

The Corpus of Interactional Data: A Large Multimodal Annotated Resource

2012

The Sociolinguistic Archive and Analysis Project: Data, Tools and Applications

2011

Building a universal corpus of the world's languages

Coding Conventions for Archival Sharing

Free recall of word lists; empirical and theoretical issues

Contact, Restructuring, and Decreolization: The Case of Tunisian Arabic

2010

Language Technology Resources for Sanskrit and other Indian Languages at Jawaharlal Nehru University, India

Bibliotheca Alexandrina: The oldest library in the digital age

Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina

U.S. Supreme Court Corpus (SCOTUS)

2009

Variations Across Languages, Divisions Within Communities: Languages, Schools and the Internet in Tunisia

The LDC Standard Arabic Morphological Tagger

Building an ASL Corpus Project

2008

Development of Resources and Techniques for Processing of Some Indian Languages

2007

HTML Templates for LDC Sponsored Projects

Speaking Arabic in Iraq and the Middle East: Reflections on Three Tours of Duty

Programming Specifications: Procedures and Practices

2006

Comparing Linguistic Annotations -- Issues in Harmonization and Quality Control

Recording and Annotation of Speech Data via the WWW - A Case Study

LDC Online

Pros and Cons of Different Annotation Workflow Systems

Recent Trends in Annotation Tool Development at LDC

2005

Building a Lexicon Database for Arabic Dialects

Less Commonly Taught Languages (LCTLs)

The Teaching of Berber in Morocco: Reality and Perspectives

Functional Morphology

2004

Arabic Propbank

Project Santiago

Tongue-Tied in Singapore: A Language Policy for Tamil?

The Contextualization of Linguistic Forms across Timescales

Interfaces for Parser and Dictionary Access

2003

Finite State Morphology using Xerox Software

Searching through Prague Dependency Treebank

The Pennsylvania Sumerian Dictionary Project

Arabic Language: Issues and Perspectives

Collections

Data and Annotations for Sociolinguistics (DASL): Using digital data to address issues in sociolinguistic theory

Towards a Comprehensive, Empirical Analysis of Linguistic Data: the case of Regional Italian vowel systems

(1) BITS and other Machine Translation Collection Projects
(2) Overview of Machine Translations
(3) BITS and other Machine Translation Collection Projects