Penn GEF Americas Workshop

International Workshop on Data Intensive Research on Languages of the Americas

Organized by LDC with the support of the Penn Global Engagement Fund, this two-day workshop took place in Mexico City on May 24-25, 2018. It brought together linguists and computer scientists from Mexico, Brazil, Chile, Argentina, and the United States to discuss the opportunities and challenges of constructing and sharing language resources for the languages of the Americas, including Spanish, Portuguese, Caribbean dialects, and indigenous languages.

Penn GEF participant picture

Presentations highlighted data-intensive research on corpus and language resource creation, documentation of indigenous languages, speech technology, phonological analysis, and morphological analysis for a variety of languages, including Mexican Spanish, American Spanish, Brazilian Portuguese, Chuj, Tojolabal, Yucateco, Huasteco, Nahuatl, Wixarika, and Southern Cone languages.

Participants also discussed needs and strategies for building a community, along with an ongoing forum (for example, a yearly or biennial conference) where that community can meet, network, collaborate, and share resources and methods.


LDC Activities Related to Languages of the Americas 
Christopher Cieri, Mark Liberman, Denise DiPersio, Linguistic Data Consortium
Available: Slides in PDF

Coming Soon: the new CIEMPIESS datasets for speech recognition in Mexican Spanish
Carlos Mena, National Autonomous University of Mexico (UNAM)
Available: Slides in PDF

Building Resources for Human and Computational Language Processing of Portuguese 
Aline Villavicencio, Federal University of Rio Grande do Sul
Available: Slides in PDF

Finding the needle in the hay stack: Frustrations and lessons learned from collecting and annotating data
Thamar Solorio, University of Houston
Available: Slides in PDF

Documenting, archiving, and mobilizing Southern Cone languages (South America)
Lucía Golluscio, Universidad de Buenos Aires and Consejo Nacional de Investigaciones Científicas y Técnicas

Introducing NIEUW: Novel Incentives and Workflows for Eliciting Linguistic Data
Christopher Cieri and James Fiumara, Linguistic Data Consortium
Available: Slides in PDF

A set of Brazilian-Portuguese databases for speech synthesis
Alexandre Maciel, University of Pernambuco
Available: Slides in PDF

Measuring morphological similarities for low-resource languages
Alfonso Medina, El Colegio de México
Available: Slides in PDF

GlobalTIMIT: Progress and Prospects
Mark Liberman, Linguistic Data Consortium
Available: Slides in PDF

Speech Technology Research at LPTV
Néstor Becerra Yoma, Universidad de Chile
Available: Slides in PDF

Building Corpora in Portuguese    
Livy Real, GLiC - São Paulo University
Available: Slides in PDF

Large-scale analysis of Spanish /s/-lenition using audiobooks
Neville Ryant and Mark Liberman, Linguistic Data Consortium
Available: Paper in PDF

Mexican Indigenous Corpora
Ivan Meza, National Autonomous University of Mexico (UNAM)
Available: Slides in PDF

CORDIAM: A diachronic and diatopic corpus of American Spanish
Alexander Gelbukh and Grigori Sidorov, National Polytechnic Institute