Penn GEF Americas Workshop
International Workshop on Data Intensive Research on Languages of the Americas
Organized by LDC with the support of the Penn Global Engagement Fund, this two-day workshop in Mexico City on May 24-25, 2018, brought together linguists and computer scientists from Mexico, Brazil, Chile, Argentina, and the United States to discuss the opportunities and challenges of constructing and sharing language resources in the languages of the Americas, e.g., Spanish, Portuguese, Caribbean dialects, and indigenous languages.
Presentations highlighted data-intensive research on corpus and language resource creation, documentation of indigenous languages, speech technology, phonological analysis, and morphological analysis for a variety of languages including Mexican Spanish, American Spanish, Brazilian Portuguese, Chuj, Tojolabal, Yucateco, Huasteco, Nahuatl, Wixarika, and Southern Cone languages.
Participants also discussed needs and strategies for building a community and establishing an ongoing forum (for example, a yearly or biennial conference) where that community can meet, network, collaborate, and share resources and methods.
Presentations
LDC Activities Related to Languages of the Americas
Christopher Cieri, Mark Liberman, Denise DiPersio, Linguistic Data Consortium
Available: Slides in PDF
Coming Soon: the new CIEMPIESS datasets for speech recognition in Mexican Spanish
Carlos Mena, National Autonomous University of Mexico (UNAM)
Available: Slides in PDF
Building Resources for Human and Computational Language Processing of Portuguese
Aline Villavicencio, Federal University of Rio Grande do Sul
Available: Slides in PDF
Finding the needle in the hay stack: Frustrations and lessons learned from collecting and annotating data
Thamar Solorio, University of Houston
Available: Slides in PDF
Documenting, archiving, and mobilizing Southern Cone languages (South America)
Lucía Golluscio, Universidad de Buenos Aires and Consejo Nacional de Investigaciones Científicas y Técnicas
Introducing NIEUW: Novel Incentives and Workflows for Eliciting Linguistic Data
Christopher Cieri and James Fiumara, Linguistic Data Consortium
Available: Slides in PDF
A set of Brazilian-Portuguese databases for speech synthesis
Alexandre Maciel, University of Pernambuco
Available: Slides in PDF
Measuring morphological similarities for low-resource languages
Alfonso Medina, El Colegio de México
Available: Slides in PDF
GlobalTIMIT: Progress and Prospects
Mark Liberman, Linguistic Data Consortium
Available: Slides in PDF
Speech Technology Research at LPTV
Néstor Becerra Yoma, Universidad de Chile
Available: Slides in PDF
Building Corpora in Portuguese
Livy Real, GLiC - São Paulo University
Available: Slides in PDF
Large-scale analysis of Spanish /s/-lenition using audiobooks
Neville Ryant and Mark Liberman, Linguistic Data Consortium
Available: Paper in PDF
Mexican Indigenous Corpora
Ivan Meza, National Autonomous University of Mexico (UNAM)
Available: Slides in PDF
CORDIAM: A diachronic and diatopic corpus of American Spanish
Alexander Gelbukh and Grigori Sidorov, National Polytechnic Institute