Introduction
CETEMPublico Version 1.7 (Corpus de Extractos de Textos Electronicos
MCT/Publico) is a corpus of newspaper texts from the Portuguese daily
newspaper Publico. It was produced by the Linguistic Data Consortium (LDC) as
catalog number LDC2001S04, ISBN 1-58563-206-6. The corpus was compiled for
research and development in natural language processing (NLP) by the
Computational Processing of Portuguese Project, under an agreement between
Publico and the Portuguese Ministry of Science and Technology (MCT).
Data
The corpus includes the text of approximately 2,600 editions of Publico,
published between 1991 and 1998, and amounts to approximately 180 million
words. CETEMPublico Version 1.7 contains 1,504,258 extracts (CETEMPublico
Version 1.0 had 1,567,625). Version 1.7 was created in Oslo on August 6, 2001
and uses SGML tagging. The corpus is stored in 196 compressed text files named
cetemXXX.gz, numbered from cetem001.gz to cetem196.gz.
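
The file naming scheme above lends itself to a short script. The following is a minimal Python sketch, not part of the corpus distribution: the function names, the assumption that the files are gzip archives (inferred from the .gz extension), and the Latin-1 default encoding are all illustrative and should be checked against the corpus documentation.

```python
import gzip
from pathlib import Path
from typing import Iterator

NUM_FILES = 196  # cetem001.gz through cetem196.gz, as stated in the description


def corpus_file_names(count: int = NUM_FILES) -> list[str]:
    """Return the expected file names, zero-padded to three digits."""
    return [f"cetem{i:03d}.gz" for i in range(1, count + 1)]


def iter_lines(path: Path, encoding: str = "latin-1") -> Iterator[str]:
    """Yield lines from one compressed file without unpacking it to disk.

    The encoding default is an assumption; adjust it if decoding looks wrong.
    """
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Reading line by line keeps memory use low, which matters for a corpus of this size. A caller would typically loop over corpus_file_names(), join each name to the local corpus directory, and pass the resulting path to iter_lines().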
This corpus was designed for researchers who develop computer programs that
process the Portuguese language and who need raw material for their work. The
authors also intended the corpus to be useful to anyone who studies the
Portuguese language and wishes to test hypotheses against previously organized
text material. The online and CQP versions are meant for such users, who are
also welcome to obtain the corpus on CD and process it locally with the corpus
processing system of their choice.
More detailed information is available at http://www.linguateca.pt/cetempublico.
Updates
There are no updates at this time.