| Archive Name: | University of Helsinki Language Corpus Server |
| Archive URL: | http://www.ling.helsinki.fi/uhlcs/ |
| Host Institution: | University of Helsinki, Department of General Linguistics |
| Country: | Finland |
| Contact Person: | Pirkko Suihkonen, University of Helsinki, Department of General Linguistics / Max Planck Institute for Evolutionary Anthropology, Department of Linguistics, Inselstrasse 22; Tel. +49-(0)341-9952 328; Fax: +49-(0)341-9952 119 |
| Email Address: | suihkonen@eva.mpg.de, suihkone@ling.helsinki.fi |
2.1
If the archive has a catalog in a standardized format, what fields does it
contain? If not, what contextual information about the resources is
collected? What other information would you like to collect if you could?
There is no catalog in a standardized format. Information on the data is available on the databank's website. In most cases, the web pages give information on the owners of the corpora and, if the data have been published, on the publication. The location of each corpus in the databank and the names and addresses of the contact persons are also given.
2.2
If the electronic catalog conforms to some standard, please tell
us the name of the standard.
2.3
To what extent have the archived materials been cataloged
electronically?
2.4
If there is an online public access catalog, please give its URL.
http://www.ling.helsinki.fi/uhlcs/data/corpora.html (information on the data is linked from the names of the languages)
3.1 What geographical regions and languages are covered?
| Main Regions Covered: | Asia, Europe |
| Approx Number of Languages: | 60 |
| Main Languages: | Uralic languages including Finnish, Turkic languages, Swedish, English, German, Russian, Iranian languages |
3.2 Please give impressionistic estimates of the archive holdings for each of the data types.
3.3
Please list any other data types which are not included above,
or any other comments on the archive holdings:
There are quite a lot of data that were originally written in Cyrillic characters. Plans for converting the data to Unicode are in progress, but none of that work has been finished. Unicode should work in the editor (the Emacs editor is available on the UHLCS) and also in the various phases of research work. The server also hosts tools for research work, such as automatic morphological analyzers for Finnish, Swedish and English.
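As a rough illustration of the kind of conversion planned here, the sketch below re-encodes legacy Cyrillic bytes as UTF-8 in Python. The source encoding (KOI8-R) is purely an assumption, since the survey response does not say which Cyrillic encodings the UHLCS data use; the function name `to_utf8` is likewise hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str = "koi8_r") -> bytes:
    """Re-encode legacy Cyrillic bytes as UTF-8.

    source_encoding is an assumption: the actual corpora may use a
    different encoding (for example cp1251 or iso8859_5).
    """
    # Single-byte encodings accept almost any byte sequence, so a wrong
    # guess yields plausible-looking but incorrect text rather than an
    # error; converted files should therefore be spot-checked.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Example: the word "удмурт" stored as KOI8-R bytes.
legacy = "удмурт".encode("koi8_r")
converted = to_utf8(legacy)
print(converted.decode("utf-8"))
```

The same function covers other legacy encodings by passing a different `source_encoding`, which is why the encoding is a parameter rather than hard-coded.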
3.4
What proportion of the holdings are unique to
the archive and not available elsewhere?
a significant amount
4.1
To what extent are the archive holdings published
electronically, where "published" means that there is
a well-defined procedure such that
anyone at all can get a standard copy of the data,
either on digital media or over the internet?
nothing published
4.2
To what extent are the archive holdings accessible over the web?
a significant amount
4.3
Is permission required before materials can be accessed?
virtually always
4.4
Is there any fee for materials?
no
4.5
How are author and/or editor defined for the electronic publications?
Is there a bibliographical citation method?
No special citation method is recommended. The language directories contain a README file in which researchers are asked to mention in research papers and reports if the corpora have been used as research material. In such reports, at least the name and owner of the corpora and the name of the databank must be mentioned.
4.6
Do the electronic publications have ISBN numbers?
4.7
What plans are there to expand the electronic publication of archive holdings?
In the UHLCS, there are, e.g., novels that could be published electronically. In practice, this is not possible, because the UHLCS does not have the financial resources to develop the databank in that way. At present, to increase the accessibility of the data, work on defining metadata for the data located on the UHLCS is in progress.
5.1
Who is the legal owner of archived materials?
The data are owned by various institutions or individual researchers. For instance, some data are owned by the researchers who encoded and edited them. There are also data owned by the Institute for Bible Translation, or by the printing houses and authors of books that have been given to the UHLCS.
5.2
Beyond legal ownership,
are there any asserted or perceived moral rights concerning
archived materials?
Do the holders of the archive see the original speakers or
their representatives as controlling publication?
Data are deposited in the UHLCS under a special agreement signed by the Department of General Linguistics, which represents the UHLCS, and the owners of the data. In this agreement, the UHLCS undertakes to follow normal copyright rules in the use of the data. The data are received for research work and teaching, and, for instance, data cannot be moved from the UHLCS without the permission of the owner of the data.
5.3
In cases where no electronic publication is planned, why is this so?
(e.g. funding, licensing, technical know-how, lack of interest).
As mentioned above, the UHLCS does not have the financial resources to develop the databank further. The Department of General Linguistics maintains the UHLCS, and numerous researchers and students use the data located on it.
5.4
Is any of the data in a proprietary format (e.g. MS Word)? If so,
are there plans to transfer it to an open standard (e.g., XML)?
There is a large amount of data in SGML format. There are also a lot of data that are only running text, and work on transferring the data to XML format is in progress. This work should be finished in 2000.
6.
Do you have any other comments about digital archives of
language material, or on this survey?
There is a linguistic laboratory at the Udmurt State University in Izhevsk, Udmurtia, Russia, which holds good collections of Udmurt data. The contact person is Prof. Nasibullin. The address is: Udmurt State University, Universitetskaya st. 1, 426 034 Izhevsk, Russia.