|

|
|
ECI Multilingual Text
| |
| Item Name: | ECI Multilingual Text |
| Authors: | LDC |
| LDC Catalog No.: | LDC94T5 |
| ISBN: | 1-58563-033-3 |
| Data Type: | text |
| Data Source(s): | broadcast conversation, broadcast news, dictionaries, journal articles, news magazine, newswire, varied |
| Application(s): | information retrieval, language modeling, machine translation |
| Language(s): | Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish |
| Language ID(s): | ALS, BUL, DAN, DEU, ELL, ENG, EST, FRA, GLA, JPN, LAT, LIT, NNO, NOB, NOR, POR, RUS, SLV, SRP, SWE, TUR |
| Distribution: | 1 DVD |
| Member fee: | $0 for 1994 members |
| Non-member Fee: | US$75.00 |
| Reduced-License Fee: | US$75.00 |
| Extra-Copy Fee: | US$75.00 |
| Non-member License: | yes |
| Member License: | yes |
| Readme File: | yes |
| Online documentation: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | LDC. 1994. ECI Multilingual Text. Linguistic Data Consortium, Philadelphia. |
|
| The first release of the European Corpus Initiative, the Multilingual
Corpus 1 (ECI/MCI), comprises 46 subcorpora in 27 (mainly European)
languages, totalling roughly 92 million (lexical) words. The corpora
are marked up in TEI P2 conformant SGML (to varying levels of detail),
and the source text is also easily accessible without markup. Twelve of
the component corpora are multilingual parallel corpora, each with two
to nine subcorpora. All the alphabetic corpora (the collection also
includes some Japanese and Chinese) are encoded in 8-bit character sets
of the ISO 8859 family (ISO 8859-1, -5 and -7). The CD-ROM is in High
Sierra format (ISO 9660), readable on at least UNIX, MS-DOS and Apple
systems.
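As a rough illustration of working with text encoded this way, the Python sketch below decodes raw bytes with one of the three ISO 8859 character sets named above and removes anything that looks like an SGML tag. The `plain_text` helper, the `CODECS` mapping and the sample tags are hypothetical and are not part of the corpus distribution; which character set applies to which subcorpus has to be taken from the corpus documentation. The regular expression is only a crude approximation of SGML, and SGML entity references are left untouched.

```python
import re

# Python codec names for the three character sets named in the description.
# The keys are illustrative labels, not identifiers used by the corpus.
CODECS = {"latin": "iso8859_1", "cyrillic": "iso8859_5", "greek": "iso8859_7"}

# Crude approximation: treat anything between '<' and '>' as markup.
TAG = re.compile(r"<[^>]*>")


def plain_text(raw: bytes, script: str) -> str:
    """Decode raw bytes with the chosen charset and drop tag-like spans."""
    text = raw.decode(CODECS[script])
    return TAG.sub("", text)


# Invented sample bytes, not taken from the corpus.
sample = b"<p>\xc7a va tr\xe8s bien</p>"
print(plain_text(sample, "latin"))  # prints: Ça va très bien
```

Decoding with the wrong character set does not raise an error for most byte values, so a mismatch shows up only as wrong characters in the output.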
The amount of material per language varies, from about 36 million
words (German) to about 5 thousand words (Bulgarian). The majority of
sources are journalistic in nature (newspapers, magazines,
broadcasts); additional sources include dictionaries (Albanian,
Gaelic, Turkish, Japanese/English), literature, technical reports and
proceedings or publications of international organizations. The table
below lists the languages included, the subcorpus numbers for each
language (in parentheses), the amount of data per subcorpus and the
total per language, all in thousands of lexical words.
Language  (Subcorpus #) Kwords ...  Total
German  (70) 34291  (09) 191  (65) 20  (28) 187  (29) 59  (30) 76  (47) 24  (59) 50  (71) 21  (70A) 999  35918
French  (31) 4775  (04) 4121  (28) 187  (29) 59  (30) 76  (47) 24  (51) 6  (59) 50  (71) 21  (32) 1667  10986
Spanish  (31) 4500  (13) 830  (14) 1041  (15) 447  (47) 24  (32) 1667  (59) 50  (71) 21  8580
English  (31) 4222  (36) 1141  (74) 95  (28) 187  (47) 24  (51) 6  (56) 97  (59) 50  (71) 21  (32) 1667  7510
Dutch  (03) 5500  (02) 600  (47) 24  (71) 21  6145
Czech  (44) 4726  4726
Italian  (11) 3518  (42) 303  (58) 13  (29) 59  (30) 76  (47) 24  (71) 21  4014
Chinese  (78) 2895  2895
Greek  (10) 2515  (47) 24  (59) 50  (71) 21  2610
Norwegian  (41) 2226  2226
Swedish  (37) 1718  1718
Serb/Croat/Slov  (24) 700  (56) 289  989
Tibetan  (76) 834  834
Portuguese  (60) 675  (47) 24  (71) 21  720
Malay  (80) 563  563
Russian  (73) 364  364
Japanese  (57) 203  203
Turkish  (20) 173  (20A) 110  283
Albanian  (82) 205  205
Gaelic  (55) 141  141
Estonian  (39) 100  100
Uzbek  (81) 88  88
Latin  (74) 75  75
Danish  (47) 24  (71) 21  45
Lithuanian  (89) 20  20
Bulgarian  (84) 5  5
Total  91969
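The per-language totals in the table can be checked mechanically. The Python sketch below assumes each row has been joined onto one line consisting of a language name, a series of "(subcorpus) Kwords" pairs, and the listed total as the last number. The `check_row` helper is hypothetical and only illustrates the arithmetic.

```python
import re

# One "(subcorpus) Kwords" pair, e.g. "(03) 5500" or "(70A) 999".
PAIR = re.compile(r"\((\w+)\)\s+(\d+)")


def check_row(row: str) -> tuple[int, int]:
    """Return (sum of subcorpus Kwords, listed total) for one table row."""
    pairs = PAIR.findall(row)
    listed_total = int(row.split()[-1])  # the total is the last number
    return sum(int(kwords) for _, kwords in pairs), listed_total


print(check_row("Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145"))
# prints: (6145, 6145)
```

A row whose two values differ is worth re-reading against the original documentation before the figures are relied on.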
Content Copyright |
|
|