Information for (and from) the TIDES data committee: a work in progress.
Primary focus is on immediate (year one) needs.
If you have any questions, additions or corrections, please contact Mark Liberman.
Multilingual research in TIDES year one is focused on Mandarin, Korean and Spanish. This is a sketch of basic resources in those languages, currently or potentially available to TIDES researchers for year one research. For the purposes of this inventory, "TIDES year one" is taken to start 1/1/2000.
In the tables below, "KW" means "thousand words", "MW" means "million words"; "KCh" means "thousand (Chinese) characters", "MCh" means "million (Chinese) characters". For parallel corpora, where only one count is given, it applies to each of the parallel languages, not their sum.
Unless otherwise indicated, Mandarin resources are in GB encoding.
Summary:
Major year one needs -- i.e. needed NOW or within two or three months, not in a year or two:
| | Mandarin | Korean | Spanish |
| Parallel Text 1/1/2000 (in hand) | 25 MW | 0-50 KW | 61 MW |
| Parallel Text 1/1/2001 (in hand or credibly promised) | 35 MW | 400-450 KW | 61 MW |
| Monolingual Text 1/1/2000 | 599 MCh =~ 327 MW | 23 MW | 471 MW |
| Monolingual Text 1/1/2001 | 696 MCh | 33 MW | 567 MW |
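The summary equates 599 MCh of Mandarin text with roughly 327 MW, which implies about 1.83 characters per word. As a rough sketch, the conversion can be written as below. The helper name `mch_to_mw` is hypothetical, and applying the same ratio to the 1/1/2001 figure is an assumption rather than a number given by the committee.

```python
# Ratio implied by the summary table: 599 MCh =~ 327 MW.
CHARS_PER_WORD = 599 / 327  # about 1.83 characters per word


def mch_to_mw(mch: float, chars_per_word: float = CHARS_PER_WORD) -> float:
    """Convert millions of Chinese characters to approximate millions of words."""
    return mch / chars_per_word


if __name__ == "__main__":
    # The 1/1/2001 conversion is an extrapolation, not a figure from the table.
    for label, mch in [("1/1/2000", 599), ("1/1/2001", 696)]:
        print(f"{label}: {mch} MCh =~ {mch_to_mw(mch):.0f} MW")
```

The ratio depends on the word segmentation used for the original estimate, so treat the result as an order-of-magnitude guide.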
Mandarin parallel text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| Hong Kong Legal Code | 11.5 MCh (Mandarin) 6.3 MW (English) | OK | LDC | N/A | 11.5 MCh/ | Big 5 encoding. Sentence aligned. |
| Hong Kong News | 17 MCh (Mandarin) 9.3 MW (English) | OK | LDC | 900 KCh/mo. 500 KW/mo. | 27.8 MCh/ 15.3 MW | Big 5 encoding. Story aligned, will be sentence aligned. |
| Hong Kong Hansard | 17.5 MCh (Mandarin) 9.2 MW (English) | OK | LDC | 600 KCh/mo. | 24.7 MCh/ 12.8 MW | Big 5 encoding. Will be sentence aligned. |
| FBIS, various sources | None | English OK (?), Mandarin to be determined | Not yet | ~40 KW/mo. starting ~2/10/2000 | ~420 KW | Source accumulation has started ~2/10/2000 (?); processes for normalization, alignment etc. need to be established |
| STRAND, various internet sources (Phil Resnik/Jinxi Xu) | 3289 URL pairs 17.6 MW (English) | No IPR permissions; others can download from specified URLs | URL list may be available | (unknown) | Probably overlaps considerably (almost entirely?) with LDC Hong Kong data |
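The projected amounts for the Hong Kong sources above are consistent with the start-of-year archive plus twelve months of accumulation at the stated monthly rate. The following is a minimal sketch of that arithmetic. The function names `project` and `implied_monthly_rate` are hypothetical, and the twelve-month horizon is an assumption based on the year-one definition.

```python
def project(start: float, monthly_rate: float, months: int = 12) -> float:
    """Projected end-of-year amount: archive plus accumulation over `months`."""
    return start + monthly_rate * months


def implied_monthly_rate(start: float, end: float, months: int = 12) -> float:
    """Monthly rate implied by a start and end amount (used when no rate is listed)."""
    return (end - start) / months


if __name__ == "__main__":
    # Figures copied from the table above (MCh for Mandarin, MW for English).
    print(f"Hong Kong News, Mandarin: {project(17.0, 0.9):.1f} MCh")
    print(f"Hong Kong News, English: {project(9.3, 0.5):.1f} MW")
    print(f"Hong Kong Hansard, Mandarin: {project(17.5, 0.6):.1f} MCh")
    # No English rate is listed for the Hansard; this is only what 9.2 -> 12.8 MW implies.
    print(f"Hong Kong Hansard, English rate: {implied_monthly_rate(9.2, 12.8):.2f} MW/mo.")
```

The same calculation applies to the monolingual sources that list a monthly accumulation rate.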
Korean parallel text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| DLI U.S. Army translator training texts entered at Penn (M. Palmer) | ~50 KW | DLI has not released it | unknown | none | ~50 KW | Domain is "ground troop movements". Both English and Korean are being Treebanked. |
| FBIS, various sources | None | English OK (?), | Not yet | >40 KW/mo. | ~400 KW | Source accumulation will start soon (?); processes for normalization, alignment etc. need to be established |
| BITS (web search) | None | To be determined | unknown | unknown | ~6 MW (guess based on preliminary survey) | This is the system that found and document-aligned the Mandarin parallel corpora |
Spanish parallel text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| UN document archive | 47 MW | OK | LDC | None | 47 MW | Published by LDC. Topics very different from journalistic text. |
| ECI Corpus | ITU CCITT Handbook: 4.5 MW | OK | LDC | None | ~6.6 MW | Published by LDC |
| ELRA-0007 EC 9-language corpus | 1.1 MW (1993 Q&A) 5-8 MW (1992-94 Parliamentary debates) | OK | ELRA | None | ~7 MW | Published by ELRA. Some parts have been variously annotated in the JOC corpus. |
| FBIS, various sources | None | English OK (?), Spanish to be determined | Not yet | Unknown; information promised for 2/18/2000 | Unknown | FBIS Latin American Bureaus to begin using new computer system 2/16/2000; source accumulation can start after that |
| WHO parallel corpus | ? | ? | Unclear | ? | ? |
Mandarin monolingual text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| People's Daily | 147 MCh | OK | LDC | None | 147 MCh | 1991-1996 |
| China Radio | 110 MCh | OK | LDC | None | 110 MCh | 1991-1996 |
| Xinhua | 158 MCh | OK | LDC | 2.15 MCh/mo. | 184 MCh | 1994-1999 |
| CNA | 102 MCh | OK | LDC | 3.2 MCh/mo. | 140 MCh | 1996-1999 |
| Zaobao | 77.7 MCh | OK | LDC | 2.8 MCh/mo. | 111 MCh | |
| (1997 Mandarin Broadcast - Sources?) | ? | OK | LDC | None | ? | Broadcast transcriptions (30 hours) |
| TDT2 Mandarin Text | ? | OK | LDC | None | ? | |
| Conversational transcripts | ? | OK | LDC | None | ? | 240 conversations, ? hours transcribed |
| ECI Corpus/Xinhua | 3.75 MCh | OK | LDC | None | 3.75 MCh | 1/1990-3/1991 |
| (Some Hong Kong Newspaper) | ? | ? | ? | None | ? | 1998 -- Obtained by Donna Harman for TREC -- details to come |
| Academia Sinica balanced corpus | ~5 MW | "Academic research" only | ROCLING | None | ~5 MW | Big 5 encoding |
Korean monolingual text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| Korean Press Agency | 23.2 MW | OK | LDC | 33 KW/day | ~33 MW | 1994 onwards |
| Conversational transcripts | ? | OK | LDC | None | ? | 60 calls, ? hours transcribed |
Spanish monolingual text:
| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
| Reuters LA wire | 34 MW | OK | LDC | None | 34 MW | 1993-1995 |
| Reuters SL wire | 41 MW | OK | LDC | None | 41 MW | 1993-1995 |
| APWS | 74 MW | OK | LDC | ~3 MW/month | 110 MW | 1995-1998 |
| AFP/Spanish | 213 MW | OK | LDC | ~5 MW/month | 273 MW | 1994-1999 |
| Infosel | 23 MW | OK | LDC | None | 23 MW | 1993 |
| El Norte | 84 MW | OK | LDC | None | 84 MW | 1997-1998 |
| ECI Corpus | Corpus Oral: 1 MW; Sur newspaper: 447 KW; El Diario Vasco: 830 KW | OK | LDC | None | 2.2 MW | |
| (1997 Spanish BC transcripts) | ? | OK | LDC | None | ? | 30 hours |
| Conversational transcripts | ? | OK | LDC | None | ? | 240 calls, ? hours transcribed |
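The Spanish monolingual figures in the summary (471 MW and 567 MW) can be reproduced by summing the rows above that have known word counts. The sketch below is illustrative only. The names `SPANISH_MONO` and `totals` are hypothetical, the ECI Corpus is counted as 2.2 MW at both dates, and rows with unknown ("?") amounts are left out.

```python
# (source, start-of-year MW, end-of-year MW), copied from the table above.
# Assumption: ECI Corpus counted as 2.2 MW at both dates; "?" rows are omitted.
SPANISH_MONO = [
    ("Reuters LA wire", 34.0, 34.0),
    ("Reuters SL wire", 41.0, 41.0),
    ("APWS", 74.0, 110.0),
    ("AFP/Spanish", 213.0, 273.0),
    ("Infosel", 23.0, 23.0),
    ("El Norte", 84.0, 84.0),
    ("ECI Corpus", 2.2, 2.2),
]


def totals(rows):
    """Return (start_total, end_total) in millions of words."""
    start = sum(r[1] for r in rows)
    end = sum(r[2] for r in rows)
    return start, end


if __name__ == "__main__":
    start, end = totals(SPANISH_MONO)
    print(f"Start of year one: {start:.1f} MW; end of year one: {end:.1f} MW")
```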
This is an incomplete list of as-yet unpublished resources that are in principle available:
LDC VOA Collection as of 3/20/2000 (from 12/99). Collection of ~1 hour/day in all 59 VOA languages will start 4/2000.
| Hours | Files | Language |
| 11.00 | 11 | Azerbaijani |
| 5.90 | 6 | Bangla |
| 63.81 | 70 | Burmese |
| 66.63 | 52 | Cantonese |
| 32.25 | 63 | Dari |
| 65.46 | 63 | Farsi |
| 53.01 | 55 | French |
| 25.08 | 50 | Georgian |
| 32.38 | 65 | Hausa |
| 38.85 | 73 | Hindi |
| 35.90 | 56 | Indonesian |
| 57.69 | 58 | Kazak |
| 81.06 | 82 | Khmer |
| 56.46 | 57 | Kirundi |
| 76.47 | 106 | Korean |
| 50.88 | 51 | Kyrghiz |
| 52.35 | 56 | Lao |
| 49.91 | 49 | Mandarin |
| 32.15 | 63 | Pushto |
| 57.10 | 83 | Portuguese |
| 54.86 | 55 | Tajik |
| 57.35 | 59 | Tibetan |
| 5.00 | 5 | Turkish |
| 57.89 | 58 | Turkmen |
| 14.00 | 18 | Unknown |
| 54.23 | 89 | Urdu |
| 35.62 | 73 | Uyghur |
| 50.68 | 51 | Uzbek |
| 57.59 | 60 | Vietnamese |
In addition to published materials, LDC has as-yet unreleased newswire or other journalistic archives in Albanian, Arabic, Turkish, Farsi, Russian, Thai, Hindi, Serbo-Croatian, Tamil, Indonesian, Ukrainian, Vietnamese, Khmer, Portuguese, Spanish, French, Japanese and German, among others.