TIDES Data Inventory

Information for (and from) the TIDES data committee: a work in progress.

Primary focus is on immediate (year one) needs.

If you have any questions, additions or corrections, please contact Mark Liberman.

1. Resources for multilingual research

Multilingual research in TIDES year one is focused on Mandarin, Korean and Spanish. This is a sketch of basic resources in those languages, currently or potentially available to TIDES researchers for year one research. For the purposes of this inventory, "TIDES year one" is taken to start 1/1/2000.

In the tables below, "KW" means "thousand words", "MW" means "million words"; "KCh" means "thousand (Chinese) characters", "MCh" means "million (Chinese) characters". For parallel corpora, where only one count is given, it applies to each of the parallel languages, not their sum.

Unless otherwise indicated, Mandarin resources are in GB encoding.

Summary:

Major year one needs -- i.e. needed NOW or within two or three months, not in a year or two:

 
| | Mandarin | Korean | Spanish |
|---|---|---|---|
| Parallel Text 1/1/2000 (in hand) | 25 MW | 0-50 KW | 61 MW |
| Parallel Text 1/1/2001 (in hand or credibly promised) | 35 MW | 400-450 KW | 61 MW |
| Monolingual Text 1/1/2000 | 599 MCh =~ 327 MW | 23 MW | 471 MW |
| Monolingual Text 1/1/2001 | 696 MCh =~ 380 MW | 33 MW | 567 MW |
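The Mandarin monolingual totals are given in both characters and words, which implies a conversion ratio between the two. The inventory does not state that ratio; the sketch below only derives it from the two summary pairs (the function names are illustrative, not part of any TIDES tooling).

```python
# Mandarin monolingual totals from the summary: (million characters, approx. million words).
SUMMARY_PAIRS = {
    "1/1/2000": (599.0, 327.0),
    "1/1/2001": (696.0, 380.0),
}


def implied_chars_per_word(mch: float, mw: float) -> float:
    """Characters per word implied by one (MCh, MW) pair."""
    return mch / mw


def mch_to_mw(mch: float, chars_per_word: float) -> float:
    """Convert million characters to million words under an assumed ratio."""
    return mch / chars_per_word


for date, (mch, mw) in SUMMARY_PAIRS.items():
    ratio = implied_chars_per_word(mch, mw)
    print(f"{date}: {mch:g} MCh / {mw:g} MW = {ratio:.2f} characters per word")
```

Both pairs give about 1.83 characters per word, so the same assumed ratio can be used to put Mandarin character counts on roughly the same scale as the Korean and Spanish word counts.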

 

A. Parallel text data (documents in language X paired with English translations)

Mandarin:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| Hong Kong Legal Code | 11.5 MCh (Mandarin); 6.3 MW (English) | OK | LDC | N/A | 11.5 MCh / 6.3 MW | Big 5 encoding. Sentence aligned. |
| Hong Kong News | 17 MCh (Mandarin); 9.3 MW (English) | OK | LDC | 900 KCh/mo.; 500 KW/mo. | 27.8 MCh / 15.3 MW | Big 5 encoding. Story aligned, will be sentence aligned. |
| Hong Kong Hansard | 17.5 MCh (Mandarin); 9.2 MW (English) | OK | LDC | 600 KCh/mo.; 300 KW/mo. | 24.7 MCh / 12.8 MW | Big 5 encoding. Will be sentence aligned. |
| FBIS, various sources | None | English OK (?), Mandarin to be determined | Not yet | ~40 KW/mo. starting ~2/10/2000 | ~420 KW | Source accumulation has started ~2/10/2000 (?); processes for normalization, alignment etc. need to be established. |
| STRAND, various internet sources (Philip Resnik/Jinxi Xu) | 3289 URL pairs; 17.6 MW (English) | No IPR permissions; others can download from specified URLs | URL list may be available | (unknown) | | Probably overlaps considerably (almost entirely?) with LDC Hong Kong data. |
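The projected amounts for the Hong Kong sources appear to equal the archive amount plus twelve months of accumulation at the listed monthly rate. The inventory does not state this formula, so the following is a minimal sketch under that assumption; `project_end_of_year` is a hypothetical helper name.

```python
MONTHS_IN_YEAR_ONE = 12  # assumption: year one runs 1/1/2000 to 1/1/2001


def project_end_of_year(archive: float, monthly_rate: float,
                        months: int = MONTHS_IN_YEAR_ONE) -> float:
    """Archive amount plus steady accumulation over the given number of months."""
    return archive + monthly_rate * months


# (label, archive at start of year one, monthly rate, projected amount as listed)
ROWS = [
    ("Hong Kong News, Mandarin (MCh)", 17.0, 0.9, 27.8),
    ("Hong Kong News, English (MW)", 9.3, 0.5, 15.3),
    ("Hong Kong Hansard, Mandarin (MCh)", 17.5, 0.6, 24.7),
    ("Hong Kong Hansard, English (MW)", 9.2, 0.3, 12.8),
]

for label, archive, rate, listed in ROWS:
    projected = project_end_of_year(archive, rate)
    print(f"{label}: computed {projected:.1f}, listed {listed}")
```

Under this assumption the computed values match the listed projections for all four rows. The FBIS row does not fit the same pattern, since its accumulation starts partway through the year.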

 

Korean:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| DLI U.S. Army translator training texts entered at Penn (M. Palmer) | ~50 KW | DLI has not released it | unknown | none | ~50 KW | Domain is "ground troop movements". Both English and Korean are being Treebanked. |
| FBIS, various sources | None | English OK (?), Korean to be determined | not yet | >40 KW/mo. | ~400 KW | Source accumulation will start soon (?); processes for normalization, alignment etc. need to be established. |
| BITS (web search) | None | To be determined | unknown | unknown | ~6 MW (guess based on preliminary survey) | This is the system that found and document-aligned the Mandarin parallel corpora. |

 

Spanish:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| UN document archive | 47 MW | OK | LDC | None | 47 MW | Published by LDC. Topics very different from journalistic text. |
| ECI Corpus | ITU CCITT Handbook: 4.5 MW; ILO Bulletin (1984-1989): 1.7 MW; 3 others: 475 KW total | OK | LDC | None | ~6.6 MW | Published by LDC. |
| ELRA-0007 (EC 9-language corpus) | 1.1 MW (1993 Q&A); 5-8 MW (1992-94 Parliamentary debates) | OK | ELRA | None | ~7 MW | Published by ELRA. Some parts have been variously annotated in the JOC corpus. |
| FBIS, various sources | None | English OK (?), Spanish to be determined | Not yet | Unknown; information promised for 2/18/2000 | Unknown | FBIS Latin American Bureaus to begin using new computer system 2/16/2000; source accumulation can start after that. |
| WHO parallel corpus | ? | ? | Unclear | ? | ? | |

 

B. Monolingual text data

Mandarin:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| People's Daily | 147 MCh | OK | LDC | None | 147 MCh | 1991-1996 |
| China Radio | 110 MCh | OK | LDC | None | 110 MCh | 1991-1996 |
| Xinhua | 158 MCh | OK | LDC | 2.15 MCh/mo. | 184 MCh | 1994-1999 |
| CNA | 102 MCh | OK | LDC | 3.2 MCh/mo. | 140 MCh | 1996-1999 |
| Zaobao | 77.7 MCh | OK | LDC | 2.8 MCh/mo. | 111 MCh | |
| (1997 Mandarin Broadcast - Sources?) | ? | OK | LDC | None | ? | Broadcast transcriptions (30 hours) |
| TDT2 Mandarin Text | ? | OK | LDC | None | ? | |
| Conversational transcripts | ? | OK | LDC | None | ? | 240 conversations, ? hours transcribed |
| ECI Corpus/Xinhua | 3.75 MCh | OK | LDC | None | 3.75 MCh | 1/1990-3/1991 |
| (Some Hong Kong Newspaper) | ? | ? | ? | None | ? | 1998 -- Obtained by Donna Harman for TREC -- details to come |
| Academia Sinica balanced corpus | ~5 MW | "Academic research" only | ROCLING | None | ~5 MW | Big 5 encoding |

Korean:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| Korean Press Agency | 23.2 MW | OK | LDC | 33 KW/day | ~33 MW | 1994 onwards |
| Conversational transcripts | ? | OK | LDC | None | ? | 60 calls, ? hours transcribed |

Spanish:

| Source(s) | Archive amount (start of year one) | IPR Status | Availability | Ongoing accumulation | Projected amount (end of year one) | Comments |
|---|---|---|---|---|---|---|
| Reuters LA wire | 34 MW | OK | LDC | None | 34 MW | 1993-1995 |
| Reuters SL wire | 41 MW | OK | LDC | None | 41 MW | 1993-1995 |
| APWS | 74 MW | OK | LDC | ~3 MW/month | 110 MW | 1995-1998 |
| AFP/Spanish | 213 MW | OK | LDC | ~5 MW/month | 273 MW | 1994-1999 |
| Infosel | 23 MW | OK | LDC | None | 23 MW | 1993 |
| El Norte | 84 MW | OK | LDC | None | 84 MW | 1997-1998 |
| ECI Corpus | Corpus Oral: 1 MW; Sur newspaper: 447 KW; El Diario Vasco: 830 KW | OK | LDC | None | 2.2 MW | |
| (1997 Spanish BC transcripts) | ? | OK | LDC | None | ? | 30 hours |
| Conversational transcripts | ? | OK | LDC | None | ? | 240 calls, ? hours transcribed |
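The Spanish monolingual figures in the summary (471 MW and 567 MW) appear to be the column sums of the rows above that have a known size. The inventory does not say how the summary was computed, so the following is only a consistency check under that assumption, with rows marked "?" left out.

```python
# Spanish monolingual sources with known sizes, in million words:
# name -> (archive at start of year one, projected at end of year one)
SPANISH_MONOLINGUAL_MW = {
    "Reuters LA wire": (34.0, 34.0),
    "Reuters SL wire": (41.0, 41.0),
    "APWS": (74.0, 110.0),
    "AFP/Spanish": (213.0, 273.0),
    "Infosel": (23.0, 23.0),
    "El Norte": (84.0, 84.0),
    "ECI Corpus": (2.2, 2.2),
}

start_total = sum(start for start, _ in SPANISH_MONOLINGUAL_MW.values())
end_total = sum(end for _, end in SPANISH_MONOLINGUAL_MW.values())

print(f"Start of year one: {start_total:.1f} MW (summary lists 471 MW)")
print(f"End of year one:   {end_total:.1f} MW (summary lists 567 MW)")
```

The same check can be repeated for the other languages by swapping in their rows; sources whose size is listed as "?" would add to these totals once counted.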

 

Other resources

This is an incomplete list of as-yet unpublished resources that are in principle available:

 

LDC VOA Collection as of 3/20/2000 (from 12/99). Collection of ~1 hour/day in all 59 VOA languages will start 4/2000.

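| Hours | Files | Language |
|---|---|---|
| 11.00 | 11 | Azerbaijani |
| 5.90 | 6 | Bangla |
| 63.81 | 70 | Burmese |
| 66.63 | 52 | Cantonese |
| 32.25 | 63 | Dari |
| 65.46 | 63 | Farsi |
| 53.01 | 55 | French |
| 25.08 | 50 | Georgian |
| 32.38 | 65 | Hausa |
| 38.85 | 73 | Hindi |
| 35.90 | 56 | Indonesian |
| 57.69 | 58 | Kazak |
| 81.06 | 82 | Khmer |
| 56.46 | 57 | Kirundi |
| 76.47 | 106 | Korean |
| 50.88 | 51 | Kyrghiz |
| 52.35 | 56 | Lao |
| 49.91 | 49 | Mandarin |
| 32.15 | 63 | Pushto |
| 57.10 | 83 | Portuguese |
| 54.86 | 55 | Tajik |
| 57.35 | 59 | Tibetan |
| 5.00 | 5 | Turkish |
| 57.89 | 58 | Turkmen |
| 14.00 | 18 | Unknown |
| 54.23 | 89 | Urdu |
| 35.62 | 73 | Uyghur |
| 50.68 | 51 | Uzbek |
| 57.59 | 60 | Vietnamese |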

In addition to published materials, LDC has as-yet unreleased newswire or other journalistic archives in Albanian, Arabic, Turkish, Farsi, Russian, Thai, Hindi, Serbo-Croatian, Tamil, Indonesian, Ukrainian, Vietnamese, Khmer, Portuguese, Spanish, French, Japanese and German, among others.