Hindi resource notes:
Hindi Language Resources page:
http://www.cs.colostate.edu/~malaiya/hindilinks.html
Webdunia Hindi Portal:
http://www.webdunia.com/
Hindi on the web:
http://www.avashy.com/hindibhasha/weblinks.htm
http://theory.tifr.res.in/bombay/history/people/language/hindi.html
Resources from Indian Language Technology Solutions: http://www.cfilt.iitb.ac.in/
Hindi links via Ted Pedersen: http://www.d.umn.edu/~pura0010/hindi.html
Hindi translation site: http://mason.gmu.edu/~aross2/hindi.htm
Stuff from Anoop Sarkar
http://ldc.upenn.edu/myl/cbnlp-work.tar.gz
http://ldc.upenn.edu/myl/cbnlp_readme
"The tarfile above includes a tagger/supertagger and chunker and also a PCFG parser all trained on the tiny LTRC treebank.
The treebank is also in the data directory and the tests directory (where it has been converted to dependency trees, etc.)."
http://ldc.upenn.edu/myl/hindi_chunker_24_03_03.tgz
"T Papi Reddy, who was one of my team members in the workshop held a couple of years ago in India,
has a version of his chunker available for download from his web page."
Hindi News Sites:
BBC Hindi news:
http://www.bbc.co.uk/hindi/
Indian newspapers page:
http://www.ipl.org/div/news/browse/IN/
Some Hindi newspapers:
http://www.prabhasakshi.com/
http://www.hindimilap.com/ (in Hindi and Urdu: http://www.milap.com/aboutus.html)
http://www.naidunia.com/
http://www.navabharat.net/
http://www.jagran.com/
http://www.rajasthanpatrika.com/
http://www.bhaskar.com/
Hindi literary magazine:
http://www.udgam.com/
http://www.bharatdarshan.co.nz/
Parallel sources (more or less):
http://sify.com/news_info/news/ http://sify.com/hindi/
http://www.indiatoday.com/itoday/index.html http://www.indiatodayhindi.com/
http://www.vigyanprasar.com/dream/index.asp
News in English, Hindi, Telugu: http://www.niharonline.com/ http://www.niharonline.com/hindi/news/
English, Hindi, Marathi: http://www.rediff.com/ http://www.rediff.com/hindi/index.html
Literary magazine in English and Hindi: http://www.boloji.com/Default.asp
http://www.boloji.com/hindi/index.html
"Computer news & IT resources" http://ciol.com
http://hindi.ciol.com/main.asp
ZDnet in Hindi: http://www.zdnetindia.com/hindizone/
Indian government sites:
Government Portal: http://indiaimage.nic.in/
Parliament:
English version: http://rajyasabha.nic.in/
Hindi version: http://rajyasabha.nic.in/hindisite/hindipage.asp
Constitution: http://indiacode.nic.in/coiweb/welcome.html
Ministry of Home Affairs: http://rajbhasha.nic.in/
http://rajbhasha.nic.in/dolst_eng.htm http://rajbhasha.nic.in/dolst_hin.htm
(The English welcome page above seems to be mistakenly linked to the Hindi page.)
Press information bureau: http://pib.nic.in/ http://pib.nic.in/urdu/hindimain.html
http://164.100.24.208/
Gita in Hindi and English: http://www.gitasupersite.iitk.ac.in/
7th World Hindi Conference: http://www.vishwahindisammelan.nic.in/welcome.html
Radio:
http://www.voa.gov/hindi/
http://allindiaradio.org/
http://www.bbc.co.uk/hindi/index.shtml
Hindi learning resources:
Learn to read Hindi (i.e. devanagari):
http://www.ukindia.com/zhin001.htm
http://www.latrobe.edu.au/indiangallery/devanagari.htm
http://lrs.ed.uiuc.edu/students/avatans/resources.html
http://lrs.ed.uiuc.edu/Students/avatans/project.html
http://philae.sas.upenn.edu/Hindi/hindi.html
Grammatical sketch: http://www.it-c.dk/people/pfw/hindi/index.html
Corpora:
The beta version of the EMILLE corpus, released in March 2003:
http://www.emille.lancs.ac.uk/beta.htm
It contains about 30 million words of Hindi text and some parallel text (120,000 words?).
AnnCorra documentation
Dictionaries etc.
English-Hindi dictionary at IIIT:
http://sanskrit.gde.to/hindi/hindidictreadme.html
http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
IIIT dictionary in UTF-8 is at http://ldc.upenn.edu/myl/English-Hindi-Dictionary_2.utf8
A tab-delimited version without English example sentences is here in UTF-8 and here in ISCII.
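A minimal sketch of loading a tab-delimited bilingual list like the one above into a lookup table. The column layout is an assumption (column 1 English headword, column 2 Hindi equivalent); the notes do not document it, and the function name load_bilingual_tsv is made up for illustration. Check a few lines of the real file before relying on this.

```python
import csv
import io
from collections import defaultdict


def load_bilingual_tsv(handle):
    """Read tab-delimited lines into {english: [hindi, ...]}.

    Assumed layout (not confirmed by the notes): column 1 is the English
    headword, column 2 is the Hindi equivalent; extra columns are ignored.
    """
    entries = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        entries[row[0].strip().lower()].append(row[1].strip())
    return dict(entries)


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "water\t\u092A\u093E\u0928\u0940\n"
    "water\t\u091C\u0932\n"
    "house\t\u0918\u0930\n"
)
lexicon = load_bilingual_tsv(sample)
print(lexicon["water"])
```

For a file on disk, pass open(path, encoding="utf-8", newline="") instead of the StringIO sample.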
Hindi-English dictionaries:
http://www.wordanywhere.com/
http://www.yourdictionary.com/languages/indoiran.html#hindi
Hindi Wordnet at Resource Center for Indian Language Technology Solutions
Indian Institute of Technology - Bombay
http://www.cfilt.iitb.ac.in/
http://www.cfilt.iitb.ac.in/wordnet/webhwn/
Four English-Hindi domain-specific bilingual term lists:
http://tdil.mit.gov.in/download/menu.htm
(be sure to select SHABDIKA in the pull-down list)
Collation of Mike Schultz's list against the IIIT dictionary is here.
Morph analyzers:
http://www.iiit.net/ltrc/morph/index.htm
http://ccat.sas.upenn.edu/plc/tamilweb/hindi.html
Hindi verb conjugator:
http://www.verbix.com/languages/hindi.shtml
IIIT download site:
http://www.iiit.net/ltrc/downloads.html
Encyclopedia etc.
http://www.tdil.mit.gov.in/terminology.htm
According to http://www.afnlp.org/nlprs2001/WS-LanguageResource/001.pdf,
NHK Science and Technical Research Laboratories has been generating
a parallel corpus in 22 languages including Hindi and English since 1998
by collecting the foreign language versions of Japanese news broadcasts.
According to http://tdil.mit.gov.in/corpora/ach-corpora.htm,
the Central Institute of Indian Languages (http://www.ciil.org/)
has a 3 million word corpus.
English to Hindi MT: http://anglahindi.iitk.ac.in/index2.html
Rendering, encodings, fonts etc.:
A good illustration of why Hindi rendering is not trivial:
http://people.redhat.com/otaylor/gtk/guadec2-i18n/slide003.html
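A small sketch of one reason rendering is hard: the stored (logical) order of code points differs from what is drawn. In the word "Hindi" the vowel sign I is stored after HA but drawn to its left, and NA + VIRAMA + DA is drawn as a single conjunct. The helper name describe is made up for illustration; it uses only the standard unicodedata module.

```python
import unicodedata

# The word "Hindi" in Devanagari, written with escapes so the source is ASCII-safe.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"


def describe(text):
    """Return (code point, Unicode name) pairs in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(word):
    print(code, name)
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0926 DEVANAGARI LETTER DA
# U+0940 DEVANAGARI VOWEL SIGN II
```

Six code points are stored, but a renderer has to reorder and merge them into fewer visible glyph clusters, which is what toolkits without complex-script support get wrong.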
Website for an Indian-language localization project for linux:
http://www.indlinux.org/
They have a package that works for RedHat 8.0 to view and type (but not
print and sort) Hindi text.
As of 5/16/2003, Qt 3.2 Beta 1 supports complex rendering in Hindi and similar languages:
http://www.trolltech.com/newsroom/announcements/00000127.html
The previous releases of Qt (used e.g. in the KDE desktop) do not.
Hindi support in Java
http://www.sun.com/developers/gadc/technicalpublications/presentations/iuc22_thai_hindi.pdf
The Inscript keymap (Indian gov't standard) for Hindi:
http://www.indlinux.org/keymap/hindi.php
General discussion of fonts/encodings for Indian languages,
with pointers to Unicode fonts for Hindi etc.: http://india-n-indian.com/it/wil.html
Information about ISCII:
http://tdil.mit.gov.in/standards.htm
iscii91.pdf
ITRANS:
online interface: http://www.aczone.com/itrans/online/
The ITRANS 7-bit transliteration for devanagari -- this may be the most
practical thing for "ascii" approximations:
http://www.aczone.com/itrans/#itransencoding
more detailed ITRANS documentation: http://www.aczone.com/itrans/idoc/idoc.html
ITRANS download: http://www.aczone.com/itrans/#download
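A rough sketch of the idea behind going from Unicode Devanagari to an ITRANS-style ASCII string. This is not the ITRANS package itself: the tables cover only a handful of letters, and the function name to_itrans is made up for illustration. The one rule it shows is that a consonant carries an inherent "a" unless a vowel sign or a virama follows it.

```python
# Partial tables only; a real converter needs the full ITRANS scheme.
CONSONANTS = {
    "\u0915": "k", "\u0924": "t", "\u0926": "d", "\u0928": "n",
    "\u092E": "m", "\u0930": "r", "\u0932": "l", "\u0938": "s", "\u0939": "h",
}
VOWEL_SIGNS = {
    "\u093E": "aa", "\u093F": "i", "\u0940": "ii", "\u0941": "u", "\u0942": "uu",
    "\u0947": "e", "\u0948": "ai", "\u094B": "o", "\u094C": "au",
}
INDEPENDENT_VOWELS = {
    "\u0905": "a", "\u0906": "aa", "\u0907": "i", "\u0908": "ii", "\u0909": "u",
    "\u090A": "uu", "\u090F": "e", "\u0910": "ai", "\u0913": "o", "\u0914": "au",
}
VIRAMA = "\u094D"


def to_itrans(text):
    """Transliterate the covered subset; unknown characters pass through."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent "a"
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # replaces the inherent vowel
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False  # suppresses the inherent vowel
        else:
            if pending_a:
                out.append("a")
            pending_a = False
            out.append(INDEPENDENT_VOWELS.get(ch, ch))
    if pending_a:
        out.append("a")
    return "".join(out)


print(to_itrans("\u0939\u093F\u0928\u094D\u0926\u0940"))  # hindii
```

Note that the sketch keeps every inherent "a", including word-final ones that are not pronounced in Hindi, since the transliteration follows the spelling rather than the pronunciation.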
iscii2ascii.py for mapping from ISCII to quasi-HZ-encoded transliteration.
The CS/CSX 8-bit encoding:
http://www.aczone.com/itrans/icsx/icsx.html
General information and downloads for ITRANS and CS/CSX:
http://www.aczone.com/itrans/
Another approach to typing Hindi (for Windows only, I think):
http://www.aksharamala.com/about/
Tools for conversion among Unicode, ISCII, ITRANS, proprietary fonts:
Notes on converting from ISCII or Unicode into ITRANS
Notes on other usages and how to hack them.
ISCIlib, iconverter:
http://www.cse.iitk.ac.in/users/isciig/
http://www.cse.iitk.ac.in/users/isciig/documents/user_iconverter.txt
"Support is provided for conversions from iscii code space to unicode and vice-versa for each of the ten Indian languages.
A file containing Indian language texts in ISCII codes, can be converted to its unicode equivalent with the help of the tool - iconverter."
The same project also has a spellchecker, and iscii2ps for printing.
IBM's International Components for Unicode: http://oss.software.ibm.com/icu/
Includes "uconv" which seems to convert ISCII <-> unicode
Conversions from/to various non-standard fonts: http://www.iiit.net/ltrc/FC-1.0/fc.html
Font/encoding converters and other tools from Project Tukaram: http://www.cfilt.iitb.ac.in/resourcepage/index.html
iconv: http://www.gnu.org/software/libiconv/
http://gettext.sourceforge.net/
(but only some recent versions support ISCII -- haven't found source yet...)
iscii2itrans.py
http://www.webdunia.net/products/data_converter.asp
Resources within LDC:
Hindi newswire:
/mnt/unagi/speechd1/newswires/hindi
Naidunia uses a proprietary encoding; UTF-8 versions of our 2-year archive are here:
/pkg/ldc/newswires/hindi/processed/utf8/*.sgml
/mnt/unagi/speechd16/TIDES/Surprise/HINDI/*
/mnt/unagi/speechd16/TIDES/Surprise/HINDI.txt,v
EMILLE corpus: /speechd16/TIDES/Surprise/HINDI/EMILLE
IIIT dictionary (cleaned up): /speechd16/TIDES/Surprise/HINDI/IIIT_Dictionary/eng_hin_dict.txt
Speech corpus LDC96S52 CALLFRIEND Hindi
/mnt/talk/Surprise/HINDI/parallel_text/www.rediff.com/hindi-tkn/*.text
TidesSLList mailing list TidesSLList@ldc.upenn.edu
http://www.ldc.upenn.edu/mailman/listinfo.cgi/tidessllist
http://ldc.upenn.edu/Project/SurpriseLanguage
Our tools for hand alignment, translation and entity tagging use Qt.
From the Trolltech news release for Qt 3.2:
> The addition of Indic script input and rendering means that Qt 3.2 now
> supports all major script-based languages, including advanced languages
> such as Hindi and Bengali. Qt 3.2 is also more efficient at font
> rendering.
Other relevant tools:
Survey of tools for Indian languages: http://www.indian-languages.org/
For extracting text from .pdf:
http://www.foolabs.com/xpdf/
http://www.foolabs.com/xpdf/download.html
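A minimal sketch of driving xpdf's pdftotext from Python, assuming the pdftotext binary is installed and on the PATH. The function names and file names are made up for illustration. The -enc option asks for UTF-8 output; if a PDF uses a proprietary Hindi font encoding, the extracted text will probably still need one of the font converters listed above.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, encoding="UTF-8"):
    """Build the argument list for xpdf's pdftotext with an explicit output encoding."""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def extract_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Hypothetical usage (file names are placeholders):
# extract_text("article.pdf", "article.txt")
print(" ".join(build_pdftotext_command("article.pdf", "article.txt")))
```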