Text Data Collection
There is a large and rapidly growing number of websites dedicated to providing content, for example news articles, at very low cost. In many cases, this content can be harvested, converted into an appropriate format (e.g. SGML) and collected to create a useful corpus. The reader should note that many providers of web-based content retain their copyright. Therefore, one should not assume that all web-accessible data can be freely used and shared. It is sometimes also possible to use contact information on foreign-language websites as a starting point for negotiations with a publishing company for rights to use its proprietary database.
The main difficulties posed by harvesting text from WWW sites are:
Content providers who use the World Wide Web as a distribution medium are increasingly sensitive to copyright. Be sure you have secured the necessary permissions before harvesting and distributing web-based content.
WWW-based content is typically formatted with HTML markup. Rather than being delivered, pages must be downloaded. A robust harvest script should ignore content that has already been downloaded and retrieve only new material. It MUST also comply with the Robot Exclusion Protocol, so that you do not harvest material the sites' owners want to protect.
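The two requirements above (skip what has already been downloaded, and respect the Robot Exclusion Protocol) can be sketched with Python's standard library. This is a minimal illustration, not a complete harvester: the user-agent name, the JSON log format, and the helper names are assumptions made for the example, and the actual download step is left out.

```python
import json
from pathlib import Path
from urllib import robotparser

# Hypothetical user-agent name; choose one that identifies your project.
USER_AGENT = "example-harvester"


def parse_robots(robots_txt):
    """Build a parser from the text of a site's robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def load_seen(log_path):
    """Read the set of URLs harvested on earlier runs (empty if no log yet)."""
    path = Path(log_path)
    return set(json.loads(path.read_text())) if path.exists() else set()


def save_seen(log_path, seen):
    """Write the set of harvested URLs back to the log file."""
    Path(log_path).write_text(json.dumps(sorted(seen)))


def select_new(candidates, seen, parser):
    """Keep only URLs that are new and that robots.txt permits us to fetch."""
    return [url for url in candidates
            if url not in seen and parser.can_fetch(USER_AGENT, url)]
```

In real use, the robots.txt text would come from the site itself, for example by calling `set_url()` and `read()` on the `RobotFileParser` instead of `parse()`. Each successfully downloaded URL would then be added to the seen set and saved, so that the next run skips it.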
Standard webcrawler utilities like wget, webcopy, curl, and Sam Spade can all be configured to harvest articles from the web. An alternative is to write a customized harvest script in Perl using the Perl HTTP modules. Whichever method you choose, note that it is your responsibility to minimize the impact of your text harvesting on the servers you visit. To accomplish this, you should not only follow the Robot Exclusion Protocol but also run your harvesting processes at times when the servers you visit are less likely to be busy. Check the time zones of those servers: your quiet times are not necessarily theirs. Where possible, configure your processes so that they harvest small amounts of data at a time and can stop and restart without loss of data, to accommodate high-traffic times.
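As a rough sketch of these courtesy measures, the Python below checks whether a server is inside an assumed quiet window in its own local time, and harvests in small batches with a checkpoint so that a run can stop and resume without repeating work. The quiet window (01:00 to 05:00), the batch size, the delay, and the checkpoint format are all assumptions chosen for illustration. The fixed UTC offset ignores daylight saving time, and `fetch` stands for whatever download function you supply.

```python
import json
import time as clock
from datetime import datetime, time, timedelta, timezone
from pathlib import Path


def is_quiet_time(now_utc, utc_offset_hours, start=time(1, 0), end=time(5, 0)):
    """True if the server's local clock falls inside the assumed quiet window.

    now_utc must be a timezone-aware datetime; utc_offset_hours is the
    server's offset from UTC (an assumption you must look up per site).
    """
    server_tz = timezone(timedelta(hours=utc_offset_hours))
    local = now_utc.astimezone(server_tz).time()
    return start <= local < end


def load_done(checkpoint):
    """Read the URLs completed on earlier runs (empty if no checkpoint yet)."""
    return set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()


def harvest_batch(urls, fetch, checkpoint, batch_size=5, delay_seconds=2.0):
    """Fetch at most batch_size unfinished URLs, pausing between requests.

    Progress is written after every successful fetch, so an interrupted run
    loses at most the page that was in flight.
    """
    done = load_done(checkpoint)
    todo = [u for u in urls if u not in done][:batch_size]
    for i, url in enumerate(todo):
        fetch(url)
        done.add(url)
        checkpoint.write_text(json.dumps(sorted(done)))
        if i < len(todo) - 1:
            clock.sleep(delay_seconds)
    return todo
```

A scheduler could call `is_quiet_time(datetime.now(timezone.utc), offset)` before each batch and simply exit when the answer is false; the checkpoint file lets the next invocation pick up where the last one stopped.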
Once the pages are downloaded, they must be converted into usable data. The HTML markup must be stripped or converted into the form you will use. Document ID, date, author, title, source, category, and keyword information can be tagged explicitly, either by hand or by writing a script to do the translation. Often, the HTML markup (which provides formatting information) can be translated into document information. Unfortunately, this process is error-prone and requires extra attention. The markup in certain sources will be so irregular that attempts at transduction will result in more frustration than clean text. Another stumbling block is that websites often change over time; the script that worked flawlessly last year may be unusable on current articles.
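To make the conversion step concrete, here is a minimal sketch that uses Python's standard `html.parser` module to strip markup and pull out a few pieces of document information. It assumes, purely for illustration, that the title lives in `<title>` and that fields such as author appear in `<meta name="..." content="...">` tags; real sources vary widely, so the class and function names here are hypothetical and the rules would need adapting per site.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the <title>, <meta name=...> values, and visible text."""

    SKIP = {"script", "style"}  # elements whose contents are not article text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return title, metadata, and plain text for one downloaded page."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "text": " ".join(parser.text_parts),
    }
```

Because site layouts change over time, it is worth checking the output of a function like `extract` on a sample of recent pages before each harvest, for example by flagging documents whose title or text comes back empty.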