

Text Data Collection

WWW Capture

A large and rapidly growing number of websites provide content, for example news articles, at very low cost. In many cases, this content can be harvested, converted into an appropriate format (e.g., SGML) and collected to create a useful corpus. The reader should note that many providers of web-based content retain their copyright. Therefore, one should not assume that all web-accessible data can be freely used and shared. It is sometimes also possible to use the contact information on a foreign-language website as a starting point for negotiations with a publishing company for rights to use its proprietary database.

The main difficulties posed by text harvest from WWW sites are:

  • Web sites tend to be transient. They reorganize their content, change addresses, and start up or go out of business more rapidly than traditional information providers.
  • They often have a low daily volume or an irregular supply of new material.
  • They often have a low signal-to-noise ratio (SNR), with lots of formatting, JavaScript and irrelevant text on each page.
Nevertheless, by combining material from several websites and normalizing markup and document content, it is possible to create a useful corpus.

Content providers who use the World Wide Web as a distribution medium are increasingly sensitive to copyright. Be sure you have secured the necessary permissions before harvesting and distributing web-based content.

WWW-based content is typically formatted with HTML markup. Pages are not delivered to you; they must be downloaded by the collector. A robust harvest script should ignore content that has already been downloaded and retrieve only new material. It must also comply with the Robot Exclusion Protocol to be sure that you are not harvesting material the sites' owners want to protect.
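
As a rough illustration (not LDC's own harvesting code), the Python sketch below shows both behaviors: it consults each site's robots.txt before fetching and skips any page already saved to disk. The user-agent string, seed URLs, and output directory are placeholders.

    #!/usr/bin/env python3
    # Polite-harvest sketch: honor robots.txt and skip pages already on disk.
    # The seed URLs, user-agent string, and output directory are placeholders.
    import hashlib
    import os
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "ExampleHarvester/0.1"    # identify your crawler honestly
    OUTPUT_DIR = "harvested_pages"         # hypothetical local archive
    SEED_URLS = ["http://www.example.com/news/article1.html"]  # hypothetical

    def allowed_by_robots(url):
        """Consult the site's robots.txt (Robot Exclusion Protocol) before fetching."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def local_path(url):
        """Map a URL to a stable filename so re-runs detect prior downloads."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return os.path.join(OUTPUT_DIR, digest + ".html")

    def harvest(urls):
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        for url in urls:
            path = local_path(url)
            if os.path.exists(path):         # already downloaded: ignore it
                continue
            if not allowed_by_robots(url):   # excluded by the site owner: skip it
                continue
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
                out.write(resp.read())

    if __name__ == "__main__":
        harvest(SEED_URLS)

A production script would add error handling and logging around the network calls; the point here is only the two checks.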

Standard web crawler utilities like wget, webcopy, curl, and Sam Spade can all be configured to harvest articles from the web. Another alternative is to write a customized harvest script in Perl using the Perl HTTP modules. Whichever method you choose, it is your responsibility to minimize the impact of your text harvesting on the servers you visit. To accomplish this, you should not only follow the Robot Exclusion Protocol but also run your harvesting processes during the servers' off-peak hours. Check the time zones of the servers you visit; your quiet times are not necessarily theirs. Where possible, configure your processes so that they harvest small amounts of data at a time and can stop and restart without loss of data, allowing them to yield during high-traffic periods.
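
The paragraph above suggests Perl or an off-the-shelf crawler; purely as a sketch of the same politeness in Python (the delay value and the state-file name are assumptions), a harvest loop can pause between requests and checkpoint its progress so it can be stopped during busy hours and resumed without re-fetching anything:

    #!/usr/bin/env python3
    # Throttled, restartable fetch loop: pause between requests and checkpoint
    # progress so the job can stop during busy hours and resume without loss.
    import time
    import urllib.request

    DELAY_SECONDS = 10               # assumed pause between requests
    STATE_FILE = "fetched_urls.txt"  # hypothetical record of completed URLs

    def load_state():
        try:
            with open(STATE_FILE) as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def polite_fetch(urls):
        done = load_state()
        for url in urls:
            if url in done:          # finished on a previous run
                continue
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
            # ... hand `data` to the conversion step described below ...
            with open(STATE_FILE, "a") as f:
                f.write(url + "\n")  # checkpoint after each page
            time.sleep(DELAY_SECONDS)  # spread the load on the remote server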

Consider the following tools for text harvesting:

  • curl (http://curl.haxx.nu/): "Curl is a tool for getting files with URL syntax, supporting FTP, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. Curl supports HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication and a busload of other useful tricks." It can be used in combination with Perl to do mirroring or recursive capture.
  • wget (http://www.gnu.org/software/wget/wget.html): "GNU Wget is a freely available network utility to retrieve files from the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP (File Transfer Protocol), the two most widely used Internet protocols."
  • Sam Spade (http://samspade.org/ssw): A very nice Win32 application. Recursive capture is just one of its features; it also includes several other WWW utilities, and its raw web capture works quite well.

Once the pages are downloaded, they must be converted into usable data. The HTML markup must be stripped or converted into the form you will use. Document ID, date, author, title, source, category, and keyword information can be tagged explicitly, either by hand or by writing a script to do the translation. Often the HTML markup (which provides formatting information) can be translated into document information. Unfortunately, this process is error-prone and requires extra attention. The markup in some sources is so irregular that attempts at transduction produce more frustration than clean text. Another stumbling block is that web sites often change over time; the script that worked flawlessly last year may be unusable on current articles.
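
As a rough sketch of this conversion step (using Python's standard html.parser; the output tag names DOC, DOCNO, HEADLINE, and TEXT are illustrative, not a prescribed format), one can strip the markup and wrap the surviving text in simple SGML-style document tags:

    #!/usr/bin/env python3
    # Conversion sketch: strip HTML markup and wrap the text in simple
    # SGML-style document markup. The tags used here (DOC, DOCNO, HEADLINE,
    # TEXT) are illustrative only.
    from html.parser import HTMLParser

    class ArticleExtractor(HTMLParser):
        """Collect the <title> as a headline and all other visible text."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.skipping = False    # true inside <script> or <style>
            self.headline = []
            self.body = []

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True
            elif tag in ("script", "style"):
                self.skipping = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False
            elif tag in ("script", "style"):
                self.skipping = False

        def handle_data(self, data):
            if self.skipping:
                return
            text = data.strip()
            if not text:
                return
            (self.headline if self.in_title else self.body).append(text)

    def to_sgml(html_text, docid):
        parser = ArticleExtractor()
        parser.feed(html_text)
        return ("<DOC>\n<DOCNO> %s </DOCNO>\n<HEADLINE> %s </HEADLINE>\n"
                "<TEXT>\n%s\n</TEXT>\n</DOC>\n"
                % (docid, " ".join(parser.headline), "\n".join(parser.body)))

Calling to_sgml() on a saved page with a document ID of your own devising yields one SGML document; fields like date, source, and category must be filled in from whatever markup or URL conventions the site actually provides, which is exactly where the irregularity described above causes trouble.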


