Text Data Collection

News Feeds

Purchasing a subscription to an electronic newsfeed is another possible method for developing a text corpus. Newsfeeds in various languages can be purchased from companies like the Associated Press or Agence France Presse. The data can be delivered by several different methods: in (near) real time over a satellite feed, modem, or leased line; over a dial-up modem; or by periodic deliveries on tape, CD-ROM, or floppy disk.

In any case, newsfeeds from professional newswire companies have one thing in common: high cost. Many companies will charge as much as $20,000 per year (this includes usage fees and negotiated distribution rights). This approach does have some advantages: for example, technical support for the delivery mechanisms is often included in the subscription price. Newswire companies also tend to provide a high volume of articles per day, and the signal-to-noise ratio (SNR) is better than for WWW resources.

Many newswire companies employ a "push" architecture: the wire is always live, and new stories are presented on it automatically. The articles include both news information and "metadata" (summary, people, keywords). Many stories are transmitted progressively: the initial transmission contains some details, and subsequent deliveries provide new information. In some cases, this must be taken into account when formatting the articles for inclusion in the corpus.

ANPA (American Newspaper Publishers Association) 1312 is a standard that defines a document wrapper providing information about the enclosed article, such as story ID, version, keywords, and release date. Newswires from the major providers are still being delivered in ANPA format; however, there are initiatives in place to change delivery to XML:

XMLNews: http://www.xmlnews.org http://www.stars.com/Internet/Future/xmlnews.html
NITF: http://www.oasis-open.org/cover/nitf.html
IPTC: http://www.iptc.org/iptc/
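
To illustrate why progressive transmission matters when formatting articles for a corpus, here is a minimal Python sketch of one possible approach: keeping only the most recent version of each story. It does not parse the ANPA 1312 wrapper itself. The `Story` class, its field names, and the sample records are hypothetical, and the sketch assumes that versions are numbered with increasing integers and that each later delivery fully replaces the earlier one. If later deliveries only add material, they would need to be merged instead.

```python
from dataclasses import dataclass


@dataclass
class Story:
    # Hypothetical field names; a real wrapper parser would supply these values.
    story_id: str
    version: int
    keywords: list
    body: str


def latest_versions(stories):
    """Return a dict mapping each story ID to its highest-numbered version."""
    latest = {}
    for story in stories:
        current = latest.get(story.story_id)
        # Assumption: a higher version number supersedes a lower one.
        if current is None or story.version > current.version:
            latest[story.story_id] = story
    return latest


# Invented sample feed: story "a101" arrives twice, with more detail the second time.
feed = [
    Story("a101", 1, ["election"], "Polls open."),
    Story("b202", 1, ["weather"], "Storm warning issued."),
    Story("a101", 2, ["election"], "Polls open. Turnout is high."),
]

corpus = latest_versions(feed)
for story_id in sorted(corpus):
    print(story_id, corpus[story_id].version, corpus[story_id].body)
```

Keying on the story ID means an updated delivery overwrites the earlier one rather than appearing in the corpus as a near-duplicate article.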
Contact: ldc@ldc.upenn.edu