Web Data Extraction
This topic deserves a whole article, not just a chapter. What is happening on the web is very interesting. From the legal point of view, site owners of course hold all ownership and copyright over their web site content. But the truth is that a site with no visitors is worthless to its owner. Sites that are not visited and crawled by search engines, at least by the big ones (Google, Yahoo, MSN), also have small chances to succeed. And so many sites today use content that is not 100% original, or is at least "inspired" by information published elsewhere on the web, that it is hard to tell what their copyright claims can really protect.
It's a common belief that search engine crawlers are always welcome. While this is usually true for Google and two or three other major search engines, most site owners know the nightmare that dozens of other spiders can bring to their log files. Not only do these other spiders consume a lot of bandwidth and slow down the web application, but a large part of them do not bother to identify themselves correctly in the User-Agent header and do not obey robots.txt or robots meta instructions. While friendly, well-known spiders crawl your site for indexing purposes, collecting just the minimum amount of information that helps your site be found faster on the web, other spiders are simply data extractors that come to steal your information and possibly your site content as a whole.
It's hard to protect against them, because they frequently come from different IP addresses. A good spider, however, will obey at least the robots.txt or sitemap rules, even if you disallowed crawling for the whole site and have nothing to offer it. Good spiders will also identify themselves properly in their User-Agent header, with a web address where you can find more information about them and some way to send feedback to a real person behind them. Really good spiders also offer a way to be notified when you have fresh content and would like to be crawled.
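As a minimal sketch of the identification described above, the Python snippet below builds a request whose User-Agent header carries a spider name, an information page and a contact address. The spider name, the URLs, the e-mail address and the build_request helper are all hypothetical, chosen only for illustration; the exact layout of the header value is a common convention, not a requirement.

```python
from urllib.request import Request


def build_request(url, bot_name, version, info_url, contact):
    """Build a request that identifies the spider in its User-Agent header.

    The layout "name/version (+info page; contact)" is an assumed
    convention, not something mandated by a standard.
    """
    user_agent = f"{bot_name}/{version} (+{info_url}; {contact})"
    return Request(url, headers={"User-Agent": user_agent})


# All values below are hypothetical placeholders.
req = build_request(
    "http://www.example.com/index.html",
    bot_name="ExampleSpider",
    version="1.0",
    info_url="http://www.example.com/spider.html",
    contact="spider@example.com",
)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A site owner who finds such a header in the log files can visit the information page or write to the contact address, which is exactly what an anonymous or forged User-Agent makes impossible.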
There is no legal protection against spiders that do not obey robots.txt. Many sites instead state in their Terms and Conditions page, in plain text, that they don't want to be crawled. This is wrong and meaningless, because most spiders crawl thousands or millions of sites automatically, and it cannot be assumed that the person behind them has time to look at each site individually and discover whether crawling instructions exist somewhere else. Under these circumstances, it's hard to imagine that someone can be sued because he wasn't able to discover your Terms and Conditions, or whatever other page you used.
If you really want to disallow crawling (or rather, to send the message that you don't want crawling) for the whole site or just for some pages or directories, use robots.txt. It's a standard all spider creators know about. And, only if they want to, they can easily make their automated process look at this single text file on each visited site before crawling.
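To show how little work this check takes on the spider's side, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the spider names and the URLs are hypothetical; a real spider would download the file from the site root (for example with RobotFileParser.set_url and read) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one named spider is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string, without network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_crawl(parser, user_agent, url):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

for agent, url in [
    ("ExampleSpider/1.0", "http://www.example.com/index.html"),
    ("ExampleSpider/1.0", "http://www.example.com/private/data.html"),
    ("BadBot/2.0", "http://www.example.com/index.html"),
]:
    print(agent, url, may_crawl(parser, agent, url))
```

Note that nothing forces a spider to call may_crawl before fetching a page: the file only states the owner's wishes, and honoring them remains the spider author's choice.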
Conclusions
There are many situations where data extraction and reverse-engineering techniques are illegal, even when applied to a software product you own and paid for. There are also plenty of other situations where data extraction is not only allowed, but saves you from catastrophic losses or builds a better Internet. You should always be aware of when you are and when you are not allowed to use data extraction and reverse-engineering techniques like the ones presented in this magazine. When necessary, we do our best to inform you in each article about possible legal issues of the methods presented.