Apr 18

Download Source Code: YahooMoviesScrapper.zip - 18.17KB

Web Navigation

Web navigation is kept simple by a general Navigate method, which lets you choose the HTTP method for each request. Every HTTP request goes through two or three stages:

  1. Prepare the web request.
  2. Issue the web request and get back the response.
  3. Read the stream of text data that may come with the response.

In .NET, you can use the System.Net HttpWebRequest and HttpWebResponse classes, and read the data stream with a System.IO.StreamReader. A WebClient can do the job too, but the WebRequest/WebResponse pair offers more flexibility, information and control. In ASP.NET pages, request and response objects are already part of the System.Web.UI.Page itself, but our crawler is more generic and can be used from both standalone Windows applications and server-side code. It's up to you.

You'll mostly use the GET method, with an optional query string composed of property=value pairs, joined by & and separated from the page URL by ?. If your URL is already fully prepared, you can instead set the Address property directly.
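To make the query string layout concrete, here is a minimal sketch that builds one from alternating name/value arguments and appends it to a URL. The QueryStringSketch class and its two methods are illustration names, not part of the downloadable source, and the sketch assumes Uri.EscapeDataString for encoding, where Navigate uses HttpUtility.UrlEncode (the visible difference is that a space becomes %20 rather than +).

```csharp
using System;
using System.Text;

public static class QueryStringSketch
{
    // Builds "name=value&name=value" from alternating name/value arguments.
    // Values are escaped with Uri.EscapeDataString (an assumption: the
    // article's Navigate method uses HttpUtility.UrlEncode instead).
    public static string BuildQuery(params object[] pairs)
    {
        StringBuilder qs = new StringBuilder();
        for (int i = 0; i + 1 < pairs.Length; i += 2)
        {
            if (qs.Length > 0)
                qs.Append('&');
            qs.Append(pairs[i]).Append('=')
              .Append(Uri.EscapeDataString(Convert.ToString(pairs[i + 1])));
        }
        return qs.ToString();
    }

    // Appends the query to a URL, using '?' or '&' depending on
    // whether the URL already carries a query string.
    public static string AppendQuery(string url, string query)
    {
        if (string.IsNullOrEmpty(query))
            return url;
        return url + (url.IndexOf('?') < 0 ? "?" : "&") + query;
    }
}
```

With the hypothetical names q and page, BuildQuery("q", "star wars", "page", 2) gives q=star%20wars&page=2, ready to be appended to the page URL.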

The HEAD method is rarely used, but for web scrapers it is very important and can offer great performance gains. There are many cases when you do not need to transfer the content of a response: you just want to know if a page exists, when it was last modified, or how large it is. A HEAD request gives you exactly that, much faster, because the server sends back only the headers. Each web request and web response has some associated Headers. These are properties of the web handshake and of the target page. A browser does not show these headers on the page itself, and you cannot issue a HEAD request from its address bar.
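As a rough illustration of what you can do with the headers after a HEAD request, the sketch below reads the size and the last modification time from a header collection. HeadInfoSketch, GetSize and GetLastModified are hypothetical helpers, not part of the article's class. They accept a NameValueCollection, which is the base class of the WebHeaderCollection that an HttpWebResponse exposes. Keep in mind that servers are not obliged to send either header, so both helpers report a missing value instead of failing.

```csharp
using System;
using System.Collections.Specialized;
using System.Globalization;

public static class HeadInfoSketch
{
    // Returns the Content-Length header as a number,
    // or -1 when the header is absent or not a plain number.
    public static long GetSize(NameValueCollection headers)
    {
        long size;
        return long.TryParse(headers["Content-Length"], NumberStyles.None,
            CultureInfo.InvariantCulture, out size) ? size : -1;
    }

    // Returns the Last-Modified header as a UTC time,
    // or null when the header is absent or not in RFC 1123 format.
    public static DateTime? GetLastModified(NameValueCollection headers)
    {
        DateTime when;
        return DateTime.TryParseExact(headers["Last-Modified"], "r",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out when) ? when : (DateTime?)null;
    }
}
```

A scraper could compare the returned time with the time of its last visit and skip the full GET when the page has not changed.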

Many web pages use the POST method behind the scenes, typically when a form is submitted. Unlike GET, POST property values travel in the request body, so they do not appear in the URL as a query string. With Navigate, you can also generate POST requests.

There are other features in our Navigate method below - such as UserAgent and Proxy properties, robots.txt check, web service calls - that we will use later, for more complex scrapers. For now, the full coverage of the three main HTTP methods should be more than enough:

/// <summary>
/// Main method for HTTP web navigation
/// </summary>
/// <param name="url">base url</param>
/// <param name="method">GET / POST / HEAD</param>
/// <param name="query">optional name+value pairs,
/// for GET QS or POST data</param>
public void Navigate(string url, 
    string method, params object[] query)
{
    Debug.Assert(method == "GET"
        || method == "POST" || method == "HEAD");
    
    // prepare Query String: ?name=value&...
    // each name passed must be followed by a value,
    // which will be URL-encoded
    string qs = "";
    if (query != null)
        for (int i = 0; i < query.Length; i++)
            if (query[i] != null)
                qs += (i % 2 == 0
                    ? (qs.Length == 0 ? "" : "&")
                    + query[i].ToString() + '='
                    : HttpUtility.UrlEncode(
                        query[i].ToString()));

    // append eventual QS to url
    _address = url;
    if (method == "GET" && qs.Length > 0)
        _address += (_address.IndexOf('?')<0 ? '?' : '&') + qs;

    // if Robots is used, show in the Debug window the first Disallow match
    // (if any) from robots.txt, and optionally stop in an Assert
    if (_robots != null)
    {
        string match = _robots.Match(_address);
        if (match != null)
            Debug.WriteLine(
                "[ROBOTS] DISALLOWED (" + match + ") " + _address);
        Debug.Assert(match == null || !_stopOnDisallow);
    }

    // prepares the HTTP request
    HttpWebRequest request
        = (HttpWebRequest)WebRequest.Create(_address);
    request.Method = method;
    request.Timeout = _navigationTimeout;
    _headers = null;

    if (_userAgent != null && _userAgent.Length > 0)
        request.UserAgent = _userAgent;
    if (_proxy != null && _proxy.Length > 0)
        request.Proxy = new WebProxy(_proxy, true);

    // prepares stream of bytes for a POST request
    if (method == "POST" && qs.Length > 0)
    {
        byte[] bytes = Encoding.Default.GetBytes(qs);
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = bytes.Length;
        using (Stream stream = request.GetRequestStream())
            stream.Write(bytes, 0, bytes.Length);
    }

    // optional wait between navigation
    if (method!="HEAD")
        _position = 0;
    DateTime now = DateTime.Now;
    if (_navigationWaitTime > 0)
    {
        int wait = _navigationWaitTime
            - ElapsedTime(_navigationLast, now);
        if (wait > 0)
        {
            Thread.Sleep(wait);
            now = DateTime.Now;
        }
    }

    // send the HTTP request,
    // get back the HTTP response and the content
    if (method != "HEAD")
        _content = null;
    using (HttpWebResponse response
        = (HttpWebResponse)request.GetResponse())
    {
        _headers = response.Headers;
        if (method != "HEAD")
            using (StreamReader reader
                = new StreamReader(response.GetResponseStream()))
                _content = reader.ReadToEnd();
    }

    // update times
    _navigationLast = DateTime.Now;
    int elapsed = ElapsedTime(now, _navigationLast);
    _navigationTime += elapsed;
    if (method != "HEAD")
        _page++;

    if (_trace)
        Debug.WriteLine("[TRACE] " + elapsed.ToString("#,##0")
            + "ms (Page " + _page + ") " + method + " " + _address);
}

The two most frequent patterns crawlers follow when navigating through web pages are:

  1. Start from the home page, then recursively find and navigate through any other local (or remote) link found in that page. This is the way search engines discover all the other pages that are part of, or linked with, your web site.
  2. Start from the first page of some data results and sequentially move to the next page, until no other data is found. This is the way most web scrapers and data extractors navigate across pages, and this is what we will focus on here.

When you start building a web scraper to extract semi-structured data from some result pages, the first thing you need is the starting URL. When data is spread across multiple pages, look at the URL of each page. If it ends in ? followed by property=value pairs separated by &, you're lucky: the site you're crawling uses the GET method. You can then explicitly construct the URL of each next page in your code, without having to deal with forms and complex POST requests or to detect the next page URL from the current page content.

Most often, the result pages you are crawling have a query string property that gives you either the page number, or the offset of the current page together with the number of records per page. Experiment a bit with your search results and chances are you'll find such properties in the URL shown in your browser. All you have to do then is call Navigate once with your starting page, then replace the relevant property value in your query string for each following page of a multi-page result.
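The idea of replacing one property value per page can be sketched as below. PagingSketch and PageUrls are illustration names, and the offset property name, page size and page limit are assumptions you would replace with whatever the target site really exposes.

```csharp
using System;
using System.Collections.Generic;

public static class PagingSketch
{
    // Yields one URL per result page by appending an offset property.
    // offsetName, pageSize and maxPages are hypothetical inputs: check the
    // URLs shown in your browser to find the real property and step.
    public static IEnumerable<string> PageUrls(
        string baseUrl, string offsetName, int pageSize, int maxPages)
    {
        for (int page = 0; page < maxPages; page++)
            yield return baseUrl
                + (baseUrl.IndexOf('?') < 0 ? "?" : "&")
                + offsetName + "=" + (page * pageSize);
    }
}
```

In a real scraper you would pass each URL to Navigate and break out of the loop as soon as a page yields no more records, rather than always walking up to the page limit.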

Our WebScrapperBase class provides Page and MaxPages properties which, together with the CanCapturePage method, help you in multi-page navigation. Page gives you the number of pages you came across, and with MaxPages you can set a limit for the maximum number of pages you want to crawl.

Static Web Page Content Crawling

Navigate stores the page content, as text, in an internal Content string buffer. As we said before, many crawlers use regular expressions to detect patterns and extract data from this content. We'll provide alternative sequential string parsing functionality, which is easy to understand, especially if you are not so familiar with regular expressions. String parsing can even be faster than regular expressions, provided you avoid instantiating too many substrings and use StringBuilder objects when you need to build text. This is because any string is essentially an array of characters that can be scanned with a start and an end index. The String class is already rich in functionality, but we will provide some simple wrappers that will make your life much easier when implementing a data scraper.

There are two main use cases when scraping data from string content:

  1. Detect if a certain occurrence of a substring exists.
  2. Extract a substring located between two other different substrings.

Sequential parsing implies a forward-only cursor, or pointer, that leaves behind the text you have already looked at. This is what the FindNext and ExtractNext methods do, unlike Find and Extract, which do not change the Position:

// Returns true if text found in a content,
// false for null content, true for null/empty text
public bool Find(string content, string text, bool caseSensitive)
{
    return (content != null
        && (text == null || text.Length == 0
        || content.IndexOf(text,
        caseSensitive ? StringComparison.InvariantCulture
        : StringComparison.InvariantCultureIgnoreCase) >= 0));
}

// Returns substring delimited by start and end within content.
// Returns null for null content or text not found.
// If start null/empty, search from the beginning of content string.
// If end null/empty, search to the end of content string.
public string Extract(string content, string start, string end)
{
    if (content == null)
        return null;

    // find start offset
    int i = 0;
    if (start != null && start.Length > 0
        && (i = content.IndexOf(start)) < 0)
        return null;
    if (start != null)
        i += start.Length;

    // find end offset
    int j = content.Length;
    if (end != null && end.Length > 0
        && (j = content.IndexOf(end, i)) < 0)
        return null;

    // returns substring between start offset and end offset
    return content.Substring(i, j - i);
}

// Returns true if text found in _content,
// false for null _content, true for null/empty text
public bool Find(string text, bool caseSensitive)
{
    return Find(_content, text, caseSensitive);
}

// Tries each string passed in text, in the order given, and stops
// at the first one found in _content at or after _position.
// On a match, advances the _position seek pointer after it
// and returns true. Returns false for null _content or no match.
public bool FindNext(params string[] text)
{
    Debug.Assert(text != null && text.Length > 0);
    if (_content == null)
        return false;

    foreach (string s in text)
    {
        Debug.Assert(s != null && s.Length > 0);

        // returns true for first good match
        // and advances the seek pointer after the text string
        int position = _content.IndexOf(s, _position);
        if (position >= 0)
        {
            _position = position + s.Length;
            return true;
        }
    }
    return false;
}

// Returns HTML-decoded substring delimited
// by start and end within _content.
// Returns null for null _content or text not found.
// If start null/empty, extract from the current _position.
// If end null/empty, extract to the end of _content string.
// After a good match, advances the _position seek pointer
// after the end string.
public string ExtractNext(string start, string end)
{
    if (_content == null)
        return null;

    // find start offset
    if (start != null && start.Length > 0
        && !FindNext(start))
        return null;

    // find end offset and advances the seek pointer
    // after the end text
    int position = _position;
    if (end == null || end.Length == 0)
        _position = _content.Length;
    else if (!FindNext(end))
        return null;

    // returns HTML-decoded substring 
    // between start offset and end offset
    return HttpUtility.HtmlDecode(
        _content.Substring(position,
        _position - position
        - (end == null ? 0 : end.Length)));
}

There is no magic in clearly detecting where the data to be extracted starts and ends. What you do is figure out what unique anchor text from the page source you can use as delimiters. Assuming you are familiar with HTML, look at the Source of your page and first try to find some unique text (HTML and end-user text combined) that you can ALWAYS identify as the delimiter where your data starts. You may call FindNext once or several times to move to that point. Then, for each group or record, call ExtractNext with two other fixed string delimiters for each field. Never choose variable end-user text in your string delimiters. Don't use very long strings either.
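Here is how this FindNext / ExtractNext workflow could look on a small page. This is only a sketch under stated assumptions: CursorSketch is a simplified stand-in for the article's class (single-string FindNext, no HTML decoding), and the table id, cell classes and record layout in ReadRecords are a hypothetical page structure, not taken from any real site.

```csharp
using System;
using System.Collections.Generic;

public class CursorSketch
{
    private readonly string _content;
    private int _position;

    public CursorSketch(string content) { _content = content; }

    // Moves the cursor just past the next occurrence of text.
    public bool FindNext(string text)
    {
        int p = _content.IndexOf(text, _position, StringComparison.Ordinal);
        if (p < 0)
            return false;
        _position = p + text.Length;
        return true;
    }

    // Returns the text between the next start and end delimiters,
    // leaving the cursor after the end delimiter; null if not found.
    public string ExtractNext(string start, string end)
    {
        if (!FindNext(start))
            return null;
        int from = _position;
        if (!FindNext(end))
            return null;
        return _content.Substring(from, _position - end.Length - from);
    }

    // Hypothetical layout: one <tr> per record, a title cell then a year cell.
    public static List<string> ReadRecords(string html)
    {
        List<string> records = new List<string>();
        CursorSketch cursor = new CursorSketch(html);

        // move to the unique anchor where the data starts
        if (!cursor.FindNext("<table id=\"results\">"))
            return records;

        // one iteration per record, two fixed delimiters per field
        while (cursor.FindNext("<tr>"))
        {
            string title = cursor.ExtractNext("<td class=\"title\">", "</td>");
            string year = cursor.ExtractNext("<td class=\"year\">", "</td>");
            if (title == null || year == null)
                break;
            records.Add(title + " (" + year + ")");
        }
        return records;
    }
}
```

Note that every delimiter here is fixed markup, never end-user text, and each one is short enough to survive small cosmetic changes to the page.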

These delimiters are very sensitive to future changes. In fact, this is where most web scrapers have to adapt when the page structure or layout changes. Site owners may choose to change the content of some pages, to add advertising or to present some data in other colors or fonts. Or they may later decide they want a better layout for their pages. So pay a lot of attention when choosing these data delimiters, but don't spend a huge amount of time on them; they may change later anyway and you'll have to adapt your code.
