Apr 18


Download Source Code: YahooMoviesScrapper.zip - 18.17KB

Simple Yahoo!Movies Web Scraper

YahooMoviesScrapper is a first, very simple implementation of a web scraper, based on our WebScrapperBase class. As a demo, we'll extract data from a few regular tables on Yahoo!Movies. The source Yahoo!Movies pages and our real-time scraped and formatted data pages can be found here:

The scraper navigates to a start page, http://movies.yahoo.com/mvc/top10?lid=LID, where LID is 20, 1 or 2. Since the three pages differ only in this parameter, a single Parse method is enough. Looking at those Yahoo!Movies pages, we found that each data row starts with a serial number in bold, and nothing else on the page matches this row delimiter. Each row has a hyperlink whose href gives us the address of the movie's detail page and whose text gives us the movie's title. We collect only these two values and ignore everything else.
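FindNext and ExtractNext belong to WebScrapperBase, which is not shown in this post. As a rough sketch of how such cursor-based helpers could work, the stand-alone class below scans an in-memory string instead of a downloaded page. MiniScanner and ParseRows are hypothetical names, and the behavior (plain ordinal substring matching, with a cursor that only moves forward) is an assumption, not the actual base class implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the scanning part of WebScrapperBase.
// ASSUMPTION: the real class may behave differently; this only illustrates the idea.
public class MiniScanner
{
    private readonly string _text;
    private int _pos; // cursor into _text; only ever moves forward

    public MiniScanner(string text)
    {
        _text = text ?? string.Empty;
        _pos = 0;
    }

    // Moves the cursor just past the next occurrence of marker.
    // Returns false (cursor unchanged) when the marker is not found.
    public bool FindNext(string marker)
    {
        int i = _text.IndexOf(marker, _pos, StringComparison.Ordinal);
        if (i < 0) return false;
        _pos = i + marker.Length;
        return true;
    }

    // Returns the text between the next 'start' and the following 'end',
    // leaving the cursor just past 'end'. Returns null when either is missing.
    public string ExtractNext(string start, string end)
    {
        int s = _text.IndexOf(start, _pos, StringComparison.Ordinal);
        if (s < 0) return null;
        s += start.Length;
        int e = _text.IndexOf(end, s, StringComparison.Ordinal);
        if (e < 0) return null;
        _pos = e + end.Length;
        return _text.Substring(s, e - s);
    }

    // Mirrors the loop in Parse: rows are delimited by "<b>1.</b>", "<b>2.</b>", ...
    // Each row yields { href, title }.
    public static List<string[]> ParseRows(string html)
    {
        var scanner = new MiniScanner(html);
        var rows = new List<string[]>();
        for (int i = 1; scanner.FindNext("<b>" + i + ".</b>"); i++)
        {
            string href = scanner.ExtractNext("<a href=\"", "\"");
            string title = scanner.ExtractNext(">", "</a>");
            rows.Add(new[] { href, title });
        }
        return rows;
    }
}
```

The loop stops at the first serial number it cannot find, so a page with ten rows ends the scan when "<b>11.</b>" is missing. Because the cursor never moves backward, each ExtractNext call picks up the hyperlink that follows the serial number just matched.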

To return the collected data as an HTML numbered list, we override the GetData method. Note that SaveAs also calls this method automatically, so we don't need to do anything else to save the data to an HTML file:

// The actual parser implementation
private int Parse(string href)
{
    Reset(true);
    Address = href;

    for (int i = 1; FindNext("<b>" + i + ".</b>"); i++)
        AddRow(
            ExtractNext("<a href=\"", "\""),
            ExtractNext(">", "</a>"));

    return UpdateTotalTime();
}

// Returns row-based data as a numbered list
public override string GetData(bool html)
{
    if (!html)
        return base.GetData(false);

    StringBuilder sb = new StringBuilder("<ol>");
    foreach (DataRow row in Table.Rows)
        sb.AppendLine("<li>"
            + "<a rel=\"nofollow\" target=\"_blank\""
            + " href=\"" + row[0].ToString() + "\">"
            + HttpUtility.HtmlEncode(row[1].ToString())
            + "</a></li>");
    return sb.ToString() + "</ol>";
}

In the open-source project, which you can download and run locally, the Main method that creates and runs the scraper looks like this:

/// <summary>
/// Parse some regular tables from Yahoo!Movies
/// and save them as HTML files
/// </summary>
static void Main()
{
    YahooMoviesScrapper scrapper = new YahooMoviesScrapper();
    scrapper.Trace = true;

    scrapper.ParseTopRatedNow();
    scrapper.SaveAs("Top Rated in Theaters.htm");

    scrapper.ParseTopRated();
    scrapper.SaveAs("Top Rated of All Time.htm");

    scrapper.ParseBottomRated();
    scrapper.SaveAs("Bottom Rated of All Time.htm");

    Debug.WriteLine("Total Usage Time: "
        + scrapper.TotalUsageTime.ToString("#,###") + "ms");
}
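Main calls ParseTopRatedNow, ParseTopRated and ParseBottomRated, which are not listed in this post. Presumably each one passes a start-page URL with a different LID to the private Parse method. The sketch below shows how those addresses could be built. TopListAddress is a hypothetical helper, and the mapping of the LID values 20, 1 and 2 to the three lists is a guess that should be checked against the downloadable source.

```csharp
using System;

// Hypothetical helper, not part of the original project.
// ASSUMPTION: which LID belongs to which list is guessed from the order
// in which the values and the methods appear in the post.
public static class TopListAddress
{
    private const string StartPage = "http://movies.yahoo.com/mvc/top10?lid=";

    public const int TopRatedNow = 20; // assumed: top rated in theaters
    public const int TopRated = 1;     // assumed: top rated of all time
    public const int BottomRated = 2;  // assumed: bottom rated of all time

    // Builds the start-page URL for one of the three known LID values.
    public static string Build(int lid)
    {
        if (lid != TopRatedNow && lid != TopRated && lid != BottomRated)
            throw new ArgumentOutOfRangeException("lid", "Expected 20, 1 or 2.");
        return StartPage + lid;
    }
}
```

Under this assumption, a wrapper such as ParseTopRatedNow could reduce to a single line: return Parse(TopListAddress.Build(TopListAddress.TopRatedNow));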
