Download Source Code: YahooMoviesScrapper.zip - 18.17KB
Simple Yahoo!Movies Web Scraper
YahooMoviesScrapper is a first, very simple implementation of a web scraper, based on our WebScrapperBase class. As a demo, we'll extract data from some regular tables on Yahoo!Movies. The source Yahoo!Movies pages and our real-time scraped and formatted data pages can be found here:
- Top Rated Movies in Theaters Now - DEMO HERE
- Top Rated Movies of All Time - DEMO HERE
- Bottom Rated Movies of All Time - DEMO HERE
The scraper navigates to a start page, http://movies.yahoo.com/mvc/top10?lid=LID, where LID is 20, 1 or 2. Because all three pages share this URL pattern, a single Parse method is enough. Looking at those Yahoo!Movies pages, we found that each data row starts with a serial number in bold, and nothing else in the page matches this row delimiter. Each row has a hyperlink whose href gives us the address of the movie's detail page and whose text gives us the movie's title. We collect only these two values and ignore everything else.
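The row scan described above can be tried out without the real base class. The following is only a minimal sketch: CursorScanner, ScanDemo and ParseRows are hypothetical names, the FindNext and ExtractNext behavior is our assumption about how such forward-only cursor helpers might work (WebScrapperBase itself is not shown here), and the sample markup is made up rather than copied from Yahoo!Movies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the cursor helpers of WebScrapperBase.
// The real class is not shown in this article; behavior here is an assumption.
public class CursorScanner
{
    private readonly string _text;
    private int _pos; // current scan position; only ever moves forward

    public CursorScanner(string text)
    {
        _text = text;
        _pos = 0;
    }

    // Moves the cursor just past the next occurrence of marker.
    // Returns false (cursor unchanged) when marker does not occur again.
    public bool FindNext(string marker)
    {
        int i = _text.IndexOf(marker, _pos, StringComparison.Ordinal);
        if (i < 0) return false;
        _pos = i + marker.Length;
        return true;
    }

    // Returns the text between the next 'start' and the following 'end',
    // leaving the cursor just past 'end'. Returns null if either is missing.
    public string ExtractNext(string start, string end)
    {
        int s = _text.IndexOf(start, _pos, StringComparison.Ordinal);
        if (s < 0) return null;
        s += start.Length;
        int e = _text.IndexOf(end, s, StringComparison.Ordinal);
        if (e < 0) return null;
        _pos = e + end.Length;
        return _text.Substring(s, e - s);
    }
}

public static class ScanDemo
{
    // Collects (href, title) pairs from rows delimited by <b>1.</b>, <b>2.</b>, ...
    public static List<KeyValuePair<string, string>> ParseRows(string html)
    {
        var rows = new List<KeyValuePair<string, string>>();
        var scanner = new CursorScanner(html);
        for (int i = 1; scanner.FindNext("<b>" + i + ".</b>"); i++)
        {
            string href = scanner.ExtractNext("<a href=\"", "\"");
            string title = scanner.ExtractNext(">", "</a>");
            if (href == null || title == null) break; // malformed row: stop
            rows.Add(new KeyValuePair<string, string>(href, title));
        }
        return rows;
    }

    public static void Main()
    {
        // Made-up sample markup, not actual Yahoo!Movies HTML.
        string html =
            "<tr><td><b>1.</b></td><td><a href=\"/movie/1\">First Film</a></td></tr>" +
            "<tr><td><b>2.</b></td><td><a href=\"/movie/2\">Second Film</a></td></tr>";

        foreach (var row in ParseRows(html))
            Console.WriteLine(row.Key + " -> " + row.Value);
    }
}
```

Because the cursor only moves forward, the loop stops at the first serial number that is missing, so a page with ten rows needs no explicit row count. The sketch returns the extracted text exactly as it appears in the markup; whether the real base class decodes or trims it is not something we can tell from this article.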
To return the collected data as an HTML numbered list, we override the GetData method. Note that this method is also called automatically by SaveAs, so we don't need to do anything else to save the data in an HTML file:
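// The actual parser implementation
private int Parse(string href)
{
    Reset(true);
    Address = href;

    // Each data row starts with its serial number in bold: <b>1.</b>, <b>2.</b>, ...
    for (int i = 1; FindNext("<b>" + i + ".</b>"); i++)
        AddRow(
            ExtractNext("<a href=\"", "\""),  // address of the movie's detail page
            ExtractNext(">", "</a>"));        // movie title

    return UpdateTotalTime();
}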
// Returns row-based data as a numbered list
public override string GetData(bool html)
{
    if (!html)
        return base.GetData(false);

    StringBuilder sb = new StringBuilder("<ol>");
    foreach (DataRow row in Table.Rows)
        sb.AppendLine("<li>"
            + "<a rel=\"nofollow\" target=\"_blank\""
            + " href=\"" + row[0].ToString() + "\">"
            + HttpUtility.HtmlEncode(row[1].ToString())
            + "</a></li>");
    return sb.ToString() + "</ol>";
}
In the open-source project, which you can download and run locally, the Main method that creates and runs the scrapers looks like this:
/// <summary>
/// Parse some regular tables from Yahoo!Movies
/// and save them as HTML files
/// </summary>
static void Main()
{
    YahooMoviesScrapper scrapper = new YahooMoviesScrapper();
    scrapper.Trace = true;

    scrapper.ParseTopRatedNow();
    scrapper.SaveAs("Top Rated in Theaters.htm");

    scrapper.ParseTopRated();
    scrapper.SaveAs("Top Rated of All Time.htm");

    scrapper.ParseBottomRated();
    scrapper.SaveAs("Bottom Rated of All Time.htm");

    Debug.WriteLine("Total Usage Time: "
        + scrapper.TotalUsageTime.ToString("#,###") + "ms");
}