Download Source Code: HockeyWebScrapper.zip - 12.08KB
Another simple application of a real-time web scraper for static web content, based on the WebScrapperBase class introduced in a previous article. Here we focus on transforming and improving the information found in the source page. The extracted real-time data is presented on the last page.
Overview
In the Create a Web Scrapper Base Class article, we introduced a simple abstract base class and the general requirements for a managed C# web data scraper. Here we continue with a simple online application.
Data scraping is frequently used to expose online information already shown somewhere on the web, but in a different presentation format. When the page with the scraped data is requested, a server-side scraper re-collects that information from the source site in real time. It makes little sense to display the information exactly as it is presented there, unless you aggregate it with other similar collected data. Most frequently, the information is translated into a different display format, possibly transformed, and enriched with new meaningful data. The query and overall processing time must be short, otherwise the user might leave your page.
Here we implemented a simple data scraper, whose class derives from the generic abstract WebScrapperBase, presented in detail in our previous article. The new HockeyWebScrapper class implements a method, Parse, which goes to CNN's Sports Illustrated NHL Standings page, grabs the data from the Eastern and Western Conference tables (data which is refreshed daily during the regular hockey season) and creates a combined table with all teams together.
If you are a hockey fan and regularly check these pages for NHL (National Hockey League) updates, you may have been a bit frustrated that there is no overall standings page listing all teams together, from best to worst, ordered by their current number of Points. The reason is that playoff selection is done at the Conference and Division level.
Anyway, if you want to see different views of this data, you can sort the table of results by other columns, such as Goals Scored or Allowed, or the number of Wins or Losses. We also show friendly names for the column headers, for readers who cannot figure out what GP, W, L, OT, etc. stand for. Just as important, we show the number of games remaining. Every team plays 82 games during a regular NHL season, so as the playoffs approach, the number of games left is more useful than the number of games played.
Implementation
With the WebScrapperBase class, the implementation of the actual scraper, in the Parse method, is simple and almost self-explanatory.
We first prepare our DataTable for collection by creating strongly typed data columns. We want strong typing because we later want to sort the data by different columns holding integer values. The column types have to be set now, while the table is still empty.
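To illustrate why the column type matters for sorting, here is a minimal sketch using a plain System.Data.DataTable. The TypedColumnSortDemo class and its SortDescending helper are hypothetical names made up for this illustration; they are not part of WebScrapperBase.

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical helper, for illustration only (not part of WebScrapperBase).
public static class TypedColumnSortDemo
{
    // Builds a one-column table of the given type, fills it,
    // and returns the values sorted in descending order.
    public static string[] SortDescending(Type columnType, params object[] values)
    {
        DataTable table = new DataTable();

        // The column type is chosen while the table is still empty.
        table.Columns.Add("Points", columnType);

        foreach (object value in values)
            table.Rows.Add(value);

        // An empty filter keeps all rows; the second argument is the sort order.
        DataRow[] sorted = table.Select("", "[Points] DESC");
        return sorted.Select(row => row["Points"].ToString()).ToArray();
    }
}
```

With an int column, the values 9, 10 and 100 come back as 100, 10, 9. With a string column, the same values are compared as text, so "9" sorts ahead of "100".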
Then we simply navigate to the Sports Illustrated web page and transparently collect its static HTML result. No user interface is involved, no images are imported, and no dynamic script is executed. In the resulting string, we look for each line with a team entry. We found that the marker " cnnBold\"><a href=\"/hockey/nhl/teams/" reliably identifies a team table row and nothing else. For each match we add a row, starting with the team page address and the team name, and we also add the derived Games Left column. At the end, we simply check that we collected all 30 NHL teams:
/// <summary>
/// Parse and return data in the DataTable
/// </summary>
/// <returns></returns>
public int Parse()
{
    const int TOTAL_TEAMS = 30;
    const int TOTAL_GAMES = 82; // during a regular NHL season
    const string TEAM_LINK = "http://sportsillustrated.cnn.com/"
        + "hockey/nhl/teams/";

    // Collect all data and add games Left
    SetColumns("address", "Team",
        "Played", "Left", "Wins", "Losses", "OT Losses",
        "Points", "Goals Scored", "Goals Allowed",
        "At Home", "On the Road", "Last 10 Games");

    // columns 2..9 (Played through Goals Allowed) hold integers
    for (int i = 2; i <= 9; i++)
        _table.Columns[i].DataType = typeof(int);

    // go to the NHL Conference Standings page
    Address = "http://sportsillustrated.cnn.com/"
        + "hockey/nhl/standings/conference/";

    // for each team row,
    // in any of the Eastern/Western Conference tables
    // (arguments are evaluated left to right, so 'played'
    // is assigned before TOTAL_GAMES - played is computed)
    int played = 0;
    while (FindNext(" cnnBold\"><a href=\"/hockey/nhl/teams/"))
        AddRow(TEAM_LINK + ExtractNext("", "\""),
            ExtractNext(">", "</a>"),
            played = Convert.ToInt32(ExtractNext("<td>", "</td>")),
            TOTAL_GAMES - played,
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            Convert.ToInt32(ExtractNext("<td>", "</td>")),
            ExtractNext("<td>", "</td>"),
            ExtractNext("<td>", "</td>"),
            ExtractNext("<td>", "</td>"));

    // make sure we collected all 30 NHL teams
    Debug.Assert(Rows == TOTAL_TEAMS);

    // just return overall time statistics
    return UpdateTotalTime();
}
If you look back at the WebScrapperBase class, we said that the GetData method must usually be overridden if we want a customized layout for the collected data. The returned string can be either text with tab separators between field values, or an HTML table. It can be dynamically embedded into an HTML page at run time (this is what our online demo does), or saved into a local file with SaveAs (this is what the downloadable project does).
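As an aside on the Parse method above: the FindNext and ExtractNext calls behave like a forward-only cursor over the downloaded HTML. The sketch below is a hypothetical stand-in written for this illustration; the HtmlCursor class is an assumption, and the real WebScrapperBase implementation may differ.

```csharp
using System;

// Hypothetical stand-in for the scanning helpers of WebScrapperBase;
// the actual base class may be implemented differently.
public class HtmlCursor
{
    private readonly string _html;
    private int _pos; // current scan position, only ever moves forward

    public HtmlCursor(string html)
    {
        _html = html;
    }

    // Moves the cursor just past the next occurrence of marker.
    // Returns false (and leaves the cursor alone) if there is none.
    public bool FindNext(string marker)
    {
        int at = _html.IndexOf(marker, _pos, StringComparison.Ordinal);
        if (at < 0) return false;
        _pos = at + marker.Length;
        return true;
    }

    // Returns the text between the next 'start' and the following 'end',
    // leaving the cursor just past 'end'. Returns null if either is missing.
    public string ExtractNext(string start, string end)
    {
        int from = _html.IndexOf(start, _pos, StringComparison.Ordinal);
        if (from < 0) return null;
        from += start.Length;

        int to = _html.IndexOf(end, from, StringComparison.Ordinal);
        if (to < 0) return null;

        _pos = to + end.Length;
        return _html.Substring(from, to - from);
    }
}
```

Because the cursor only advances, each successive ExtractNext("<td>", "</td>") call reads the next cell of the same row, which is why Parse can list the calls in column order.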
GetData is the only other method of our custom scraper; it simply creates a better table layout, for HTML output only:
/// <summary>
/// Returns HTML table with data scraped and sorted
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
public override string GetData(bool html)
{
    // plain-text output is handled by the base class
    if (!html)
        return base.GetData(false);

    StringBuilder sb = new StringBuilder();
    sb.AppendLine("<h2>NHL Overall Standings</h2>"
        + "<div><table border=\"0\">"
        + "<thead style=\"background-color:gainsboro\"><tr>");

    // header row: skip column 0 (the team page address)
    for (int i = 1; i < Table.Columns.Count; i++)
        sb.AppendLine("<td> "
            + Table.Columns[i].ColumnName + " </td>");
    sb.AppendLine("</tr></thead><tbody>");

    foreach (DataRow row in Table.Rows)
    {
        object[] vals = row.ItemArray;

        // first cell: team name linked to its team page
        sb.AppendLine("<tr><td>"
            + "<a rel=\"nofollow\" target=\"_blank\" href=\""
            + vals[0].ToString() + "\">"
            + vals[1].ToString() + "</a></td>");

        // remaining cells: one per data column
        for (int i = 2; i < vals.Length; i++)
            sb.Append("<td>" + vals[i].ToString() + "</td>");
        sb.AppendLine("</tr>");
    }

    sb.AppendLine("</tbody></table></div>");
    return sb.ToString();
}
To execute the parser, sort the result by the total number of points, and return a partial HTML fragment built from the result table, we need only a few lines of code:
HockeyWebScrapper scrapper = new HockeyWebScrapper();
scrapper.Parse();
scrapper.SortTable("[Points] DESC");
return scrapper.GetData(true);
The limitation of this implementation is that we sort the data only by the total number of points. When more than one team has the same number of points, the sort should include other criteria, such as the number of goals scored or allowed.
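One possible way to lift that limitation is a multi-column sort expression, which System.Data.DataView supports directly. The sketch below is illustrative only: the TieBreakSortDemo class is a hypothetical name, the choice of Wins and Goals Scored as secondary criteria is an assumption rather than the league's official tie-breaking rules, and whether SortTable accepts the same expression depends on how the base class implements it.

```csharp
using System.Data;

// Hypothetical helper, for illustration only.
public static class TieBreakSortDemo
{
    // Returns team names ordered by Points, then Wins, then Goals Scored,
    // all descending. Expects int columns with these names plus a
    // string column named "Team".
    public static string[] TeamsInOrder(DataTable table)
    {
        DataView view = new DataView(table);

        // Later columns only matter when the earlier ones are equal.
        view.Sort = "[Points] DESC, [Wins] DESC, [Goals Scored] DESC";

        string[] names = new string[view.Count];
        for (int i = 0; i < view.Count; i++)
            names[i] = (string)view[i]["Team"];
        return names;
    }
}
```

Column names containing spaces, such as Goals Scored, need the square brackets in the sort expression.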
The following page shows an online demonstration of this scraper.
