Download Source Code: JobsScrapper.zip - 14.06KB
Collector and aggregator of advertised jobs, for indexing purposes. This is just a demo, with a web scraper based on WebScrapperBase that can be customized, enhanced and adapted for other markets, situations and sites. It demonstrates the use of the POST and HEAD HTTP methods and a generic algorithm for parsing data entries from multiple result pages.
Overview
In Create a Web Scrapper Base Class, we introduced an abstract base class with the main functionality that web scrapers usually require. We implemented a first, simple web scraper to extract some tables from the Yahoo! Movies web site in real time. A similar example, Real-Time Web Scrapper for NHL Standings, was built later to illustrate the value of real-time data scraping and transformation.
Data does not always have to be collected and presented in real time. Search engines periodically crawl sites and store partial information in a local database. Data scraping for indexing purposes does not collect all the information from a web site. Instead, the crawler collects and exposes only a minimal amount of text, shown in result lists next to the URL the user can follow to get the full content from the original source. Site owners welcome indexing in most cases, because most users go to a search engine first to find web pages that contain certain combinations of words, known as keywords. This is how they may find your site.
We'll continue the series of examples based on the initial web scraper class with a scraper similar to a search engine. Our JobsScrapper collects job entries from various job sites, for indexing purposes. It does not parse the full job description pages; it only saves the links to them. In fact, JobsScrapper is a search engine itself, and because it serves a specialized niche, it is known as a vertical search engine. In this demo we implement custom parsers for only a couple of sites, each meant to demonstrate a different aspect of data scraping. Similar crawlers can be implemented for other markets, such as classified ads, real estate, travel or product catalogs. And yes, each site usually needs its own specific implementation, possibly reusing parts of other crawlers you have built.
Typical Layout
On these job sites, a search typically goes through three or four steps, each with its own type of screen, and the sequence is almost always the same:
1. Define your search criteria, such as location, domain, occupation, and keywords in the job description and title. Then submit them with a Search command button, which triggers a GET or POST request.
2. Get an initial list of search results, where a partial description of each job is shown. A search may return hundreds or thousands of entries, and it doesn't make sense to show them all on a single page. This is why, in most cases, you need to navigate to a next page, typically through a hyperlink with the page number or a Next button. Result entries may be presented as TABLE rows, DIVs, P paragraphs, LI list items...
3. Select a single job and get its full description page by following the hyperlink on the job title.
An indexing web crawler is mostly concerned with step (2), where the actual data scraping is performed.
Step (1) should be avoided if possible. We try as much as possible to deal only with static HTML content, and we don't want to process events on dynamic form controls. When the form uses the GET method, the search criteria can be read directly from the query string of the result list page: control values appear as name=value pairs, separated from each other by & and from the base URL by ?. When we are not so lucky and see no query string in the result page URL, the form was most likely submitted with POST, and we should look at the FORM tag and its inner controls to rebuild that request manually in our scraper.
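To make the query string layout concrete, here is a minimal sketch that splits a result page URL into its name=value pairs. The class and method names (QueryStringSketch, ParseQuery) are made up for this illustration and are not part of WebScrapperBase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of WebScrapperBase.
public static class QueryStringSketch
{
    // Returns the name=value pairs found after the '?' in a URL.
    // A URL without a query string yields an empty dictionary.
    public static Dictionary<string, string> ParseQuery(string url)
    {
        var result = new Dictionary<string, string>();
        int q = url.IndexOf('?');
        if (q < 0 || q == url.Length - 1)
            return result;

        string query = url.Substring(q + 1);

        // Drop a trailing #fragment, if any.
        int hash = query.IndexOf('#');
        if (hash >= 0)
            query = query.Substring(0, hash);

        foreach (string pair in query.Split('&'))
        {
            if (pair.Length == 0)
                continue;
            int eq = pair.IndexOf('=');
            string name = eq < 0 ? pair : pair.Substring(0, eq);
            string value = eq < 0 ? "" : pair.Substring(eq + 1);
            // If a name repeats, the last value wins.
            result[Decode(name)] = Decode(value);
        }
        return result;
    }

    // Form encoding writes spaces as '+'; the rest is percent-encoded.
    private static string Decode(string s)
    {
        return Uri.UnescapeDataString(s.Replace('+', ' '));
    }
}
```

Calling ParseQuery on a result page URL gives you the parameter names to try removing one by one, to see which of them the site really requires.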
Step (3) is usually not implemented, if the index engine has enough information to display just a short minimal description for each entry. The goal of a search engine is not to steal all data of another site, but to collect and show links to content data from that site.
Collect Monster.com Job Entries
Monster.com is one of the biggest job sites. Its search criteria page uses the GET method, so it's easy to deduce the parameter values from its URL. In our demo project, we used fixed search criteria that return all IT jobs in Silicon Valley, California. There is no need to know how the page maps each parameter value. Just by looking at the URLs of the first and second result pages, and experimenting with leaving parameters out of the query string, you can conclude that parameters like fn, lid, cy and JSNONREG are required.
The page number is held by the pg parameter. A mechanism for moving to the next page is all you need, because your crawler acts like a forward-only reader that handles one page at a time. To find out whether there is a next page, check whether the >> hyperlink exists. When a result page has no such hyperlink, there are no more pages to explore.
An indexing crawler collects hyperlinks, both addresses and titles, to all job entries, and these links should point back to the original web site. For Monster, it's easy to determine that "getjob.asp?JobID=" is the pattern for any full job description page. The JobID is all we need; any other parameters in a job URL are optional.
We can also use this partial string to find the next job entry in the list. When we extract field values, we may need to advance the current position cursor with one or several FindNext calls between fields.
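The forward-only paging idea can be sketched on its own, independently of any site. In this illustration the page download is passed in as a delegate so the loop can be tried without network access. PagingSketch, CrawlForward, fetchPage and nextMarker are hypothetical names, not members of WebScrapperBase.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a forward-only page reader.
public static class PagingSketch
{
    // Fetches page 1, 2, 3... and stops when a page no longer contains
    // the "next page" marker, or when maxPages has been reached.
    public static List<string> CrawlForward(
        Func<int, string> fetchPage, string nextMarker, int maxPages)
    {
        var pages = new List<string>();
        for (int page = 1; page <= maxPages; page++)
        {
            string html = fetchPage(page);
            pages.Add(html);

            // No "next" hyperlink means this was the last result page.
            if (html.IndexOf(nextMarker, StringComparison.Ordinal) < 0)
                break;
        }
        return pages;
    }
}
```

The maxPages cap plays the same role as the "max pages/records reached" check in the parsers: it keeps a crawl bounded even if the site always shows a next link.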
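The cursor mechanics can be illustrated with a small stand-alone class. This is only a sketch of how FindNext and ExtractNext appear to behave from the way they are called in this article; the real WebScrapperBase implementation is not shown here and may differ. In particular, moving the cursor past the end token after an extraction is an assumption, and TextCursor is a hypothetical name.

```csharp
using System;

// Hypothetical stand-in for the cursor logic of the base class.
public sealed class TextCursor
{
    private readonly string _text;
    private int _pos;

    public TextCursor(string text)
    {
        _text = text ?? "";
    }

    // Moves the cursor just past the next occurrence of token.
    // Returns false, leaving the cursor unchanged, when not found.
    public bool FindNext(string token)
    {
        int i = _text.IndexOf(token, _pos, StringComparison.Ordinal);
        if (i < 0)
            return false;
        _pos = i + token.Length;
        return true;
    }

    // Returns the text between the next start and end tokens and moves
    // the cursor past the end token (assumption). An empty start token
    // means "from the current position". Returns null when not found.
    public string ExtractNext(string start, string end)
    {
        int s = _text.IndexOf(start, _pos, StringComparison.Ordinal);
        if (s < 0)
            return null;
        s += start.Length;
        int e = _text.IndexOf(end, s, StringComparison.Ordinal);
        if (e < 0)
            return null;
        _pos = e + end.Length;
        return _text.Substring(s, e - s);
    }
}
```

With a cursor like this, a row is read by one FindNext on the job link pattern, followed by alternating FindNext and ExtractNext calls that walk through the cells of the row in document order.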
Here is the implementation of our ParseMonster method:
    /// <summary>
    /// Parse lists of jobs from Monster.com
    /// Search Criteria: IT Jobs in Silicon Valley/CA
    /// Start page: http://jobsearch.monster.com/Search.aspx
    /// ?q=&fn=660&lid=356&cy=us&JSNONREG=1&pg=1
    /// </summary>
    /// <returns>total time spent on crawling</returns>
    public int ParseMonster()
    {
        Reset(false);
        do
        {
            // navigate to first/next page
            Navigate("http://jobsearch.monster.com/Search.aspx", "GET",
                "fn", "660", "lid", "356", "cy", "us", "JSNONREG", "1",
                "pg", Page + 1);

            // for each record in current page
            while (CanCaptureRow(FindNext(
                "http://jobview.monster.com/getjob.asp?JobID=")))
            {
                // keep only the JobID; other URL parameters are optional
                string jobAddress
                    = "http://jobview.monster.com/getjob.asp?JobID="
                    + ExtractNext("", "&");
                string jobTitle = ExtractNext(">", "</a>");
                FindNext("<span ");
                string company = ExtractNext(">", "</span>");
                FindNext("<span ");
                string pubDate = ExtractNext(">", "</");
                FindNext("</td>"); FindNext("<span ");
                string location = ExtractNext(">", "</span>");

                // (try to) get length/modified of detail job page,
                // and save data
                NavigateHead(jobAddress);
                AddRow(jobAddress, jobTitle, company, location, pubDate,
                    _contentLength, _lastModified);
            }
            // repeat until max pages/records reached or no Next
        } while (CanCapturePage(Find("> >></a>", true)));

        return UpdateTotalTime();
    }

Collect ComputerJobs Job Entries
ComputerJobs.com is another job site, for IT jobs only. After experimenting with the search criteria, you can see that all Northern California jobs are returned by a URL with the northerncalifornia subdomain. This makes things easier.
What is specific to their next-page mechanism is that a siteid and a searchid, values determined on the first page, are required to navigate through the list. We'll detect these values from the first job entry URL, on the first page only:
    /// <summary>
    /// Parse lists of IT jobs from ComputerJobs.com
    /// Search Criteria: Northern California (as subdomain!)
    /// Start page: http://www.northerncalifornia.computerjobs.com
    /// /job_results.aspx
    /// </summary>
    /// <returns>total time spent on crawling</returns>
    public int ParseComputerJobs()
    {
        const string START_PAGE
            = "http://www.northerncalifornia.computerjobs.com"
            + "/job_results.aspx";
        string siteid = null;
        string searchid = null;

        Reset(false);
        do
        {
            // navigate to first/next page
            if (Page == 0)
                Address = START_PAGE;
            else
                Navigate(START_PAGE, "GET", "siteid", siteid,
                    "searchid", searchid, "page", Page + 1);

            // for each record in current page
            while (CanCaptureRow(FindNext(
                "href='/job_display.aspx?jobid=")))
            {
                // navigation to page 2,3,... requires a siteid
                // and searchid in QS, which can be found on first
                // job link from Page 1
                string link = ExtractNext("", "'");
                if (Page == 1)
                {
                    siteid = Extract(link, "siteid=", "&");
                    searchid = Extract(link, "searchid=", "&");
                }

                string jobAddress = "http://www.search.computerjobs.com"
                    + "/job_display.aspx?jobid="
                    + Extract(link, "", "&");
                string jobTitle = ExtractNext(">", "</a");
                FindNext(">More<"); FindNext("<tr>");
                string company = ExtractNext("<b>", "</b");
                FindNext("<td "); FindNext("<font ");
                string location = ExtractNext(">", "</td");
                location = Extract(location, "", ",").TrimEnd();
                FindNext("<td "); FindNext("<td "); FindNext("<font ");
                string pubDate = ExtractNext(">", "</td");

                // (try to) get length/modified of detail job page,
                // and save data
                NavigateHead(jobAddress);
                AddRow(jobAddress, jobTitle, company, location, pubDate,
                    _contentLength, _lastModified);
            }
            // repeat until max pages/records reached or no Next
        } while (CanCapturePage(FindNext(">" + (Page+1) + "</a>")));

        return UpdateTotalTime();
    }
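
Both parsers call NavigateHead to learn the size and last-modified date of each job page without downloading its body. The base class code is not shown in this article, so the following is only a sketch of how such a HEAD request could be issued with the standard HttpClient class. HeadRequestSketch, CreateHeadRequest and GetPageInfoAsync are hypothetical names.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical illustration; not the NavigateHead implementation.
public static class HeadRequestSketch
{
    // A HEAD request asks the server for the response headers only.
    public static HttpRequestMessage CreateHeadRequest(string url)
    {
        return new HttpRequestMessage(HttpMethod.Head, url);
    }

    // Returns the Content-Length and Last-Modified values reported by
    // the server. Either value is null when the server does not send
    // it, and both are null when the request does not succeed.
    public static async Task<(long? Length, DateTimeOffset? Modified)>
        GetPageInfoAsync(HttpClient client, string url)
    {
        using (HttpRequestMessage request = CreateHeadRequest(url))
        using (HttpResponseMessage response = await client.SendAsync(request))
        {
            if (!response.IsSuccessStatusCode)
                return (null, null);

            return (response.Content.Headers.ContentLength,
                    response.Content.Headers.LastModified);
        }
    }
}
```

Not every server answers HEAD requests or reports both headers, which is why the parsers only "try to" get these values. Storing them with each row lets a later crawl detect whether a job page has changed.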