Download Source Code: JobsScrapper.zip - 14.06KB
Collect Jobs at Google
Jobs are posted not only on global job sites, but also on company sites. In fact, this is what still attracts people to build scrapers and indexers for vertical markets: while job sites may post only paid job descriptions, your crawler can collect jobs from both job sites and company job sites, and mix them together. You build a collector-aggregator, and your search engine may attract users because it exposes more information.
Google's job site is actually a small section within their web site, which presents available jobs hierarchically. There is no search form; instead, you drill down through hyperlinks to specialized result pages. For our demo, we will go to the Software Engineering in Mountain View/California jobs page, which is two steps away from any actual page with job entries.
The parser first collects all links to intermediate pages in a list, then navigates to each of these pages and walks through it for actual job entries. It's good practice to always support MaxPages and MaxRows, so don't forget to call CanCapturePage and CanCaptureRow in your code; they return false once one of these upper limits has been reached.
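The article does not show how CanCapturePage and CanCaptureRow are implemented, so the following is only a minimal sketch of the idea. The class name CaptureLimits, the counters, and the PageVisited/RowAdded helpers are hypothetical; the assumption is that the bool argument means "something was found", as the later calls in the listings suggest.

```csharp
using System;

// Hypothetical stand-in for the limit checks described above.
// The real JobsScrapper members are not shown here and may differ.
public class CaptureLimits
{
    public int MaxPages { get; set; }
    public int MaxRows { get; set; }

    private int _pages; // pages visited so far
    private int _rows;  // rows captured so far

    // True only if something was found and the page limit is not yet reached.
    public bool CanCapturePage(bool found)
    {
        return found && _pages < MaxPages;
    }

    // True only if something was found and the row limit is not yet reached.
    public bool CanCaptureRow(bool found)
    {
        return found && _rows < MaxRows;
    }

    // Call after navigating to a page or storing a row.
    public void PageVisited() { _pages++; }
    public void RowAdded() { _rows++; }
}
```

With checks shaped like this, a loop such as while (limits.CanCaptureRow(found)) ends either when nothing more is found or when the limit is hit, without a separate test for each condition.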
Google's job list is peculiar in that job entries appear as bullet list items and include the location in the job title, which we will separate into a distinct field. Looking closer, there are no LI tags, just • characters that show up as bullets.
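Before the full method, the title/location separation can be tried in isolation. This is a small sketch, not part of the article's code: the JobTitleParser class and the sample titles in the note below are hypothetical. It splits at the last " - " so that a dash inside the title itself stays in the title.

```csharp
using System;

// Hypothetical helper that isolates the title/location split.
public static class JobTitleParser
{
    // Splits "Title - Location" at the LAST " - ".
    // If no separator is found, the title is returned unchanged
    // together with the supplied default location.
    public static void Split(string raw, string defaultLocation,
        out string title, out string location)
    {
        title = raw;
        location = defaultLocation;
        int i = raw.LastIndexOf(" - ", StringComparison.Ordinal);
        if (i > 0)
        {
            location = raw.Substring(i + 3); // skip the 3-char separator
            title = raw.Substring(0, i);
        }
    }
}
```

For an input like "Software Engineer - Kirkland" this yields the title "Software Engineer" and the location "Kirkland"; a title with no separator keeps the default location.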
/// <summary>
/// Parse lists of Software Engineering jobs
/// from Google's company job site
/// Search Criteria: Mountain View, CA
/// Start page: http://www.google.com/support/jobs/bin/topic.py
/// ?dep_id=1054&loc_id=1116
/// </summary>
/// <returns>total time spent on crawling</returns>
public int ParseJobsAtGoogle()
{
    Reset(false);
    // navigate to the Software Engineering jobs page,
    // which contains links to specialized types of jobs
    Address = "http://www.google.com/support/jobs/bin/topic.py"
        + "?dep_id=1054&loc_id=1116";
    // collect web addresses of all specialized types of jobs
    List<string> links = new List<string>();
    while (FindNext("•")
        && FindNext("href=\"/support/jobs/bin/topic.py?"))
        links.Add("http://www.google.com/support/jobs/bin/topic.py?"
            + ExtractNext("", "\""));
    _page = 0;
    // for each page with actual job entries
    foreach (string link in links)
    {
        // stop as soon as max pages/records reached
        if (!CanCapturePage(true) || !CanCaptureRow(true))
            break;
        // navigate to the page with job entries
        Address = link;
        // repeat until max records reached or no other job in page
        while (CanCaptureRow(FindNext("•") && FindNext(
            "href=\"/support/jobs/bin/answer.py?answer=")))
        {
            string jobAddress = "http://www.google.com/support/jobs/"
                + "bin/answer.py?answer="
                + ExtractNext("", "\"");
            string jobTitle = ExtractNext(">", "</a");
            string location = "Mountain View";
            // separate actual job title from location
            int i = jobTitle.LastIndexOf(" - ");
            if (i > 0)
            {
                location = jobTitle.Substring(i + 3);
                jobTitle = jobTitle.Substring(0, i);
            }
            // (try to) get length/modified of detail job page,
            // and save data
            NavigateHead(jobAddress);
            AddRow(jobAddress, jobTitle, "Google", location, "",
                _contentLength, _lastModified);
        }
    }
    return UpdateTotalTime();
}
Collect Jobs at USBank
Job Opportunities at USBank shows a different kind of company job site layout. All job entries appear in a single result page, in tabular format. This is good, because you need to navigate to only one page for all results. But this result page contains no job descriptions, which is usually a drawback for a real-life crawler: if it also needs a short job description for its index pages, it may have to partially parse the job description pages as well.
Another thing may complicate the implementation: the URL of the DisplayPostings.do result page has no query string, so the site is likely using a POST method. Go back and look at the HTML source of the search criteria page. You'll find that, indeed, the form with the DisplayPostings action uses a POST method. Fortunately, our Navigate can simulate a POST command, and there are only two criteria to pass, both text: selectedStates and selectedJobs. In our demo we'll use the filter values "CA" and "All", for all USBank jobs in California.
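How Navigate builds the POST body is not shown in this article. As a rough illustration only, an HTML form with text fields is normally submitted as an application/x-www-form-urlencoded body of name=value pairs joined by "&". The FormEncoder class below is a hypothetical helper, not part of JobsScrapper; it uses the framework's System.Net.WebUtility.UrlEncode to escape each name and value.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: encodes name/value pairs the way a browser
// would typically submit a simple HTML form with method="POST".
public static class FormEncoder
{
    public static string Encode(params string[] namesAndValues)
    {
        if (namesAndValues.Length % 2 != 0)
            throw new ArgumentException("Expected name/value pairs.");

        StringBuilder body = new StringBuilder();
        for (int i = 0; i < namesAndValues.Length; i += 2)
        {
            if (body.Length > 0)
                body.Append('&'); // separator between pairs
            body.Append(WebUtility.UrlEncode(namesAndValues[i]));
            body.Append('=');
            body.Append(WebUtility.UrlEncode(namesAndValues[i + 1]));
        }
        return body.ToString();
    }
}
```

Under these assumptions, the two criteria used in the demo would travel in the request body as selectedStates=CA&selectedJobs=All rather than in the URL, which is why the result page address has no query string. The article's own USBank listing follows.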
/// <summary>
/// Parse lists of jobs
/// from USBank's company job site
/// Search Criteria: CA (selectedStates), All (selectedJobs)
/// Start: https://www2.usbank.com/jobops/DisplayPostings.do (POST!)
/// Search Crit.: https://www2.usbank.com/jobops/NewJobSearch.do
/// </summary>
/// <returns>total time spent on crawling</returns>
public int ParseJobsAtUSBank()
{
    Reset(false);
    // get list with a POST method, for "All" "CA" jobs at USBank
    Navigate("https://www2.usbank.com/jobops/DisplayPostings.do",
        "POST", "selectedStates", "CA", "selectedJobs", "All");
    // repeat until max records reached or no other job in page
    while (CanCaptureRow(FindNext("href=\"JobDetail.do?")))
    {
        string jobAddress
            = "https://www2.usbank.com/jobops/JobDetail.do?"
            + ExtractNext("", "\"");
        string jobTitle = ExtractNext(">", "</a");
        FindNext("<br>");
        string location = Trim(ExtractNext("<br>", "</td>"));
        string pubDate = Trim(ExtractNext("<br>", "</td>"));
        // (try to) get length/modified of detail job page,
        // and save data
        NavigateHead(jobAddress);
        AddRow(jobAddress, jobTitle, "USBank", location, pubDate,
            _contentLength, _lastModified);
    }
    return UpdateTotalTime();
}
Using the HEAD method
Each of the previous Parse methods calls a private NavigateHead method before AddRow. We added this method to demonstrate the usage and utility of the HTTP HEAD method:
// (Try to) get the content length and last modified date
// of each job detail page, with a HEAD HTTP method
// Disabled (default) if _parseJobDetail false.
private bool _parseJobDetail = false;
public bool ParseJobDetail
{
    get { return _parseJobDetail; }
    set { _parseJobDetail = value; }
}
private string _contentLength = null;
private string _lastModified = null;
private void NavigateHead(string jobAddress)
{
    _contentLength = _lastModified = null;
    if (_parseJobDetail)
    {
        try
        {
            // some web servers do not allow the HEAD method!
            Navigate(jobAddress, "HEAD");
            _contentLength = Headers["Content-Length"];
            _lastModified = Headers["Last-Modified"];
        }
        catch { } // ignore failures; missing values stay null
    }
}
By default, this is turned off through the ParseJobDetail property, but uncomment the related line in the Main method, and your crawlers will also issue a HEAD web request for every job entry. It will take longer, of course, but the good news is that HEAD does not transfer any web page content, so it is faster and consumes less bandwidth than GET. Note that, for a HEAD method, Navigate does not modify the current Content and Position, and does not increment the Page number.
You usually issue a HEAD for a web page, instead of a GET, when you just want to know whether the page exists and to read some page information from its headers, such as the content length and the last time the page was updated.
Not all web servers respond successfully to HEAD. You can see here that Monster sends back only the Content-Length, ComputerJobs sends both headers, and Google and USBank simply do not support this kind of request.
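For readers who want to try a HEAD request outside the scraper, here is a minimal sketch that uses the framework's HttpClient instead of the article's Navigate method. The HeadProbe class and its method names are hypothetical. It treats a non-success status or a request failure as "no information", in the same spirit as NavigateHead.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper, independent of JobsScrapper.
public static class HeadProbe
{
    // Builds a request that asks for headers only (no page content).
    public static HttpRequestMessage BuildHeadRequest(string url)
    {
        return new HttpRequestMessage(HttpMethod.Head, url);
    }

    // Returns true if the server answered the HEAD request successfully.
    // Either output stays null when the server did not send that header.
    public static bool TryGetInfo(HttpClient client, string url,
        out string contentLength, out string lastModified)
    {
        contentLength = lastModified = null;
        try
        {
            using (HttpRequestMessage request = BuildHeadRequest(url))
            using (HttpResponseMessage response =
                client.SendAsync(request).GetAwaiter().GetResult())
            {
                // servers that reject HEAD usually answer with an error status
                if (!response.IsSuccessStatusCode)
                    return false;

                long? length = response.Content.Headers.ContentLength;
                DateTimeOffset? modified = response.Content.Headers.LastModified;
                if (length.HasValue)
                    contentLength = length.Value.ToString();
                if (modified.HasValue)
                    lastModified = modified.Value.ToString("r");
                return true;
            }
        }
        catch (HttpRequestException)
        {
            // network or protocol failure: report "no information"
            return false;
        }
    }
}
```

A caller would create one HttpClient, pass it in together with a job detail address, and check the two outputs for null separately, since a server may send only one of the headers.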
Finally, here is the content of our Main method, which instantiates a scraper, sets some configuration parameters, calls its Parse methods one by one to collect jobs in the DataTable, then saves all extracted job entries in a local text file:
static void Main()
{
    // instantiate and configure jobs scrapper
    JobsScrapper scrapper = new JobsScrapper();
    scrapper.Trace = true;
    scrapper.MaxPages = 3;
    scrapper.MaxRows = 60;
    //scrapper.ParseJobDetail = true;
    // parse job entries from multiple job sites
    scrapper.ParseMonster();
    scrapper.ParseComputerJobs();
    scrapper.ParseJobsAtGoogle();
    scrapper.ParseJobsAtUSBank();
    // aggregate and save all collected jobs
    scrapper.SaveAs("Jobs.txt");
}