Download Source Code: YahooMoviesScrapper.zip - 18.17KB
Data Capture
We already presented Extract and ExtractNext as the main functions for data capture. ExtractNext collects the data between two string delimiters and moves the buffer's pointer to immediately after the end delimiter. But what do you do with the data once you have it?
We'll keep it simple and provide only what is necessary. There are two kinds of data:
- Tabular data, structured in rows and columns, that we will collect in a System.Data.DataTable.
- Singular property values, that we may collect in a Dictionary.
Before you start crawling, if you are collecting tabular data, call SetColumns once with all the field names. You can also define a specific data type for each field. Then call AddRow for each row, passing the collected field values. These are the strings you collect with ExtractNext. You can check the current number of collected rows with the Rows property, and stop crawling once it reaches MaxRows. CanCaptureRow transparently checks this condition for you.
AddValue collects a single property value in the data dictionary. Property names must be unique: adding a value under an existing name simply replaces the previous value.
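To illustrate why a field's data type matters, here is a minimal sketch that works on a System.Data.DataTable directly. SetColumns as listed below adds columns by name only, and DataTable types such columns as string by default. How the crawler class itself assigns types is not shown in this section, so the class TypedColumnsSketch, its methods and the sample values are hypothetical, for illustration only.

```csharp
using System;
using System.Data;

// Hypothetical helper, not part of the article's crawler class.
public static class TypedColumnsSketch
{
    // Builds a small table with one string column and one integer column.
    public static DataTable BuildSample()
    {
        DataTable table = new DataTable();
        table.Columns.Add("Title", typeof(string));
        table.Columns.Add("Year", typeof(int)); // typed, so sorting is numeric

        // Captured values arrive as strings; convert them before adding the row.
        table.Rows.Add("Title B", int.Parse("1999"));
        table.Rows.Add("Title A", int.Parse("2005"));
        table.Rows.Add("Title C", int.Parse("987"));
        return table;
    }

    // Same approach as SortTable in the listing:
    // sort the default view, then copy it into a new table.
    public static DataTable Sort(DataTable table, string orderBy)
    {
        table.DefaultView.Sort = orderBy;
        return table.DefaultView.ToTable();
    }
}
```

Calling TypedColumnsSketch.Sort(TypedColumnsSketch.BuildSample(), "Year DESC") orders the rows by numeric year. With a string column the order would differ, because as text "987" compares greater than "2005".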
// name+value pairs for single-value captured results
protected Dictionary<string, object> _dictionary
    = new Dictionary<string, object>();
public Dictionary<string, object> Dictionary
    { get { return _dictionary; } }
// Collect a single-value result.
// An existing value with the same name is replaced.
public void AddValue(string name, object val)
{
    // The indexer adds the key or overwrites its current value
    _dictionary[name] = val;
    if (_trace)
        Debug.WriteLine("[TRACE] Added " + name + " = "
            + (val == null ? "(null)" : val.ToString()));
}
// Data table for row-based captured results
protected DataTable _table = new DataTable();
public DataTable Table { get { return _table; } }
// Sets names (column headers) for row-based captures.
// This must be called once, before any row-based data is collected.
public void SetColumns(params string[] fields)
{
    Reset(true);
    _table.Columns.Clear();
    if (fields != null)
        foreach (string field in fields)
            _table.Columns.Add(field);
}
// Add new captured data row to the table
public void AddRow(params object[] values)
{
    Debug.Assert(values != null
        && values.Length == _table.Columns.Count);
    _table.Rows.Add(values);
    _rows++;
    if (_trace)
    {
        Debug.Write(_rows + ". ");
        foreach (object val in values)
            Debug.WriteLine(val == null
                ? "(null)" : val.ToString());
        Debug.WriteLine("");
    }
}
// Delete any collected data (if clear=true)
// and prepare for new parsing sequence
public void Reset(bool clear)
{
    // delete all single-value and row-based results
    if (clear)
    {
        _dictionary.Clear();
        _table.Rows.Clear();
    }
    // prepare for new parsing sequence
    _page = 0;
    _rows = 0;
    _totalTime = _navigationTime = 0;
    _startTime = DateTime.Now;
}
// Number of rows collected since last Reset
protected int _rows = 0;
public int Rows { get { return _rows; } }
// Maximum number of rows to collect in next parsing sequence
protected int _maxRows = Int32.MaxValue;
public int MaxRows
{
    get { return _maxRows; }
    set { _maxRows = value; }
}
// Returns true if rows did not reach their upper limit
// and the row-find condition is true
protected bool CanCaptureRow(bool condition)
{
    return (_rows < _maxRows
        && condition);
}
// Sort data table ASC/DESC by one or more columns,
// depending on column's DataType
public void SortTable(string orderBy)
{
    _table.DefaultView.Sort = orderBy;
    _table = _table.DefaultView.ToTable();
}
// Returns row-based data as either an HTML TABLE
// or tab-delimited text
public virtual string GetData(bool html)
{
    StringBuilder sb = new StringBuilder();
    if (html)
        sb.AppendLine("<table>");
    foreach (DataRow row in _table.Rows)
    {
        string s = "";
        foreach (object val in row.ItemArray)
            s += (s.Length == 0 ? ""
                : (html ? "</td><td>" : "\t")) + val.ToString();
        sb.AppendLine(html ? "<tr><td>" + s + "</td></tr>" : s);
    }
    if (html)
        sb.AppendLine("</table>");
    return sb.ToString();
}
// Dump data table in either an HTML or tab-delimited text file
public virtual void SaveAs(string filename)
{
    bool html = (filename.ToLower().EndsWith(".htm")
        || filename.ToLower().EndsWith(".html"));
    using (StreamWriter writer = new StreamWriter(filename, false))
    {
        if (html)
            writer.WriteLine("<html><body>");
        writer.Write(GetData(html));
        if (html)
            writer.WriteLine("</body></html>");
    }
}
What you do later with the collected data is up to you. We give you direct access to the Table and Dictionary properties. GetData returns the tabular data either as simple text, with values separated by TABs, or as an HTML TABLE. SaveAs saves it to a text or HTML file. Both methods can be overridden for custom save formats.
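As an example of a custom save format, here is a minimal sketch that renders a DataTable as comma-separated text, which an overridden GetData could return. The class CsvExportSketch and its methods are hypothetical and not part of the article's source. The quoting rules are the common CSV convention, assumed here rather than taken from the article.

```csharp
using System;
using System.Data;
using System.Text;

// Hypothetical helper, not part of the article's crawler class.
public static class CsvExportSketch
{
    // Quotes a field if it contains a comma, a double quote or a line break;
    // embedded double quotes are doubled. Null and DBNull become empty fields.
    public static string Escape(object val)
    {
        string s = (val == null || val == DBNull.Value) ? "" : val.ToString();
        if (s.IndexOfAny(new char[] { ',', '"', '\r', '\n' }) >= 0)
            s = "\"" + s.Replace("\"", "\"\"") + "\"";
        return s;
    }

    // Returns a header line with the column names, then one line per data row.
    public static string ToCsv(DataTable table)
    {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < table.Columns.Count; c++)
        {
            if (c > 0) sb.Append(',');
            sb.Append(Escape(table.Columns[c].ColumnName));
        }
        sb.Append("\r\n");
        foreach (DataRow row in table.Rows)
        {
            for (int c = 0; c < table.Columns.Count; c++)
            {
                if (c > 0) sb.Append(',');
                sb.Append(Escape(row[c]));
            }
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

A derived crawler could return CsvExportSketch.ToCsv(Table) from its GetData override when the target file has a .csv extension, leaving the text and HTML paths untouched.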
Note that by combining our simple data scraping and data capture methods, you can perform very fast data extraction and collection without using regular expressions. Operations work on the current position in the string and do not require additional substring instantiations, except when data is collected into your table.
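The idea of scanning from a current position, and allocating a new string only for the captured value, can be sketched as follows. This is not the article's Extract or ExtractNext implementation, which was presented earlier. The class DelimiterScannerSketch and its ReadBetween method are hypothetical names, and the sketch assumes non-empty delimiters.

```csharp
using System;

// Hypothetical illustration of delimiter-based capture without regular expressions.
public class DelimiterScannerSketch
{
    private readonly string _text;
    private int _pos; // current position in the buffer

    public DelimiterScannerSketch(string text)
    {
        _text = text ?? "";
        _pos = 0;
    }

    public int Position { get { return _pos; } }

    // Returns the text between the next occurrence of 'start' and the
    // following 'end', and moves the position just past 'end'.
    // Returns null, leaving the position unchanged, if either is missing.
    public string ReadBetween(string start, string end)
    {
        int i = _text.IndexOf(start, _pos, StringComparison.Ordinal);
        if (i < 0)
            return null;
        i += start.Length;
        int j = _text.IndexOf(end, i, StringComparison.Ordinal);
        if (j < 0)
            return null;
        _pos = j + end.Length;
        // The captured value is the only new string created here
        return _text.Substring(i, j - i);
    }
}
```

Each call searches forward from the stored position with IndexOf, so repeated calls walk through the page once instead of rescanning it from the start.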
Tracing
As we said, a good crawler should have some simple built-in tracing capability and give you performance and timing measurements. We expose a very simple Trace property. When it is set to true, information is sent to the Debug window while crawling. You can easily change the Debug.WriteLine calls to redirect the output to the Console or to a local log file.
We also transparently measure the time spent on navigation and on total processing. As you'll see, web navigation is the most time-consuming operation and usually takes 80-90% of the total time. This is why it is not worth spending too much time improving the speed of your data scraping and parsing technique.
// If true, send trace messages to the Debug window while parsing
protected bool _trace = false;
public bool Trace
{
    get { return _trace; }
    set
    {
        _trace = value;
        if (_robots != null)
            _robots.Trace = value;
    }
}
// Total time of last parsing
protected int _totalTime = 0;
public int TotalTime { get { return _totalTime; } }
// Total parsing time, since instantiation
protected long _totalUsageTime = 0L;
public long TotalUsageTime { get { return _totalUsageTime; } }
// UpdateTotalTime must be called at the end of a parsing method
protected DateTime _startTime = DateTime.Now;
public int UpdateTotalTime()
{
    _totalTime += ElapsedTime(_startTime, DateTime.Now);
    _totalUsageTime += _totalTime;
    if (_trace)
    {
        int processing = _totalTime - _navigationTime;
        // Guard against division by zero when no time was measured
        int percent = (_totalTime > 0
            ? (int)((double)processing * 100 / _totalTime) : 0);
        // "#,##0" prints "0" for a zero value ("#,###" prints nothing)
        Debug.WriteLine("[TRACE] "
            + _navigationTime.ToString("#,##0")
            + "ms Navigation + " + processing.ToString("#,##0")
            + "ms (" + percent
            + "%) Processing = "
            + _totalTime.ToString("#,##0") + "ms");
    }
    return _totalTime;
}
// Total time spent on navigation, on last parsing
protected int _navigationTime = 0;
public int NavigationTime { get { return _navigationTime; } }
// Time to wait between web navigations
protected int _navigationWaitTime = 0;
public int NavigationWaitTime
{
    get { return _navigationWaitTime; }
    set { _navigationWaitTime = value; }
}
// Starting time of last navigation
protected DateTime _navigationLast = DateTime.MinValue;
public DateTime NavigationLast { get { return _navigationLast; } }
// Navigation time-out time
protected int _navigationTimeout = 10000;
public int NavigationTimeout
{
    get { return _navigationTimeout; }
    set { _navigationTimeout = value; }
}
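// Internal utility, to get number of ms
// elapsed between start and end times
protected int ElapsedTime(DateTime start, DateTime end)
{
    // TotalMilliseconds covers the whole interval; Milliseconds is only
    // the 0-999 component and wraps around every second
    return (int)(end - start).TotalMilliseconds;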
}
It is always respectful, toward a site you crawl, to wait a few milliseconds between subsequent navigations. Yes, this will slow down your crawling process, but this way you will not affect the availability of the web server for other users. Set the NavigationWaitTime property to 100 ms or more.
Many crawlers also use multiple threads, each dedicated to a page navigation. Some crawl dozens of pages simultaneously, with dozens of connections to the same server. Web site owners do not usually like this. In any case, we leave multithreaded crawling for another time.
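One way such a pause could be computed from the NavigationLast and NavigationWaitTime values is sketched below. How the article's navigation code actually applies the wait is not shown in this section, so the class PoliteWaitSketch and its methods are hypothetical and only illustrate the arithmetic.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the article's crawler class.
public static class PoliteWaitSketch
{
    // Milliseconds still to wait before the next navigation may start.
    // lastNavigation == DateTime.MinValue means no navigation happened yet.
    public static int RemainingWait(DateTime lastNavigation, DateTime now,
        int waitTimeMs)
    {
        if (waitTimeMs <= 0 || lastNavigation == DateTime.MinValue)
            return 0;
        double elapsed = (now - lastNavigation).TotalMilliseconds;
        if (elapsed < 0)
            return waitTimeMs; // clock moved backwards: wait the full time
        if (elapsed >= waitTimeMs)
            return 0;
        return (int)Math.Ceiling(waitTimeMs - elapsed);
    }

    // Blocks the calling thread for the remaining wait time, if any.
    public static void WaitBeforeNavigation(DateTime lastNavigation,
        int waitTimeMs)
    {
        int remaining = RemainingWait(lastNavigation, DateTime.Now, waitTimeMs);
        if (remaining > 0)
            Thread.Sleep(remaining);
    }
}
```

Only the part of the wait time not already spent on processing the previous page is slept, so a slow parsing step does not add an extra delay on top of it.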