Download Source Code: PdfFileParser.zip - 205.25KB
PDF File Parser
Cross-reference tables are skipped for now, because applications use them only to reach objects in random order without loading the whole file. They may not be needed when saving the file either, since Acrobat will rebuild them on the fly anyway.
Each PdfSection implements and parses its own data blocks. In fact, this is the class where most parsing functionality is implemented.
GetPdfObject is the top-level function called to import the next PDF object representation from the file, including all aggregate objects. For the body list, the function is called repeatedly until the cross-reference table, identified by a line containing xref, is found. For the trailer, a single call is enough, and it is expected to return a PdfDictionary object.
GetPdfObject makes recursive calls for composite objects. During this stage only, surrogate PdfObject instances, for which IsValid returns false, are also created for the closing ], >> and endobj delimiters, and for streams. They are never added to the container object that made the recursive call, so the final loaded PDF hierarchy holds only valid objects:
/// <summary>
/// Main private parsing method, to read and translate
/// next text token, from file, into a PDF object
/// </summary>
/// <returns></returns>
private PdfObject GetPdfObject()
{
// get next token
string token = NextToken();
if (token == null)
return null;
Debug.Assert(token.Length > 0);
// Null PDF object
if (token == "null")
return new PdfObject(null);
// Boolean PDF object
if (token == "true" || token == "false")
return new PdfObject(token == "true");
// PDF dictionary object --> sequentially load
// name-value pairs, until >> closing tag
if (token == "<<")
{
PdfDictionary dict = new PdfDictionary();
for (PdfObject child = GetPdfObject();
!child.Text.StartsWith(">>");
child = GetPdfObject())
{
Debug.Assert(child.IsName);
PdfObject val = GetPdfObject();
// the key was already checked above; validate the value here
Debug.Assert(val.IsValid);
dict.Dictionary.Add(child.ToString(), val);
}
return dict;
}
// Create valid PDF object for % comment line,
// named value, (..) or <..> string value.
// Create temporary (invalid) object
// for closing ], >>, endobj tag,
// or stream..endstream declaration.
// Invalid object will be automatically discarded
// by previous GetPdfObject call.
if (token.StartsWith("%") || token.StartsWith("/")
|| (token.StartsWith("(") && token.EndsWith(")"))
|| (token.StartsWith("<") && token.EndsWith(">"))
|| token == "]" || token == ">>"
|| token == "endobj" || token.StartsWith("stream "))
return new PdfObject(token);
// PDF array object --> sequentially load
// PDF object elements, until ] closing tag
if (token.StartsWith("["))
{
PdfArray array = new PdfArray();
for (PdfObject child = GetPdfObject();
!child.Text.StartsWith("]");
child = GetPdfObject())
{
Debug.Assert(child.IsValid);
array.List.Add(child);
}
return array;
}
// PDF Indirect Reference
if (token.EndsWith(" R"))
return new PdfIndirectReference(token);
// PDF Indirect Object declaration --> sequentially load
// PDF object elements, until endobj closing tag
if (token.EndsWith(" obj"))
{
PdfIndirectObject ind = new PdfIndirectObject(token);
for (PdfObject child = GetPdfObject();
!child.Text.StartsWith("endobj");
child = GetPdfObject())
{
// if stream ... endstream declaration,
// attach Stream to previous PDF object,
// which MUST be a PDF dictionary!
if (child.Text.StartsWith("stream "))
{
int last = ind.List.Count - 1;
Debug.Assert(ind.List[last] is PdfDictionary);
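// skip the 7-character "stream " prefix and the trailing '\n';
// Math.Max guards against an empty stream body
(ind.List[last] as PdfDictionary).Stream
= child.Text.Substring(7, Math.Max(0, child.Text.Length - 8));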
}
else
{
Debug.Assert(child.IsValid);
ind.List.Add(child);
}
}
return ind;
}
// if none of previous --> this MUST be
// an integer or real number value
object number = GetNumber(token);
Debug.Assert(number is int || number is double);
return new PdfObject(number);
}
Streams are always declared after a dictionary definition, at the top level of a PdfIndirectObject declaration. Once a stream is detected, its content is attached to the last object already added to the list of the current indirect object, which must be a PdfDictionary.
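The stream token carries a 7-character "stream " prefix and ends with the newline appended after the last line of the body, which is why the payload starts at offset 7 and drops the final character. A minimal sketch of that extraction, assuming the token layout described in this article, could look like this. StreamTokenDemo and ExtractPayload are hypothetical names, not part of the downloadable source, and the sketch also accepts an empty stream body:

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class StreamTokenDemo
{
    private const string Prefix = "stream ";

    // Returns the stream body without the "stream " prefix
    // and without the single trailing '\n', if there is one.
    public static string ExtractPayload(string token)
    {
        if (token == null || !token.StartsWith(Prefix, StringComparison.Ordinal))
            throw new ArgumentException("Not a stream token.", "token");

        string payload = token.Substring(Prefix.Length);
        if (payload.EndsWith("\n", StringComparison.Ordinal))
            payload = payload.Substring(0, payload.Length - 1);
        return payload;
    }
}
```

For a token built from the two lines A and B, that is "stream A\nB\n", the helper returns "A\nB".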
Tokens are extracted from the current text line read from the file and pushed onto a local stack. This approach avoids loading the whole file into memory: because PDF files are often large, it saves a second large buffer. Each line is split into tokens by repeated ParseLine calls, driven by NextToken. When the current line is exhausted, NextToken reads another line from the file. When an xref cross-reference table or the startxref marker at the end of the trailer is detected, a null item is pushed onto the stack, which flags the end of processing:
private string NextToken()
{
while (_tokens.Count == 0)
{
// parsing ends when xref/startxref
// reference table declaration found
string line;
if ((line = _reader.ReadLine()) == null
|| line.Trim() == "xref"
|| line.Trim() == "startxref")
PushToken(null);
else
while (line != null && line.Length > 0)
line = ParseLine(line);
}
return DequeueToken();
}
PushToken, PopToken and DequeueToken are simple utility functions that manage the stack. It is really a FIFO queue, but ParseLine also uses it internally as a LIFO stack: when an obj or R "operator" is found, the object and generation numbers have already been pushed as Number values, so they are popped back and merged with the operator into a single token.
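The three helpers are not listed in this article. One way to get both the FIFO and the LIFO behavior is a plain list that is appended at the tail and read from either end. The sketch below is only an assumption about how they might work: the TokenBuffer class name is hypothetical, and so is the choice to drop empty strings while keeping null as the end-of-parsing marker.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the token helpers described above.
public class TokenBuffer
{
    private readonly List<string> _tokens = new List<string>();

    public int Count { get { return _tokens.Count; } }

    // Append at the tail. Assumption: empty strings are ignored,
    // null is stored as the end-of-parsing marker.
    public void PushToken(string token)
    {
        if (token != null && token.Length == 0)
            return;
        _tokens.Add(token);
    }

    // LIFO access: remove and return the most recently pushed token.
    public string PopToken()
    {
        if (_tokens.Count == 0)
            throw new InvalidOperationException("No tokens to pop.");
        int last = _tokens.Count - 1;
        string token = _tokens[last];
        _tokens.RemoveAt(last);
        return token;
    }

    // FIFO access: remove and return the oldest token.
    public string DequeueToken()
    {
        if (_tokens.Count == 0)
            throw new InvalidOperationException("No tokens to dequeue.");
        string token = _tokens[0];
        _tokens.RemoveAt(0);
        return token;
    }
}
```

With this buffer, an indirect reference such as 12 0 R is handled by pushing 12 and 0, popping both when R arrives, and pushing the merged text "12 0 R" back, so that DequeueToken later returns it as one token.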
ParseLine splits a line character by character. For (...) or <...> string values and for stream blocks, which may span several lines, it reads further lines directly from the file:
/// <summary>
/// Parse a line from the PDF file
/// </summary>
/// <param name="line"></param>
/// <returns></returns>
private string ParseLine(string line)
{
string token = line;
// for each character in the line
for (int i = 0; i <= line.Length; i++)
{
if (i == line.Length)
{
token = line;
line = null;
break;
}
if (line[i] == ' ')
{
token = line.Substring(0, i).Trim();
line = line.Substring(i + 1);
break;
}
// % comment, till end-of-line?
if (line[i] == '%')
{
PushToken(line.Substring(0, i).Trim());
PushToken(line.Substring(i));
return null;
}
// [..] array decl?
if (line[i] == '[' || line[i] == ']')
{
PushToken(line.Substring(0, i).Trim());
PushToken(line[i].ToString());
return line.Substring(i + 1);
}
// <<..>> dictionary decl?
if ((line[i] == '<' || line[i] == '>')
&& i < (line.Length - 1) && line[i + 1] == line[i])
{
PushToken(line.Substring(0, i));
PushToken(line[i].ToString() + line[i]);
return line.Substring(i + 2);
}
// <..> binary string value?
if (line[i] == '<')
{
PushToken(line.Substring(0, i));
StringBuilder sb = new StringBuilder();
for (line = line.Substring(i);
(i = line.IndexOf('>')) < 0;
line = _reader.ReadLine().Trim())
sb.Append(line);
token = sb.ToString() + line.Substring(0, i + 1);
PushToken(token);
return line.Substring(i + 1);
}
// (..) char string value?
if (line[i] == '(')
{
PushToken(line.Substring(0, i));
StringBuilder sb = new StringBuilder();
line = line.Substring(i + 1);
int open = 0;
while (true)
{
bool slash = false;
for (i = 0; i < line.Length; i++)
{
if (line[i] == '(' && !slash)
open++;
else if (line[i] == ')'
&& !slash && (--open) < 0)
break;
// process special \c characters,
// including \\
else if (line[i] == '\\')
slash = !slash;
else if (line[i] != '\\' && slash)
slash = false;
}
if (open < 0)
break;
sb.Append(line + '\n');
line = _reader.ReadLine().Trim();
}
token = '(' + sb.ToString() + line.Substring(0, i + 1);
PushToken(token);
return line.Substring(i + 1);
}
}
// stream..endstream decl?
if (token == "stream")
{
StringBuilder sb = new StringBuilder();
while ((line = _reader.ReadLine().Trim()) != "endstream")
sb.Append(line + '\n');
PushToken("stream " + sb.ToString());
line = null;
}
// indirect Object/Reference decl?
else if (token == "obj" || token == "R")
{
string generationNumber = PopToken();
string objectNumber = PopToken();
PushToken(objectNumber + " "
+ generationNumber + " " + token);
}
else
PushToken(token);
return line;
}
We also provide a custom implementation of GetNumber, which converts a string to either an integer or a double value. It does not rely on exceptions to reject invalid input, and it sticks to the simplified number syntax defined in the PDF documentation:
/// <summary>
/// Converts text to either an integer or real number,
/// according to PDF's spec.
/// </summary>
/// <param name="s"></param>
/// <returns></returns>
private object GetNumber(string s)
{
bool dot = false;
bool digits = false;
for (int i = 0; i < s.Length; i++)
{
if (s[i] >= '0' && s[i] <= '9')
{
if (!digits)
digits = true;
}
else if (s[i] == '.')
{
if (dot)
return null;
dot = true;
}
else if (s[i] == '+' || s[i] == '-')
{
if (i > 0)
return null;
}
else
return null;
}
if (!digits)
return null;
if (dot)
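// parse with the invariant culture, so "." is always the decimal separator
return Convert.ToDouble(s, System.Globalization.CultureInfo.InvariantCulture);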
return Convert.ToInt32(s);
}
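Convert.ToInt32 still throws an OverflowException for integers outside the Int32 range. A variant built on TryParse with the invariant culture avoids that and does not depend on the machine's regional settings. The following is a sketch under those assumptions, not the code shipped with the article: PdfNumber and Parse are hypothetical names, and falling back to double for oversized integers is a choice made here for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical variant of GetNumber, for illustration only.
public static class PdfNumber
{
    // Returns an int, a double, or null when the text is not a PDF number.
    public static object Parse(string s)
    {
        if (string.IsNullOrEmpty(s))
            return null;

        bool dot = false, digits = false;
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c >= '0' && c <= '9')
                digits = true;              // at least one digit is required
            else if (c == '.' && !dot)
                dot = true;                 // a single decimal point is allowed
            else if ((c == '+' || c == '-') && i == 0)
                continue;                   // a sign is allowed only up front
            else
                return null;
        }
        if (!digits)
            return null;

        if (!dot)
        {
            int n;
            if (int.TryParse(s, NumberStyles.AllowLeadingSign,
                             CultureInfo.InvariantCulture, out n))
                return n;
            // too large for Int32: fall through and try a double
        }

        double d;
        if (double.TryParse(s,
                            NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
                            CultureInfo.InvariantCulture, out d))
            return d;
        return null;
    }
}
```

Under this sketch, "42" yields an int, "-.5" yields a double, and "1.2.3", "obj" or a lone "+" yield null.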
Visual PDF File Parser
PdfDocument, PdfSection, and PdfObject with its derived classes all override the default ToString implementation. They also provide a new ToString(deep) function, which exposes the content of a PDF object either as a simple declaration or in depth, recursively including all its composite objects. ToString(), equivalent to ToString(false), returns a simple object declaration or its inner atomic value.
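The shallow/deep pairing can be pictured with two small stand-in classes. This is a sketch of the pattern only: SketchObject, SketchArray and the Items field are hypothetical, and the output format is invented for the example rather than taken from the real PdfObject classes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical atomic object: shallow and deep dumps are identical.
public class SketchObject
{
    private readonly object _value;

    public SketchObject(object value) { _value = value; }

    // ToString() is defined as the shallow form, ToString(false).
    public override string ToString() { return ToString(false); }

    public virtual string ToString(bool deep)
    {
        return _value == null ? "null" : _value.ToString();
    }
}

// Hypothetical composite object: the deep dump recurses into children.
public class SketchArray : SketchObject
{
    public readonly List<SketchObject> Items = new List<SketchObject>();

    public SketchArray() : base(null) { }

    public override string ToString(bool deep)
    {
        if (!deep)
            return "[" + Items.Count + " items]";

        StringBuilder sb = new StringBuilder("[");
        foreach (SketchObject child in Items)
            sb.Append(' ').Append(child.ToString(true));
        return sb.Append(" ]").ToString();
    }
}
```

An array holding 1, /Name and a nested array holding 2 prints as "[3 items]" through ToString() and as "[ 1 /Name [ 2 ] ]" through ToString(true).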
These calls can be immediately used to dump the loaded content structure of an object, section or the whole document, to the Console or Debug window. That's what the simple PdfFileParser console application does.
A better alternative view of the loaded PDF data structure is offered by the VisualPdfFileParser Windows application, which displays all the elements described here hierarchically, in collapsible nodes. Nodes take longer to expand for very large PDF files (dozens of MB).
Top folder nodes relate to the upper file layer, while object nodes are attached to PdfObject or derived instances. Nodes use ToString() to display element declarations as text. ToString(true) is called upon a node selection, to dump the PDF object content, in depth, in the rich text box control:
/// <summary>
/// When node selected, attached PDF object's content
/// is dumped in the text view
/// </summary>
/// <param name="obj"></param>
private void ShowSelection(object obj)
{
// no dump for lists of PDF sections
if (obj == null || obj is List<PdfSection>)
txtText.Text = "";
// dump full PDF document content
else if (obj is PdfDocument)
txtText.Text = (obj as PdfDocument).ToString(true);
// dump full PDF document section
else if (obj is PdfSection)
txtText.Text = (obj as PdfSection).ToString(true);
// generic dump for any kind of PDF object
else if (obj is PdfObject)
txtText.Text = (obj as PdfObject).ToString(true);
// sequential dump for lists of PDF objects
else if (obj is List<PdfObject>)
{
StringBuilder sb = new StringBuilder();
foreach (PdfObject o in (obj as List<PdfObject>))
sb.Append(o.ToString(true));
txtText.Text = sb.ToString();
}
// default dump
else
txtText.Text = obj.ToString();
}
Conclusions
This is a possible minimal implementation of a PDF object model, for primitive PDF data types. The ASCII PDF file is sequentially parsed and fully loaded, as PdfSection and PdfObject instances, in a hierarchical structure.
In future articles and projects, we will enhance this low-level data model and add new functionality. Basically, we will identify standard dictionaries, object declarations and arrays, as either tables of contents, page structures or other standard elements of a PDF document. We will decode streams that represent images. We will implement new features, to extract data from PDF documents and save it into separate files.
April 03, 2008 at 06:00 AM
does anyone know about such a thing being done in PHP ?