C# Syntax Highlighter
Apr 15


Automation, Redundancy, Partial Data Exposure

With so many applications, tools and APIs (Application Programming Interfaces) available today for almost any kind of information or document, you may wonder why you should still care about implementing your own data extraction methods.

One frequent reason is automation. Most user-friendly tools with rich user interfaces do not let you perform batch operations on multiple selected items of the same kind. It's easy to manually select, copy and paste into your own catalog the first ten hyperlinks returned by a search results page, but when the same operation must be repeated frequently, for a large number of searches or pages, you'd better write some code to automate the process.
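As a minimal sketch of that kind of automation, the Python snippet below collects the first ten hyperlinks from each of several result pages using only the standard library. The `LinkCollector` and `first_links` names and the sample pages are assumptions made for illustration; real pages would be read from files or downloaded.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, in document order."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore non-anchor tags and stop collecting once the limit is reached.
        if tag != "a" or len(self.links) >= self.limit:
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


def first_links(html_text, limit=10):
    """Return up to `limit` hyperlink targets found in html_text."""
    collector = LinkCollector(limit)
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical result pages, one per search (sample data, not real results).
pages = {
    "query one": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "query two": '<p>No links here</p><a name="anchor">x</a><a href="/c">C</a>',
}

# The same operation, repeated for every page without manual copy and paste.
catalog = {query: first_links(html, limit=10) for query, html in pages.items()}
```

Once the operation is a function, repeating it for hundreds of pages is just a longer `pages` dictionary.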

Another reason is incomplete data availability. Many .NET Framework classes are essentially wrappers around Win32 functions, called through P/Invoke, but it's well known that a lot of functionality fully exposed at the operating system level has not (yet?) been implemented in .NET. You'll find on the Internet many specific implementations that call Win32 functions directly to fill this gap.

Windows comes with plenty of system applications and dialogs that expose information about files, documents, processes and applications in a very user-friendly manner. This is good. However, most UI controls or dialogs do not expose the full information available for those objects. And redundancy may become as bad as the lack of information: instead of one fully-featured, extensible component for each kind of object, we have so many controls, each showing the same parts of some data in a different presentation format, that at some point you don't know which one to choose.

Compound Document Part Extraction

Crop image and zoom

One of the most common data extraction situations is when required data is part of a more complex compound document.

When data to be extracted can be clearly delimited within the document, we can simply cut that area and save it separately. Examples of such common operations:

  • Extract the still valid portion of a damaged document, in recovery operations.
  • Crop parts of an image.
  • Separately extract only the end-user data or only the tag metadata from an HTML or XML document.
  • Extract only elements of a certain kind, such as images from a web page or PDF file.
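The last two operations in the list above can be sketched in a few lines of Python with the standard `html.parser` module. This is only an illustration under simple assumptions: the `TagTextSplitter` class, the `split_html` helper and the sample markup are made up for the example, and text inside `<script>` or `<style>` elements would also be reported as data.

```python
from html.parser import HTMLParser


class TagTextSplitter(HTMLParser):
    """Separate an HTML document into markup (tags) and end-user text."""

    def __init__(self):
        super().__init__()
        self.tags = []  # (tag name, attribute dict) for each start tag
        self.text = []  # non-empty text runs found between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def split_html(html_text):
    """Return (tags, text) extracted from html_text."""
    splitter = TagTextSplitter()
    splitter.feed(html_text)
    splitter.close()
    return splitter.tags, splitter.text


# Sample markup, assumed for illustration.
sample = (
    '<html><body><h1 class="t">Title</h1>'
    '<p>Hello <b>world</b></p><img src="logo.png"></body></html>'
)
tags, text = split_html(sample)

# Elements of a certain kind: here, the sources of all images on the page.
image_sources = [attrs["src"] for name, attrs in tags
                 if name == "img" and "src" in attrs]
```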
Icon Images from an .ICO File

Some examples of well-known compound documents and their structure:

  • .ICO files are actually groups of several icon images, each with its own size and color palette.
  • PE (Portable Executable) files (i.e. Windows .EXE and .DLL files) can embed resources: images, strings, the contents of other files, audio or other binary multimedia content.
  • Web pages have HTML text content, which at a low level can be seen as a combination of tags and end-user data.
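To make the first example concrete, here is a hedged Python sketch that reads the directory at the start of an .ICO file and cuts out the bytes of each icon image. It assumes the commonly documented layout (a 6-byte header followed by one 16-byte entry per image); the `list_icon_images` name and the tiny synthetic file built at the end are illustrative only, and the payload bytes are placeholders rather than real image data.

```python
import struct


def list_icon_images(data):
    """Return one dict per image listed in the directory of an .ICO file."""
    # Header: reserved (must be 0), resource type (1 = icon), image count.
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:
        raise ValueError("not an .ICO file")
    images = []
    for index in range(count):
        # Each 16-byte entry describes one image and where its bytes live.
        (width, height, colors, _reserved, _planes, bit_count,
         size, offset) = struct.unpack_from("<BBBBHHII", data, 6 + 16 * index)
        images.append({
            "width": width or 256,    # a stored 0 means 256 pixels
            "height": height or 256,
            "colors": colors,         # palette size, 0 when no palette is used
            "bit_count": bit_count,
            "data": data[offset:offset + size],
        })
    return images


# Synthetic two-image file with placeholder payloads (not real bitmaps).
header = struct.pack("<HHH", 0, 1, 2)
entry1 = struct.pack("<BBBBHHII", 16, 16, 16, 0, 1, 4, 3, 38)
entry2 = struct.pack("<BBBBHHII", 0, 0, 0, 0, 1, 32, 5, 41)
sample_ico = header + entry1 + entry2 + b"abc" + b"defgh"

images = list_icon_images(sample_ico)
```

With a real file, the same function could be fed the bytes returned by reading the file in binary mode, and each `data` slice saved separately.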

"Data" doesn't mean only end-user data (credit card numbers, people's names, passwords), but also what is commonly known as metadata ("data about data"), executable code and the low-level storage delimiters of the data structures.

The structure of a compound document can be described by a standard File Format, which can be proprietary (i.e. defined, owned and controlled by a vendor), and/or open (i.e. publicly available). In some cases, the format is known only by the vendor, but many vendors expose components or functions to access and load different parts of the document.

Databases contain Data and Metadata

For instance, most commercial Database Management Systems do not expose the file format of their database files. A database system can be seen as a compound document, or as a collection of such documents with a complex structure. To access data, you usually connect to the database (actually to the database server or engine) through a data access API, and run SQL queries against its tables or execute stored procedures. Database metadata is accessed through catalog tables or built-in functions. You normally don't access database data at the file level, even when it's relatively easy to determine which .MDB (in Microsoft Access) or .MDF (in SQL Server) files hold your data.
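The same access pattern can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it ships with the standard library, not because the text refers to it; the `customers` table and its rows are made-up sample data. Data comes back from a SQL query, while metadata comes from a catalog table (`sqlite_master`) and a built-in pragma (`table_info`), never from reading the database file directly.

```python
import sqlite3

# An in-memory database stands in for a real server connection.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
connection.executemany(
    "INSERT INTO customers (name) VALUES (?)", [("Ada",), ("Grace",)]
)

# Data: a plain SQL query against a table.
names = [row[0] for row in
         connection.execute("SELECT name FROM customers ORDER BY id")]

# Metadata: the catalog table lists the tables in the database.
tables = [row[0] for row in
          connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

# Metadata: a built-in pragma describes the columns of one table.
# Each row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in
           connection.execute("PRAGMA table_info(customers)")]

connection.close()
```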

The same holds for other kinds of documents. For instance, Microsoft Office documents (.DOC for Word, .XLS for Excel, .PPT for PowerPoint) have complex structures and file formats that Microsoft did not historically make publicly available. Can you build tools that extract data from such documents with custom code rather than with the Office applications? It depends on what executable components the vendor has made available. Common Office properties are stored in property bags of sorts, which can be accessed through standard IStorage-like COM interfaces.

Software developers usually like to know where their data comes from and where it is actually stored. This makes perfect sense if the data is stored on your local computer or local network.

The first question you should ask yourself is what kind of data it is: static or dynamic? And is it actual stored data or system data? For static data, it makes perfect sense to look for a file on your computer. Unless you are dealing with system data, it can be nowhere else. For data stored in a file, open the file with a simple editor and look at it. You can try Notepad first, but if the file contains binary data, try a binary file viewer, like the one we present in our magazine. Dynamic and/or system data is either stored somewhere else by the operating system, in an inaccessible place, or it is never stored and is returned dynamically, at run-time only. Extract this kind of data with a function call.
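For a quick look at binary content when no viewer is at hand, a few lines of Python are enough. The sketch below is not the viewer mentioned above, just a minimal stand-in; the `hex_dump` name and the sample bytes are assumptions for illustration.

```python
def hex_dump(data, width=16):
    """Format bytes as lines of offset, hex values and printable characters."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "."
                            for byte in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(hex_width)}  {text_part}")
    return "\n".join(lines)


# Sample bytes assumed for illustration; with a real file you would pass
# the result of reading it in binary mode, for example its first 64 bytes.
sample = b"MZ\x90\x00" + b"Hello, data!" + bytes(range(4))
print(hex_dump(sample))
```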

Possible resources embedded in a DLL

The dynamic view of data has been extended to static data as well. We have somehow started to lose the natural perception that documents stored in files can have their data extracted from the raw bytes, by opening and reading the file. Instead, we use components that associate parts of a document with specific storage interfaces, and load data through method calls. This becomes even more general over a network or the web, where you often have no idea (and do not need to know) HOW the returned data is actually stored remotely. What you need to know is whether there is an available METHOD (this can be an HTTP GET or a web service call) that you can call to return your data.

When you deal with documents saved in local files whose format is too complex, unavailable or difficult to handle, check whether there are components available that read and load that data dynamically.
