SAX And DOM Overview
There are two main types of XML APIs: tree-based and event-driven. Tree-based APIs map an XML document into an internal tree structure, then allow an application to navigate that tree. The Document Object Model (DOM) working group at the World-Wide Web Consortium (W3C) maintains a recommended tree-based API for XML and HTML documents, and there are many similar APIs from other sources. Event-driven APIs, on the other hand, report parsing events (such as the start and end of elements) directly to the application through callbacks, and generally do not build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface. SAX is the best-known example of this type of API.
Tree-based APIs are useful in a wide range of applications, particularly if the application is document-centric rather than data-centric. However, trees can put a hefty strain on system resources, especially if the document is large, you're only interested in part of it, or a tree representation is not a good fit for your problem. For instance, it's wasteful in terms of both coding effort and runtime efficiency to construct a tree model only to immediately map it to some other representation.
In these cases, an event-driven API like SAX provides simpler, lower-level access to the XML document. With it you can parse very large documents that may not fit into memory, or map a document directly to your own object model.
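As a rough illustration of mapping a document straight to your own object model, here is a minimal sketch using Python's standard xml.sax module. The Para class, the ParaCollector handler, and the sample input (including its note element) are hypothetical names invented for this example; they are not part of SAX itself.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical application object; not part of any XML API."""
    text: str


class ParaCollector(xml.sax.ContentHandler):
    """Builds Para objects directly from parse events, with no tree."""

    def __init__(self):
        super().__init__()
        self.paras = []
        self._buffer = None  # None while we are outside a <para> element

    def startElement(self, name, attrs):
        if name == "para":
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "para":
            self.paras.append(Para("".join(self._buffer)))
            self._buffer = None


# Assumed sample input; the <note> element is simply ignored.
SAMPLE = (b"<document><para>first</para>"
          b"<note>skipped</note><para>second</para></document>")

collector = ParaCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.paras)
```

The handler keeps only the text of the paragraph currently being read, so memory use depends on the objects you choose to retain rather than on the size of the document.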
SAX was originally developed by members of the XML-DEV mailing list as a Java API during the early part of 1998. There has subsequently been one major revision, to incorporate namespace processing. Although SAX was developed as, and is defined by, its Java implementation, it has been ported to Python, Perl, a COM interface, Eiffel and numerous other languages. Most parsers, in whatever language, now include a SAX or SAX-like interface.
To get a feel for how SAX works, consider the following little document:
<?xml version="1.0"?>
<document>
<para>This is a very simple example</para>
</document>
A parser implementing SAX breaks down the document into a linear sequence of events:
startDocument
startElement: document
startElement: para
characters: This is a very simple example
endElement: para
endElement: document
endDocument
A SAX client application (i.e. your application) handles these events in much the same way as it would handle, say, mouse events from a GUI system. There's no need to keep the entire document in memory if you don't need it, and you're free to ignore anything you're not interested in.
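The event sequence listed above can be reproduced with a short handler. The following is a sketch using Python's standard xml.sax module rather than the defining Java API; the EventLogger class is a hypothetical helper written for this example. It drops whitespace-only character data (the line breaks between elements) so that the output matches the simplified listing.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical helper that records each SAX callback as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement: " + name)

    def endElement(self, name):
        self.events.append("endElement: " + name)

    def characters(self, content):
        # Skip whitespace-only text such as the newlines between elements.
        if content.strip():
            self.events.append("characters: " + content)


DOCUMENT = b"""<?xml version="1.0"?>
<document>
<para>This is a very simple example</para>
</document>
"""

logger = EventLogger()
xml.sax.parseString(DOCUMENT, logger)
print("\n".join(logger.events))
```

In general a parser is free to deliver one run of text through several characters calls, so a handler that needs the complete text of an element should accumulate the pieces until the matching endElement arrives.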
