Some History, or why all C++ SAX implementations are incompatible
SAX is a standard interface for event-based XML parsing, developed
collaboratively by the members of the XML-DEV mailing list. SAX 1.0
was released on Monday 11 May 1998, and is free for both commercial
and non-commercial use.
- But...
The unfortunate gotcha is that the preceding paragraph applies to
the Java implementation of SAX. There is no standard C++ SAX
implementation. There are several besides this one, including the
Apache Project's Xerces (originally based on IBM's XML4C++) and SEGV's. I
make no claim for the code here to be the "one true" C++ implementation
or to have any particular advantage over any other implementation. It
all depends on what you're after really.
- On the other hand ...
There was a move in progress on the xml-dev mailing list to
bring the various C++ implementations (apparently there are knocking on
a dozen) into line, perhaps based on the SAX2 specification. This
atrophied, and I doubt it would have been possible to reach agreement anyway.
- So why wasn't it sorted out in the first place?
Well, ignorance initially, I guess. When I started out I didn't find
any other implementations. Then when xml4c was published it turned
out to be pretty different from what I'd done, and then SEGV got in
touch and his implementation was different again.
I'm not completely sure why but I suspect it's because, to quote Larry
Wall, there's more than one way to do it. (Ironically, there is only
one SAX implementation in Perl, probably because Larry Wall is the
coauthor of it.) Take, for example, the Java function
interface DocumentHandler {
    void startElement(String name, AttributeList attr);
    ...
}
How should this be translated into C++? Leaving aside the fact that
Java strings are Unicode and C++ strings aren't, is it
void startElement(const std::string& name, const AttributeList& attr);
or
void startElement(const std::string* name, const AttributeList* attr);
or
void startElement(const char* name, AttributeList& attr);
or something else entirely?
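To make the choice a bit more concrete, here is a minimal sketch of the first, reference-based option. It is an illustration only, not the code provided here: AttributeList is stood in for by a std::map, ElementRecorder and fakeParse are made-up names, and it assumes a C++11 compiler.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for SAX's AttributeList: attribute name -> value.
using AttributeList = std::map<std::string, std::string>;

// Reference-based variant: neither argument can be null, and the handler
// cannot modify what the parser passes in.
class DocumentHandler {
public:
    virtual ~DocumentHandler() = default;
    virtual void startElement(const std::string& name,
                              const AttributeList& attr) = 0;
};

// Example handler (hypothetical) that records each element name together
// with the number of attributes it carried.
class ElementRecorder : public DocumentHandler {
public:
    void startElement(const std::string& name,
                      const AttributeList& attr) override {
        seen.push_back(name + ":" + std::to_string(attr.size()));
    }
    std::vector<std::string> seen;
};

// Pretends to be a parser reporting <book id="42" lang="en">.
inline void fakeParse(DocumentHandler& handler) {
    AttributeList attrs{{"id", "42"}, {"lang", "en"}};
    handler.startElement("book", attrs);
}
```

With the pointer-based signatures, every handler would also have to decide what a null name or a null attribute list means. The const char* version avoids constructing a string but leaves ownership and encoding of the buffer up to convention.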
- Where do you want to go?
I think it depends a lot on what your priorities are. One of the Apache
Project's stated aims for Xerces is portability. They want it to be
compilable on as many platforms as possible, and to achieve this they
have a number of coding guidelines, similar to those used by Mozilla,
which essentially exclude anything but the C++ subset supported by even
the most braindead compiler still in use today. It means things like
no namespaces,
minimal use of templates, no RTTI, so no dynamic_cast (although I
don't think that's actually needed for SAX).
My goals when I first ported SAX to C++, which was a couple of months
before IBM published XML4C++, were slightly different. I wanted to
do the port quickly and
make it easy to use, so I chose to prefer references over pointers
and the standard library over rolling-my-own. The last thing the
world of C++ really needs is another string class. I don't say this
to run down Xerces, but to state my position. Portability is
a noble goal, and Xerces certainly seems to be succeeding. However,
I spent nigh on
a year and a half working under the same kind of restrictions as the
Xerces guidelines, and it was one of the most deeply frustrating times
of my life. If my function returns a std::string, people know what
they're getting; if it returns a JezString, they don't.
- So there you have it, more or less
Please feel free to use and abuse the code provided here. I hope
you'll find it useful, and I'm keen to hear how you get on with
it. On the other hand, be aware that there are other C++ SAX
implementations out there which may be more suitable for you.