Monday 30 April 2007 Article: Adventures in Autoconfiscation

Article originally published in CVu 18-6. CVu is the journal of ACCU.

Arabica is an XML toolkit written in C++. It provides a SAX interface for streaming XML parsing, a DOM interface for in-memory XML processing, and an XPath engine for easy DOM access. In the next release or two, it will add an XSLT processor. Arabica supports std::string, std::wstring or pretty much any other crazy string class. The code itself is good, honest, standard C++, which my experience shows is highly portable. I've built Arabica on Windows using Visual C++, under Cygwin, on a variety of Linux flavours, FreeBSD, several types of Solaris, OS X, and GNU Darwin. It's quite a tidy package, and if you're working with XML in C++ you should consider it. That's what I think, anyway, but I did write it.

Damon, in Montreal, disagrees. On 22 February 2005 he wrote

"The C++ port of SAX (a set of standard JAVA API for XML parsing) Arabica is totally unusable, there are even syntax errors like namespace errors in the socalled stable release, besides it does not have any reference manual!"

The following week he wrote

"It is an awful library with a bunch of syntax errors in its newest release (namespace error....), no documents available at all. Fail to compile it on Linux at all."

I didn't have any correspondence with Damon. I found his comments months later during a vanity searching session on Technorati. In the nearly 8 years since Arabica's initial release, it's been my experience that people very rarely write to you about your software. When they download your code, it either works or it doesn't. If it works, they get on with what they were doing. If it doesn't, they may take a moment to think you're an idiot, then they fling it away and try something else. More often than not, that first point of failure is the build. If it doesn't build, it's fallen at the very first hurdle.

The Arabica distribution currently contains around 150 source files. Since Arabica is largely implemented as C++ templates, the majority of the files don't need to be compiled and built separately. You just include them in your code. Only a handful, under 20, need to be built into a shared or static library.

Why did Damon have such a hard time? Why didn't I?

When I started on the code that would become Arabica, I was an angry man. I was having a very bad experience at work, with a rotten developer, who had handed over a horrible, verbose, bug-ridden piece of code. It was, allegedly, an XML parser. It read an XML wire-format and built a C++ object graph, what we now call deserialisation. At the time, I called it rubbish. (I know everybody has "he was awful" war stories, but this was, genuinely, one of the worst experiences of my working life. I'm getting angry just thinking about it again.)

I had argued that we shouldn't, absolutely shouldn't, build our own parser, but use one of the free parsers that were, even then, already available. When I had this lump of code dropped on me, I wanted to demonstrate just how awful it was. I grabbed the Expat source and got the build going on Windows. Next, I grabbed the recently released Java SAX interfaces, and ran through them, searching and replacing String with std::string. (SAX describes a streaming XML parser interface. It was initially developed to provide a common interface to XML parsers written in Java (as JDBC provides a common interface to databases), but there are now implementations in most languages.) That done, I hooked up Expat, which is a C library that deals in char*, to my new SAX classes. It worked. No bugs. Not bad for an afternoon's work. I released the code as an afterthought. I didn't think it was of particular interest, but the code I'd based it on was freely available, and I needed something to put on my website.

Over the subsequent months, and then years, I continued to work on Arabica on and off. There was a new version of SAX, which I incorporated. Other C and C++ XML parsers were released and I wrote SAX wrappers for them.

For most of that time, my primary development platform was Visual C++ 6 and then 7. Every now and again, I'd boot up a Linux box, refresh the Makefiles, and clean up the conformance errors GCC pointed out. It worked ok, after a fashion.

As the library grew, the build became increasingly fiddly. While Arabica provided bindings for Expat, libxml2, Xerces, and MSXML, you'd only want to build against one of those. That implies a certain amount of Makefile editing. I found out that some compiler/operating system combinations didn't support std::wstring, so parts of the build had to be conditionally excluded. C++ libraries have different levels of Standards conformance, and there are ambiguities in some places, so parts of my code have to be conditionally included to plug the gaps. Some platforms put things in different places, or expect certain types of files to have certain extensions, which needs more Makefile editing. Shared libraries, for example, generally have .so extensions. Cygwin, however, uses .dll, while OS X and other Darwin derivatives use .dylib.
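To give a flavour of the kind of platform check this implies, here is a minimal shell sketch that picks a shared library extension from the system name. It assumes the system name is whatever `uname -s` prints, and the function name shared_lib_ext is made up for illustration; it isn't part of Arabica, and a real build would need to handle rather more cases than these three.

```shell
#!/bin/sh
# Hypothetical helper, not part of Arabica: map a system name
# (as printed by `uname -s`) to a shared library extension.
shared_lib_ext() {
  case "$1" in
    CYGWIN*) echo ".dll" ;;    # Cygwin builds DLLs
    Darwin*) echo ".dylib" ;;  # OS X and other Darwin derivatives
    *)       echo ".so" ;;     # the usual default elsewhere
  esac
}

# Report the extension for the machine this script is running on.
echo "shared libraries here end in $(shared_lib_ext "$(uname -s)")"
```

Every Makefile fragment I shipped had, in effect, the answer to one of these questions hard-coded into it.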

At the time Damon was discarding Arabica as completely unusable, my build notes were

Building Arabica isn't hard, but it can be a little fiddly.
First, you will need to have at least one of the following parsers installed - expat, libxml, Xerces. If you're working on a Linux box, you probably have libxml or expat already installed. It's entirely possible to build in support for several parsers, but you'll probably only want one.
Next you need to build the SAX library, configuring it for your choice of parser, or parsers.
In an ideal world you'd just do ./configure and be done with it. Unfortunately, at the moment the dark recesses of template meta-programming are as nothing to getting autoconf going. One day ... So anyway, we have to resort to a little Makefile fiddling. What I'm going to describe is probably GNU Make specific, but for other Make variants you should be able to follow along ok.
Choose your parser (or parsers) as detailed above.
You'll need a relatively Standards compliant C++ compiler and library - gcc 3.x.y is okay, gcc 2.95.* will probably work if you use an alternative library such as STLPort.
Untar the Arabica source.
At the top level directory, you'll find a Makefile which builds everything. It uses the -include directive to pull in Makefile.header, which is where all the twiddly bits are.
Pull up Makefile.header in your favourite editor. Most of it should be pretty obvious - defining CXX to point to your C++ compiler and so on. There are some examples in the distribution you can use as a base.
The interesting Makefile.header macro is PARSER_CONFIG. PARSER_CONFIG controls which parsers Arabica will use, and also whether to compile in wide character support. For each parser you want to configure as -DUSE_parser. The choices are USE_EXPAT, USE_LIBXML2 and USE_XERCES. If you don't need, or your platform+compiler doesn't support wide characters (eg. Cygwin, gcc on Solaris) you'll also need to set -DARABICA_NO_WCHAR_T. For each parser you support, add the appropriate -lwhatever (-lexpat, -lxerces-c, -lxml2) to DYNAMIC_LIBS.
Run make. libSAX should build, possibly with a number of warnings about preprocessor tokens, and finish up in ./bin. If your parser's header files aren't installed in the usual places (/usr/include, /usr/local/include or whatever the default is for your platform), you'll have to edit INCS_DIRS in the Makefile. Once libSAX is built, everything else should build too.
The supplied Makefiles work for me on using gcc on Suse Linux 7.3, Cygwin and Solaris 7. If you can supply a Makefile.header for a new platform+compiler, I'd be delighted to receive it.
Once the SAX library is built, the DOM library is simplicity itself. You don't have to do anything! Arabica's DOM implementation is all headers files. If you want to use it, just include the appropriate parts, link the SAX library, and you're done.

You can see I had made some effort to ease this process. GNU Make supports an include mechanism, so I had moved all the platform specific pieces out into a separate Makefile fragment. This minimized the number of places that needed to be edited, but there was still a deal of manual intervention required. I supplied a number of platform specific versions, 6 at last count, but I didn't have regular access to all of the platforms in question. Note also the equivocation - "that should be it", "will probably work", "works for me". It wasn't reliable, and I knew it.

It was a maintenance bother too. As I added more test and example programs, I had more Makefiles to maintain. When I added the XPath engine, which uses Boost Spirit, I received emails from people who didn't need XPath asking me how to leave it out of the build, as their builds were now broken.

I had this code, code I knew was good and portable and useful, but I had this cruddy, wobbly, unreliable build system that had accreted around the outside. It was awkward for me, off-putting for other people. At least one person thought I was a useless idiot. Something had to change.

I needed an alternative to my motley collection of Makefile bits and pieces. At the very least it had to meet the following criteria

be able to find Arabica's prerequisites - at least an XML parser and optionally Boost
identify whether std::wstring was supported
detect platform specific file extensions
track file dependencies
be at least as easy to maintain as my existing setup
stand a better than even chance of working on the random machine that somebody has just downloaded my code to
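
Several of these criteria come down to the same trick: try something small on the target machine and see whether it works. As a rough sketch of the std::wstring check, assuming a C++ compiler is reachable as $CXX or, failing that, as c++, a probe could look like the following. The have_wstring name and the probe program are illustrative only; they aren't taken from Arabica or from any generated configure script.

```shell
#!/bin/sh
# Hypothetical probe: succeeds (exit status 0) if a tiny program using
# std::wstring compiles and links, fails (status 1) otherwise.
# Note that a missing compiler is also reported as "no".
have_wstring() {
  cxx="${CXX:-c++}"
  tmpdir=$(mktemp -d) || return 2
  cat > "$tmpdir/probe.cpp" <<'EOF'
#include <string>
int main() { std::wstring s(L"probe"); return s.size() == 5 ? 0 : 1; }
EOF
  if "$cxx" -o "$tmpdir/probe" "$tmpdir/probe.cpp" >/dev/null 2>&1; then
    result=0
  else
    result=1
  fi
  rm -rf "$tmpdir"
  return $result
}

if have_wstring; then
  echo "std::wstring: yes"
else
  echo "std::wstring: no"
fi
```

A "no" here is the situation where, in my hand-edited setup, somebody had to remember to add -DARABICA_NO_WCHAR_T themselves.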

There are now many alternatives to Make. There's Ant, its Groovy derivative Gant, and its .NET-alike NAnt. There's Cons and SCons. There's Jam and BJam. There's Rake and A-A-P and a whole host more I'd not even heard of. If you look at any of these tools, chances are there's at least a passing reference to how much better than Make it is.

I didn't consider any of them, not even for a moment.

If you download some arbitrary program or library written in C or C++, from Sourceforge, Tigris, Savannah, or wherever, it won't, as a rule, use any of those tools. Chances are pretty good that it won't need anything like the fiddling that Arabica did. You expect something like this:

$ wget http://somewhere/path/to/somelib.tar.gz
$ tar zxf somelib.tar.gz
$ cd somelib
$ ./configure
[... lots of output snipped ...]
$ make
[... lots more output snipped ...]
$ make install
[... a little bit more output, also snipped ...]

Anything else violates the principle of least surprise by a considerable distance. There was only one choice for Arabica - GNU Autotools.

The magic of ./configure; make; make install is provided by GNU Autotools. Autotools is actually three separate packages - autoconf, automake, and libtool. Autoconf generates the configure script, which is what makes a package portable and configurable. Automake is a Makefile generator, used with autoconf to produce Makefiles based on what configure finds out about the system. Libtool is a set of shell scripts to build shared libraries in a generic fashion. In reality, you don't use one without the others. As far as I can tell Autotools is not an official name, but everybody knows what it means.

You might have noticed a disparaging reference to configure in the build notes above. I've actually been here before. Six years ago, I attempted to convert Arabica as was to use Autotools. Even armed with a hot off the press copy of New Riders' GNU Autoconf, Automake, and Libtool - a book written by the primary Autotools maintainers - I made absolutely no progress at all. I found the whole process so dispiriting and confusing that I abandoned my efforts, subsequently consigning myself to years of creaking Makefiles and the contempt of Damon from Montreal.

Six years is a long time in programming. Despite my previous bad experience, I had no doubts that I would, in relatively short order, autoconfiscate my project.

In part two, I tell you how I did it.

