Adventures in Autoconfiscation - Part One of Three

Arabica is an XML toolkit written in C++. It provides a SAX interface for streaming XML parsing, a DOM interface for in-memory XML processing, and an XPath engine for easy DOM access. In the next release or two, it will add an XSLT processor. Arabica supports std::string, std::wstring or pretty much any other crazy string class. The code itself is good, honest, standard C++, which my experience shows is highly portable. I've built Arabica on Windows using Visual C++, under Cygwin, on a variety of Linux flavours, FreeBSD, several types of Solaris, OS X, and GNU Darwin. It's quite a tidy package, and if you're working with XML in C++ you should consider it. That's what I think, anyway, but I did write it.

Damon, in Montreal, disagrees. On 22 February 2005 he wrote

"The C++ port of SAX (a set of standard JAVA API for XML parsing) Arabica is totally unusable, there are even syntax errors like namespace errors in the socalled stable release, besides it does not have any reference manual!"

The following week he wrote

"It is an awful library with a bunch of syntax errors in its newest release (namespace error....), no documents available at all. Fail to compile it on Linux at all."

I didn't have any correspondence with Damon. I found his comments months later during a vanity searching session on Technorati. In the nearly 8 years since Arabica's initial release, it's been my experience that people very rarely write to you about your software. When they download your code, it either works or it doesn't. If it works, they get on with what they were doing. If it doesn't, they may take a moment to think you're an idiot, then they fling it away and try something else. More often than not, that first point of failure is the build. If it doesn't build, it's fallen at the very first hurdle.

The Arabica distribution contains, currently, around 150 source files. Since Arabica is largely implemented as C++ templates, the majority of the files don't need to be compiled and built separately. You just include them in your code. Only a handful, under 20, need to be built into a shared or static library.

Why did Damon have such a hard time? Why didn't I?

Come with me. Come with me on a journey through time.

When I started on the code that would become Arabica, I was an angry man. I was having a very bad experience at work, with a rotten developer, who had handed over a horrible, verbose, bug-ridden piece of code. It was, allegedly, an XML parser. It read an XML wire-format and built a C++ object graph, what we now call deserialisation. At the time, I called it rubbish. (I know everybody has "he was awful" war stories, but this was, genuinely, one of the worst experiences of my working life. I'm getting angry just thinking about it again.)

I had argued that we shouldn't, absolutely shouldn't, build our own parser, but use one of the free parsers that were, even then, already available. When I had this lump of code dropped on me, I wanted to demonstrate just how awful it was. I grabbed the Expat source and got the build going on Windows. Next, I grabbed the recently released Java SAX interfaces, and ran through them, searching and replacing String with std::string. (SAX describes a streaming XML parser interface. It was initially developed to provide a common interface to XML parsers written in Java (as JDBC provides a common interface to databases), but there are now implementations in most languages.) That done, I hooked up Expat, which is a C library that deals in char*, to my new SAX classes. It worked. No bugs. Not bad for an afternoon's work. I released the code as an afterthought. I didn't think it was of particular interest, but the code I'd based it on was freely available, and I needed something to put on my website.

Over the subsequent months, and then years, I continued to work on Arabica on and off. There was a new version of SAX, which I incorporated. Other C and C++ XML parsers were released and I wrote SAX wrappers for them.

For most of that time, my primary development platform was Visual C++ 6 and then 7. Every now and again, I'd boot up a Linux box, refresh the Makefiles, and clean up the conformance errors GCC pointed out. It worked ok, after a fashion.

As the library grew, the build became increasingly fiddly. While Arabica provided bindings for Expat, libxml2, Xerces, and MSXML, you'd only want to build against one of those. That implies a certain amount of Makefile editing. I found out that some compiler/operating system combinations didn't support std::wstring, so parts of the build had to be conditionally excluded. C++ libraries have different levels of Standards conformance, and there are ambiguities in some places, so parts of my code have to be conditionally included to plug the gaps. Some platforms put things in different places, or expect certain types of files to have certain extensions, which needs more Makefile editing. Shared libraries, for example, generally have .so extensions. Cygwin, however, uses .dll, while OS X and other Darwin derivatives use .dylib.
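To give a flavour of the kind of per-platform decision involved, here is a minimal sketch of the shared library extension choice, written as a shell function. It is an illustration only, not taken from Arabica's Makefiles: the function name shared_lib_ext is hypothetical, and it assumes the platform can be identified from the output of uname -s (which begins with CYGWIN under Cygwin and is Darwin on OS X).

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: map an operating system name
# (as printed by `uname -s`) to the shared library extension described above.
shared_lib_ext() {
    # Use the first argument if one is given, otherwise ask the running system.
    os_name="${1:-$(uname -s)}"
    case "$os_name" in
        CYGWIN*) echo ".dll" ;;    # Cygwin
        Darwin*) echo ".dylib" ;;  # OS X and other Darwin derivatives
        *)       echo ".so" ;;     # the general case
    esac
}

# Example use: report the extension for the platform this script runs on.
echo "Shared libraries on this platform end in $(shared_lib_ext)"
```

Multiply this by every library location, compiler quirk, and optional parser, and it becomes clear why hand-edited Makefile fragments don't scale.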

At the time Damon was discarding Arabica as completely unusable, my build notes were

You can see I had made some effort to ease this process. GNU Make supports an include mechanism, so I had moved all the platform specific pieces out into a separate Makefile fragment. This minimized the number of places that needed to be edited, but there was still a deal of manual intervention required. I supplied a number of platform specific versions, 6 at last count, but I didn't have regular access to all of the platforms in question. Note also the equivocation - "that should be it", "will probably work", "works for me". It wasn't reliable, and I knew it.

It was a maintenance bother too. As I added more test and example programs, I had more Makefiles to maintain. When I added the XPath engine, which uses Boost Spirit, I received emails from people who didn't need XPath asking me how to leave it out of the build, as their builds were now broken.

I had this code, code I knew was good and portable and useful, but I had this cruddy, wobbly, unreliable build system that had accreted around the outside. It was awkward for me, off-putting for other people. At least one person thought I was a useless idiot. Something had to change.

Out with Makefiles

I needed an alternative to my motley collection of Makefile bits and pieces. At the very least it had to meet the following criteria

There are now many alternatives to Make. There's Ant, its Groovy derivative Gant, and its .NET-alike Nant. There's Cons and Scons. There's Jam and BJam. There's Rake and A-A-P and a whole host more I'd not even heard of. If you look at any of these tools, chances are there's at least a passing reference to how much better than Make it is.

I didn't consider any of them, not even for a moment.

If you download some arbitrary program or library written in C or C++, from Sourceforge, Tigris, Savannah, or wherever, it won't, as a rule, use any of those tools. Chances are pretty good that it won't need anything like the fiddling that Arabica did. You expect something like this:

$ wget http://somewhere/path/to/somelib.tar.gz
$ tar zxf somelib.tar.gz
$ cd somelib
$ ./configure
[... lots of output snipped ...]
$ make
[... lots more output snipped ...]
$ make install
[... a little bit more output, also snipped ...]

Anything else violates the principle of least surprise by a considerable distance. There was only one choice for Arabica - GNU Autotools.

In With Makefile.am

The magic of ./configure; make; make install is provided by GNU Autotools. Autotools is actually three separate packages - autoconf, automake, and libtool. Autoconf creates the configure script, which makes a package portable and configurable. Automake is a Makefile generator, used with autoconf to produce Makefiles based on what configure finds out about the system. Libtool is a set of shell scripts to build shared libraries in a generic fashion. In reality, you don't use one without the others. As far as I can tell Autotools is not an official name, but everybody knows what it means.

You might have noticed a disparaging reference to configure in the build notes above. I've actually been here before. Six years ago, I attempted to convert Arabica as was to use Autotools. Even armed with a hot off the press copy of New Riders' GNU Autoconf, Automake, and Libtool - a book written by the primary Autotools maintainers - I made absolutely no progress at all. I found the whole process so dispiriting and confusing that I abandoned my efforts, subsequently consigning myself to years of creaking Makefiles and the contempt of Damon from Montreal.

Six years is a long time in programming. Despite my previous bad experience, I had no doubts that I would, in relatively short order, autoconfiscate my project.

In part two, I tell you how I did it.


Jez Higgins