Arabica, an XML Toolkit written in C++

Arabica is an XML Toolkit offering SAX and DOM parsers, a hybrid DOM plus callback parser, an XPath engine, and a partial and still developing XSLT engine. Arabica is written in C++ and builds on all common platforms.

My two primary aims for Arabica are correctness and ease of use. By correctness I mean adherence, as far as possible, to the relevant standards and recommendations, both with respect to C++ as well as XML.

Ease of use is more nebulous. Many of the programming interfaces Arabica provides are defined in IDL or Java, so direct translation into C++ is not always desirable, still less possible. I aim to write them in idiomatic C++ rather than, say, "Java in C++" or "C with Classes". I promote, for example, passing parameters by const reference rather than by pointer.

Ease of use extends to other things too. You shouldn't have to manually manage any memory or resources Arabica creates and the ownership of objects passed into Arabica shouldn't be in doubt. Most significantly it doesn't impose a string type on you. You can use Arabica with std::string, std::wstring, or, by writing a simple adaptor, with any string class of your choice using the encoding of your choice. Arabica is parameterised on the string and adaptor classes, allowing them to be injected right into the heart of the library, minimising compiled code size, reducing runtime overhead, and providing as natural an API as possible. If, as is probably quite likely, you are not interested in the minutiae of such things, it provides reasonable defaults and should just work.

I am almost certainly getting ahead of myself. Let's take tour of what Arabica offers, see it action, and get a feel for it. Arabica's XML slinging capabilities fall naturally into four main layers, each one building on the layer below, and you can use as much or as little as you need. At the bottom is the SAX layer, so let's begin there.

SAX

SAX, the simple API for XML, is an event-driven serial-access mechanism for accessing XML documents. As the parser encounters different pieces of an XML document, it raises events describing those pieces. SAX was originally developed to provide a common API to XML parsers written in Java, and proved such a success that SAX-like interfaces are available in most other languages.

To get a feel for how a SAX parser works, consider the following little piece of XML shown in Listing 1. A parser implementing SAX breaks down the document into a linear sequence of events, such as that shown in Listing 2.

Listing 1: A simple XML document
<example>
  <para>This is a very simple example.</para>
  <para>I hope it will make sense.</para>
</example>
		

Listing 2: A Sequence of SAX events
start document
start element: example
start element: para
characters: This is a very simple example.
end element: para
start element: para
characters: I hope it will make sense.
end element: para
end element: example
end document
		

Listing 3: The ContentHandler interface
template<class string_type, class string_adaptor>
class ContentHandler
{
  public:
    virtual void startDocument() = 0;
    virtual void endDocument() = 0;

    virtual void startElement(const string_type& namespaceURI, 
                              const string_type& localName,
                              const string_type& qName,
                              const AttributesT attrs) = 0;
    virtual void endElement(const string_type& namespaceURI, 
                            const string_type& localName,
                            const string_type& qName) = 0;

    virtual void characters(const string_type& ch) = 0;

    virtual void processingInstruction(const string_type& target,
                                       const string_type& data) = 0;
      
    ... 
};
		

An application registers callback functions with the parser. The parser raises events by calling the appropriate function. SAX defines quite a large number of possible callback functions, grouping related functions into interfaces. The most important of these interfaces is ContentHandler. In C++ we define interfaces using abstract classes. Arabica's ContentHandler, in part, is shown in Listing 3. Note the template parameters but don't worry about them, I'll discuss them in detail later.

In many cases event this set of callbacks is more than you need, so Arabica provides a do-nothing implementation, DefaultHandler, to use as a concrete base class.

Listing 4 gives a full example. It reads an XML document and generates Pyx, an alternative line-oriented notation.

In this example, the SAX2PYX class overrides those callback functions it needs, and leaves the do-nothing DefaultHandler to do the rest. The main body of the program creates a parser,

Arabica::SAX::XMLReader<std::string> parser;
and hooks up the callback functions
parser.setContentHandler(handler);
parser.setErrorHandler(handler);
The parser needs to know where to read data from. An InputSource allows a file name, URI, or a stream to be used as the data source. In this case what was passed on the command line is probably a file name.
Arabica::SAX::InputSource<std::string> source(argv[i]);
Now, everything is set up and we're ready to go.
parser.parse(source);
This sets the parser reading the XML document. As encounter elements, attributes, and so on, the corresponding member functions on SAX2PYX are called.

Feeding this the sample document above produces the output shown in Listing 5.

SAX's event plus callback nature might appear unwieldy, particularly if you haven't worked with event-driven systems before. It does, however, have a number of things going for it. The memory overhead of SAX parsing is very low and is, generally, unrelated to document size. Typically, it is related to element nesting depth, but for most intents and purposes it can be considered constant and small. This means that a SAX based application should be able to process very large documents, even exceeding available memory.

SAX also allows you to focus on those parts of the document you are interested in. On a coarse level, for instance, if you are not interested in the document text, you don't even have to look at it. If you are interested in a particular element, a simple check at the top of startElement is all you need.

SAX has its difficuties too. It provides forward, linear access to a document, that is it starts and the top and carries on, element after element, until it reaches the end. If you need random access, SAX is probably not for you.

SAX event-driven nature can make tracking state awkward. If you are interested in, say, the first address element following a customer element, you can see you might get caught up in a lot of flag twiddling. There is a way out here though, as SAX allows the registered callback handlers to be changed while a parse is in progress. By building a set of related handlers, explicit and potentially tricky state tracking can be avoided. There's not space for a proper example here, but Arabica's XSLT engine uses exactly this approach to compile XSLT stylesheets into executable objects.

SAX also provides explicit support for filters. A SAX filter presents the ContentHandler interface on one side, and the Parser interface on the other. A filter can suppress events, generate additional events, or modify events "in flight". By building stacks of filters, it's possible to build quite sophisticated XML processors in a straightforward and moduler way. Arabica includes several filters, including classes to strip out document text, strip out everything but text, and to monitor and track xml:base attributes.

DOM

The DOM is the other major XML parsing approach. While SAX gives low-level, piece-by-piece, read-only access to a document, a DOM parser gives you a high-level, in-memory representation of an XML document. DOM means Document Object Model, and that's just what it is.

The object model takes the form a tree mirroring the hierarchical structure of an XML document. Given our earlier example in Listing 1, a DOM parser builds a tree of objects, as shown in Figure 1.

Figure 1: Tree of DOM Objects

The XML document the DOM models is composed of nodes. DOM defines a class for each type of node: element, text, comment, and so on; all of which are subclasses of the node base class. A node in the tree can have children, it may have siblings. All nodes have a parent node, except the top most document and attributes (which have an owning element instead).

Here, the example element node has two children, the para element nodes. They both a single text node child. The base class Node gives access to information about a node: its type, name, and so on; and to its children, siblings, and parent. By moving from parent to child, from sibling to sibling, it is possible to navigate anywhere within the document. The specific node subclasses: Element, ProcessingInstruction, Text, and so on; provide additional functionality but, as a rule, you rarely need to down case from Node.

The code in Listing 6 again converts an XML document to Pyx notation, this time using a DOM parser and walking the tree it builds.

At the heart of this code is a getFirstChild/getNextSibling recursive loop, which is a common pattern for exhaustively walking a DOM tree. Notice that attributes cannot be reached by this loop. Attributes exist to the side of the tree, and are reached from their owning element. The cast from Node to Element, in order to call getAttributes, is the exception to the "no need to downcast" rule.

Note the lack of any explicit memory management. Arabica's DOM implement looks after itself, and there is no need to track tree lifetimes and explicitly free or release or delete a DOM tree. This is, as far as I'm aware, a unique feature of Arabica amongst C++ DOM implementations.

As well as reading a DOM tree, you can also update it. You can add new pieces to it, cut pieces off it, change existing pieces. Compared with SAX, this sounds like a random access read/write nirvana. There is a price, though, and that price is memory. A DOM tree can require two or even more times as much memory as the size of the document it represents. Each node has to maintain pointers to its parent and children, in addition to the information about itself. Even using a string pool to avoid redundant copies of text laying around, the plain fact is that a DOM requires a lot of internal housekeeping. Consequently, this means that the DOM can struggle with large documents.

Template parameters

You'll have noticed a great many template parameters flying around the place in the code I've shown - string_type and string_adaptor. You may even be wondering what they are for, particularly as my examples just use std::string and seem to ignore the string_adaptor parameter.

For the C++ library writer, string handling can be a vexacious subject. While the Standard Library has provided std::string for years, many widely used frameworks use their own string types - QString, wxString, CString, GLib::ustring, and so on. It's also not uncommon to find legacy string classes used in internal toolkits. Even if you choose to use std::string as your library's string type, you ignore the growing band of those using std::wstring. Consequently, any choice you make is to some extent wrong.

Arabica, I hope, makes a "less wrong" choice. Arabica lets you use any string you like. That's the string_type template parameter. Clearly though, Arabica can't magically work with any arbitrary string class that's thrown at it. That's where the second template parameter, string_adaptor, comes in. The string_adaptor's job is to mediate between what Arabica expects and the particular string class provides.

Out of the box, Arabica knows about std::string and std::wstring. This is why, in the examples above, there's no need for an explicit string_adaptor as the defaults are sufficient.

Internally, Arabica never calls any member function on a string_type object. Instead, it calls static member function on the string_adaptor, passing in the string_type object. In is up to the string_adaptor class to call the appropriate member function.

Here's a snippet from Arabica's internals,

value_ = string_adaptor::construct_from_utf8("");
for(DOMNode_impl* c = Node::getFirstChild(); c != 0; c->getNextChild())
string_adaptor::append(value_, c->getNodeValue());
In almost every case, the body of string_adaptor function is trivial. In this case, the default string_adaptor append function is
static void append(std::string& str, const std::string& a)
{
str.append(a);
}

A full class for use as a string_adaptor has approximately 20 member functions, the majority of which will probably be implemented in one line. The vast majority of Arabica is implemented as template classes. The rules for compiling template classes mean that code is only generated for those member functions that are needed. The means, depending on what you actually do with Arabica, it may not be necessary to even define, let alone implement, all of the string_adaptor functions.

As you will be aware, XML documents are Unicode text, which means we, as programmers, have to be at least aware of character encoding issues. The current C++ Standard makes no reference to character encoding (although I understand that next version, due in 2009, will), so we are somewhat on our own here. By default, Arabica delivers UTF-8 encoded std::strings, or UTF-16 encoded std::wstrings. While not a perfect solution it seems to suit most situations, at least in my experience. Where it doesn't suit, the string_adaptor mechanism allows an alternate encoding to be substituted.

Arabica builds on an existing parser library. It can use Expat, libxml2, Xerces, or MSXML. Expat and libxml2 return character strings encoded using UTF-8, while Xerces and MSXML return their character strings using UTF-16. When Arabica needs a new string_type object, it calls a factory method on string_adaptor, either construct_from_utf8 or construct_from_utf16. The string_adaptor is responsible for any necessary transcoding from UTF-8 (or UTF-16) to your desired encoding, before constructing the string_type object. Arabica includes stream classes and code conversion facets to assist here.

By parameterising on string_type and string_adaptor, coupled with reasonable defaults, Arabica attempts to strike the balance between simplicity and flexibility. Using the defaults, Arabica probably does what you need. However if it doesn't, then by supplying a custom string_type and string_adaptor Arabica can be customised at its very core. It will then give you strings of your choosing, in your choice of encoding, without the need for a separate translation layer or the inconvenience of using different string classes in different parts of your programs. As a little bonus, since the calls on string_adaptor are through static member functions, the compiler is almost certain to inline the calls away.

XPath

The DOM provides read and write access to the whole of an XML document. It does not, however, lend itself to easy random access. Imagine we wanted to find all the "para" elements in a document. It would require the kind of exhaustive walk using the DOM to Pyx example we looked at in Listing 6. Finding all the "para" elements immediately following a "title" element would be a little more involved. But what about some arbitrary query? It could all get frightfully fiddly very quickly.

The W3C defined XPath exactly fills this niche. XPath is a language for searching and querying XML documents, and Arabica builds it over its DOM implementation.

Using Arabica's XPath engine is straightforward. Listing 7 shows a simple XPath grep utility, which searches XML documents for arbitrary XPaths. Listing 8 shows it in action.

I haven't dwelt on the details of Arabica's internal implementation, but I must mention that without the remarkable Boost::Spirit library it is extremely unlikely that the XPath engine would have been written. Boost::Spirit is parser generator library which allows an EBNF grammar to be transcribed, almost directly, in C++ code. I took the XPath grammar from the XPath recommendation and codified it, literally codified it, in about two hours. As I recall, I was watching the television at the same time. It needed a little bit of extra work to eliminate left recursion, which is a straightforward and rather mechanical transformation, and with that done I had a complete XPath parser. Furthermore, it was a parser I could relate directly to the original grammar productions, and so could be confident was correct. With that parsing taken care of, I could concentrate on writing the code to do the work of evaluating XPath. If you are looking at building a parser, you really owe it to yourself to look at Boost::Spirit.

XSLT

The three chunks of Arabica I've covered so far, SAX, the DOM, and the XPath engine, are all complete and sound pieces of code. From its beginning as a quick translation of a few pieces of Java, Arabica has grown to be a solid and useful library. While I never planned to go beyond that early SAX code, each piece seemed to lead to the next. Working on SAX pointed naturally to writing a DOM implementation. Using the DOM for a while nudged then pushed me toward XPath. But that was going to be it. I was going to look after the code I had, but pull down the shutters on new development.

Well, that's what I told myself.

Although I can no longer find the reference, I'm sure I read Daniel Veillard, the author of libxml and libxslt, had said "once you have an XPath engine, implementing XSLT is easy". Perhaps he didn't, but I got that in my head and it wouldn't go away. As a result, as of August this year, Arabica includes Mangle, an XSLT engine.

Mangle isn't yet complete, but covers a lot of the common cases. I'm quietly pleased with it (and myself, if I'm honest). Veillard's almost right, by the way. It's not easy, but it is relatively straightforward. It's a lot of fun too. XSLT is, you may be surprised to learn, a Turing complete functional programming language and consequently it is an interesting problem to work on.

History And Future

Working on Arabica these past eight years has been, and continues to be, very rewarding and rather fun. I've met some interesting people, and some very interesting conversations. It's allowed me to work on a body of code for extended period of time. When work commitments have taken me in other areas, I've been able to keep at least a toe in the C++ waters. I've been able to experiment with approaches and techniques that I may not otherwise have been able to try. I'm a better programmer for working on Arabica. Finally, since I decided to release Arabica under an open source license, I know people are downloading the code and playing with it. Some of them are even using it, and what could be better than that?

The Arabica source code is released under a BSD style license. The code can be used by anyone in source or binary form for any purpose, with only a simple requirement to retain copyright notices.

Going forward, I'm not sure what comes next once the Mangle XSLT engine is complete. Right now XProc, the XML pipeline language, looks interesting. I might look at XQuery, or go back to the bottom of the stack and write an XML parser. I don't know. It depends on how I feel, what catches my eye, and the emails I receive. If, following this article, you decide to look at Arabica, do let me how you find it and what you might like.


#include <SAX/helpers/DefaultHandler.hpp>
#include <SAX/InputSource.hpp>
#include <SAX/XMLReader.hpp>
#include <iostream>

class SAX2PYX : public Arabica::SAX::DefaultHandler<std::string>
{
public:
  virtual void startElement(const std::string& namespaceURI, const std::string& localName,
                            const std::string& qName, const Arabica::SAX::Attributes<std::string>& atts)
  {
    std::cout << '(' << localName << std::endl;
    
    for(int i = 0; i < atts.getLength(); ++i)
    {
      std::cout << 'A' << atts.getLocalName(i)
                << ' ' << escape(atts.getValue(i))
                << std::endl;
    } // forward ...
  } // startElement
  
  virtual void endElement(const std::string& namespaceURI, const std::string& localName,
                          const std::string& qName)
  {
    std::cout << ')' << localName << std::endl;
  } // endElement
  
  virtual void characters(const std::string& ch)
  {
    std::cout << '-' << escape(ch) << std::endl;
  } // characters
  
  virtual void processingInstruction(const std::string& target, const std::string& data)
  {
    std::cout << '?' << target
              << ' ' << data << std::endl;
  } // processingInstruction
  
  virtual void warning(const Arabica::SAX::SAXParseException<std::string>& e) { fatalError(e); }
  virtual void error(const Arabica::SAX::SAXParseException<std::string>& e) { fatalError(e); }
  
private:
  std::string escape(const std::string& str) const
  {
    std::string estr(str);
    std::string::size_type i(estr.find("\n"));
    while(i != std::string::npos)
    {
      estr.replace(i, 1, "\\n", 2);
      i = estr.find("\n", i);
    } // while ...
    return estr;
  } // escape
}; // class SAX2PYX

int main(int argc, char* argv[])
{
  if(argc == 1) 
  {
    std::cout << "Usage : " << argv[0] << " xmlfile ... " << std::endl;
    return 0;
  } // if(argc == 0)
  
  SAX2PYX handler;
  
  for(int i = 1; i < argc; ++i)
  {
    try
    {
      Arabica::SAX::XMLReader<std::string> parser;
      parser.setContentHandler(handler);
      parser.setErrorHandler(handler);
      
      Arabica::SAX::InputSource<std::string> source(argv[i]);
      parser.parse(source);
    } // try
    catch(std::runtime_error& e)
    {
      std::cerr << "Parse problem " << e.what() << std::endl;
    } // catch  
  } // for ...
  
  return 0;
} // main
    
Back to article
$ ./pyx simple.xml
  (example
  -\n
  -
  (para
  -This is a very simple example.
  )para
  -\n
  -
  (para
  -I hope it will make sense.
  )para
  -\n
  )example
    
Back to article
  #include <string>

  #include <iostream>
  #include <DOM/SAX2DOM/SAX2DOM.hpp>

  void generate_pyx(Arabica::DOM::Element<std::string> element);
  std::string escape(const std::string& str);

  int main(int argc, char* argv[])
  {
    if(argc < 2) 
    {
      std::cout << "Usage : " << argv[0] << " xmlfile ... " << std::endl;
      return 0;
    } // if(argc < 2)

    Arabica::SAX2DOM::Parser<std::string> domParser;

    for(int i = 1; i < argc; ++i)
    {
      std::string file(argv[i]);
      Arabica::SAX::InputSource<std::string> is;
      is.setSystemId(file);
      domParser.parse(is);

      Arabica::DOM::Document<std::string> doc = domParser.getDocument();
      if(doc == 0)
      {
        std::cerr << "Error parsing " << file << std::endl;
        continue;
      }

      generate_pyx(doc.getDocumentElement());
    } 

    return 0;
  } // main

  void generate_pyx(Arabica::DOM::Element<std::string> element)
  {
    std::cout << '(' << element.getNodeName() << std::endl;

    if(element.hasAttributes())
    {
      Arabica::DOM::NamedNodeMap<std::string> attrs = element.getAttributes();
      for(int a = 0; a != attrs.getLength(); ++a)
      {
        Arabica::DOM::Node<std::string> attr = attrs.item(a);
        std::cout << 'A' << attr.getNodeName() 
                  << ' ' << escape(attr.getNodeValue()) 
                  << std::endl;
      } // for ...
    } // if ...  

    for(Arabica::DOM::Node<std::string> n = element.getFirstChild(); n != 0; n = n.getNextSibling())
    {
      switch(n.getNodeType())
      {
        case Arabica::DOM::Node<std::string>::ELEMENT_NODE:
          generate_pyx(static_cast(n));
          break;
        case Arabica::DOM::Node<std::string>::TEXT_NODE:
        case Arabica::DOM::Node<std::string>::CDATA_SECTION_NODE:
          std::cout << '-' << escape(n.getNodeValue()) << std::endl;
          break;
        case Arabica::DOM::Node<std::string>::PROCESSING_INSTRUCTION_NODE:
          std::cout << '?' << n.getNodeName() 
                    << ' ' << n.getNodeValue() << std::endl;
          break;
      } // switch ...
    } // for ...

    std::cout << ')' << element.getNodeName() << std::endl;
  } // generate_pyx

  std::string escape(const std::string& str)
  {
    std::string estr(str);
    std::string::size_type i(estr.find("\n"));
    while(i != std::string::npos)
    {
      estr.replace(i, 1, "\\n", 2);
      i = estr.find("\n", i);
    } // while ...
    return estr;
  } // escape
  
Back to article
Listing 7: A simple XPath grep utility
#include <string>

#include <SAX/helpers/CatchErrorHandler.hpp>
#include <DOM/SAX2DOM/SAX2DOM.hpp>
#include <DOM/io/Stream.hpp>
#include <XPath/XPath.hpp>

////////////////////////////////////////////////
int main(int argc, char* argv[])
{
  if(argc < 3) 
  {
    std::cout << "Usage : " << argv[0] << " xpath xmlfile ... " << std::endl;
    return 0;
  } // if(argc < 3)

  Arabica::XPath::XPath<std::string> xpathParser;
  Arabica::XPath::XPathExpressionPtr<std::string> xpath;
  try {
    xpath = xpathParser.compile_expr(argv[1]);
  }
  catch(const std::runtime_error& error) {
    std::cout << "XPath compilation error: " << error.what() << std::endl;
    return 0;
  }
  
  Arabica::SAX2DOM::Parser<std::string> domParser;
  Arabica::SAX::CatchErrorHandler<std::string> eh;
  domParser.setErrorHandler(eh);

  for(int i = 2; i < argc; ++i)
  {
    std::string file(argv[i]);
    Arabica::SAX::InputSource<std::string> is;
    is.setSystemId(file);

    if(file != "-")
      domParser.parse(is);
    else
    {
      is.setSystemId("stdin");
      is.setByteStream(std::cin);

      domParser.parse(is);
    } // if(file != "-")

    if(!eh.errorsReported())
    {
      Arabica::DOM::Document<std::string> doc = domParser.getDocument();
      Arabica::XPath::XPathValuePtr<std::string> result;
      result = xpath->evaluate(doc);
      if(result->asBool())
      {
        std::cout << file << std::endl;
        if(result->type() == Arabica::XPath::NODE_SET)
        {
          const Arabica::XPath::NodeSet<std::string>& ns = result->asNodeSet();
          for(unsigned int i = 0; i < ns.size(); ++i)
          {
            Arabica::DOM::Node<std::string> n = ns[i];
            std::cout << n << std::endl;
          }
        } // if ..
      } // if ...
    }
    else
    {
      std::cerr << eh.errors() << std::endl;
      eh.reset();
    } // if ...
  } // for ...

  return 0;
} // main
    
Back to article
Listing 8: xgrep output
  # select all the para elements
  $ ./xgrep //para simple.xml
  simple.xml
  <para>This is a very simple example.</para>

  <para>I hope it will make sense.</para>
  
  # find the first para child of the example element
  $ ./xgrep /example/para[1] simple.xml
  simple.xml
  <para>This is a very simple example.</para>

  # find all the text containing the word 'hope'  
  $ ./xgrep '//text()[contains(., "hope")]' simple.xml
  simple.xml
  I hope it will make sense.
    
Back to article

Jez Higgins