Sunday 03 March, 2013
#Rather unbelievably, at least for me, some of this Arabica information has been translated into Serbo-Croat by Vera Djuraskovic.

Wednesday 28 November, 2012
#Arabica Release - 2012 November

Since putting the Arabica source up on GitHub there seems to have been a little surge in interest in it. It might be coincidence, of course, but I've received several emails and patches over the past few weeks. One of those emails prompted me to do something I'd been putting off - parameterise the XSLT engine on string type. All the rest of the library is as string type agnostic as I could make it, allowing you to plug in std::string, std::wstring, or whatever other string class you might fancy. (In testing, I actually use a string type with no public member functions.) The XSLT engine was the last holdout, but no more, and it's the better for it.

If you've been using the XSLT engine what this means is that where you wrote

    Arabica::XSLT::StylesheetCompiler compiler = ...
    std::auto_ptr<Arabica::XSLT::Stylesheet> stylesheet = ...
you now have to write
    Arabica::XSLT::StylesheetCompiler<std::string> compiler = ...
    std::auto_ptr<Arabica::XSLT::Stylesheet<std::string> > stylesheet = ...
If you haven't been using the XSLT engine because the rest of your application uses std::wstring, then now there's nothing to stop you. Dive in!

Source tar.bz2
http://sourceforge.net/projects/arabica/files/arabica/November-12/arabica-2012-November.tar.bz2/download

Source tar.gz
http://sourceforge.net/projects/arabica/files/arabica/November-12/arabica-2012-November.tar.gz/download

Source zip
http://sourceforge.net/projects/arabica/files/arabica/November-12/arabica-2012-November.zip/download


Changes and Bug Fixes

DOM

XSLT

Build and installation



Friday 07 September, 2012
#Arabica on GitHub
I've migrated the Arabica source code to GitHub.

Thursday 30 December, 2010
#DOM Conformance Tests

Over the past few days, I've been working on Arabica's DOM conformance. Until now, it's been based entirely on my reading, or not, of the relevant W3C Recommendations. I've always been pretty confident it was correct, but a recent bit of undirected Googling reminded me of the W3C DOM Conformance Test Suites and I thought "why not".

The W3C tests are defined in XML and then transformed to code using XSLT. It comes with stylesheets to generate Java JUnit tests and JavaScript JSUnit tests. Monkeying up something to generate Arabica-style CppUnit code only took a few minutes, and getting that code compiling and running only took a little bit longer than that. Embarrassingly, some of the existing DOM code didn't compile and nobody had ever noticed. Interrogating a doctype for its entities just isn't that common, I guess.

With that done, and to my relief, nearly all of the 500 odd tests in the Level 1 Core suite passed first time. Most of those that didn't relied on loading an external DTD, and those that remained were primarily around the behaviour of entity references and child nodes of attributes. Good to have it all fixed though.

Thanks to those who put these tests together. It must have been really rather tedious, but all the tests I've looked at in any detail have been good and sensible.

Will move onto Level 2 Core in due course, but got a hankering to wrestle some more of Arabica's XSLT engine to the floor.



Sunday 24 October, 2010
#Arabica Release - 2010 November

For no particular reason other than that people like official releases and there hasn't been one for a very long time, I've cut a new Arabica release. I'm not entirely sure why I've labelled it 2010-November when it's clearly still October. There is no major new feature, just the gentle accumulation of more work on Arabica's XSLT processor along with sundry bug fixes.

Source tar.bz2
http://sourceforge.net/projects/arabica/files/arabica/November-10/arabica-2010-November.tar.bz2/download

Source tar.gz
http://sourceforge.net/projects/arabica/files/arabica/November-10/arabica-2010-November.tar.gz/download

Source zip
http://sourceforge.net/projects/arabica/files/arabica/November-10/arabica-2010-November.zip/download


Changes and Bug Fixes

SAX

  • Exceptions thrown by MSXML are usefully reported, and no longer corrupt the stack
  • updated for most recent Xerces release

DOM

  • Corrected set/get/removeNamedItemNS functions
  • splitText fixed
  • fixed setAttributeNodeNS
  • double delete when removing and re-adding an attribute fixed
  • operator<< extended for wide streams
  • operator<< correctly generates automatic namespace prefixes for attributes

XPath

  • Some optimisations in the expression evaluation
  • variables may now, optionally, be resolved at compile time

XSLT

  • xsl:key and key() implemented
  • cdata-section-elements supported
  • literal result element (aka embedded stylesheets) implemented
  • minor speed optimisations
  • xsl:sort/@lang is still not supported, but now issues a warning rather than throwing an exception
  • function-available implemented
  • element-available stub implementation
  • xsl:sort attributes correctly implemented as attribute value templates
  • allow and ignore attributes in foreign namespaces
  • verify the qualified names used in the stylesheet (eg. as template names) have prefixes which are bound
  • take precedence into account when resolving named templates
  • disallow variables in xsl:key match and use expressions

Build and installation

  • Solution and project files for Visual Studio 7 (2003) and 8 (2005) are no longer provided. A script to generate them from the VS9 files is provided. The results are not guaranteed, but it has worked fine when used previously.

Other bits and bobs

  • Builds without warnings
  • xgrep example application now also outputs non-nodeset results


I never did write the release notes for the previous release, back in March 2009. For completeness' sake, they are

XSLT

  • generate-id implemented
  • detect circular imports and includes
  • escape tabs, carriage returns and line feeds when outputting attribute values

Other bits and bobs

  • Improved URI parsing



Thursday 05 November, 2009
#Arabica source code repository

Entirely through my own stupidity, I managed to corrupt the Arabica subversion repository. By sheer good luck, I'd been using Bazaar as my front-end client, and so had a clone of the entire repository sitting in my working directory. Accordingly, the Arabica source code is now housed in a Bazaar repository.

The repository can be browsed and you can grab your own working copy over HTTP using

  bzr branch http://jezuk.dnsalias.net/arabica-bzr/trunk
Write-access using bzr+ssh is available on request.



Saturday 01 August, 2009
#Development snapshots

Arabica code as at 13:00 on the 1st of August :



Friday 13 March, 2009
#Arabica March 2009 Release

Just uploaded to Sourceforge. Proper release notes to follow but main difference is a big performance improvement in Taggle parsing and further work on Arabica's XSLT engine.



Friday 27 February, 2009
#

Just wrote quite a long piece about what's been going on in Arabica over the past four months then, like a burk, killed Firefox. Hurrr.

What I'd said, in a rather long winded and rambling way, was that import precedence now works correctly for all cases, rather than just being mainly implemented for the common cases, a couple of nagging little bits got sorted out, and over the past few weeks I've implemented xsl:key and key(). As many times before, James Clark's concise and subtle spec text has been a pleasure to work with, and I've surprised myself with how easily I've been able to implement a feature. I've been working with this code for a long time now, but it really is holding up.

No comfort, no doubt, to learn that it's rightly spelled "berk", being short for "Berkeley Hunt", which rhymes with ...
johnwcowan, 1st Mar 2009
John, thank you so, so much. That is marvellous.
jez, 1st Mar 2009


Monday 20 October, 2008
#FAQ: When will Arabica's XSLT library be finished?

To tell the truth, I have no idea. Development of Mangle, Arabica's XSLT engine, is ongoing, although progress varies according to the vagaries of how busy I am, how energetic I'm feeling, whether the kids have a swimming gala, and so on and so forth.

Although it's not done yet, it might well be done enough. I'm using the OASIS XSLT test suite to help drive development, and so it also provides a measure of how much has been done, what's working and what isn't. The results are published here, but all the code and test data is included in the download. The executive summary is the core stuff that you use every day works, but some of the bits round the edges (edges defined by my experience, anyway) are missing.

To my knowledge there's nothing that causes Mangle to crash, and anything that I haven't yet implemented generates a warning when the stylesheet is compiled.

Give it a go. It might do what you need.



Sunday 19 October, 2008
#FAQ: What are all those failing tests, and why are they ignored?

If you run the tests, the final testsuite exercises the XSLT engine and it will list a number of failures. Quite a large number. XSLT development is ongoing, and I'm using the OASIS XSLT test suite to guide that. Consequently, the tests that fail generally indicate something I haven't done yet, rather than an actual bug. The XSLT tests are, therefore, ignored by make check (should you be lucky enough to be working on a Unixy platform).

Failures in any other tests are, however, indicative of a problem that needs investigating.



Saturday 18 October, 2008
#Arabica October 2008 Release

The "Probably long overdue release" bringing a big chunk of new functionality.

Source tar.bz2
http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.bz2

Source tar.gz
http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.gz

Source zip
http://downloads.sourceforge.net/arabica/arabica-2008-october.zip

Exciting New Stuff

The exciting new stuff is Taggle, a port of John Cowan's rather super TagSoup package.

TagSoup, if you're not familiar with it, is

a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
Obviously, if you have a SAX parser you can apply all your standard XML techniques - not only SAX filters, but building a DOM, applying XPaths, or XSLT transformations as well.

Cowan describes what TagSoup does as

TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
Looks straightforward, doesn't it? Well, that's a simple example and it's still a tricky and awkward result in practice. Cowan's patience in pursuing this and what looks like a rather elegant solution is to be applauded. Porting his code to C++ was quick and painless, and Taggle is a useful addition to Arabica. Thanks, John.

Arabica Taggle chews through HTML, providing the same SAX XMLReader interface as the XML parser, and can be used in exactly the same way. HTML source can be fed through SAX filter stacks, used to build DOM trees, queried with XPath, or transformed using XSLT.
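
To give a flavour of the overlapping-tag restart described in the quote above, here is a minimal sketch. It is not TagSoup's or Taggle's actual code: the Token type and the restart_overlaps function are made up for illustration, the input is assumed to be already tokenised with element names lower-cased, and none of the HTML-specific knowledge (implied HTML, BODY, UL and so on) is attempted.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token type for this sketch. Element names are assumed to
// have been lower-cased by whatever produced the tokens.
struct Token {
  enum Kind { Open, Close, Text } kind;
  std::string value;  // element name for Open/Close, character data for Text
};

// Re-serialise a token stream so that every element is properly nested,
// restarting any element that was still open when an outer one was closed.
std::string restart_overlaps(const std::vector<Token>& tokens) {
  std::vector<std::string> open;  // stack of currently open elements
  std::string out;

  for (const Token& t : tokens) {
    if (t.kind == Token::Text) {
      out += t.value;
    } else if (t.kind == Token::Open) {
      open.push_back(t.value);
      out += "<" + t.value + ">";
    } else {
      // A close tag with no matching open element is simply dropped.
      if (std::find(open.begin(), open.end(), t.value) == open.end())
        continue;

      // Close everything opened after the matching element, remembering
      // what was closed so it can be restarted afterwards.
      std::vector<std::string> to_restart;
      while (open.back() != t.value) {
        assert(!open.empty());
        out += "</" + open.back() + ">";
        to_restart.push_back(open.back());
        open.pop_back();
      }
      out += "</" + t.value + ">";
      open.pop_back();

      // Restart the overlapped elements, outermost first.
      for (auto r = to_restart.rbegin(); r != to_restart.rend(); ++r) {
        open.push_back(*r);
        out += "<" + *r + ">";
      }
    }
  }

  // Close whatever is still open at the end of the input.
  while (!open.empty()) {
    out += "</" + open.back() + ">";
    open.pop_back();
  }
  return out;
}
```

Fed the tokens for the bold/italic example, this produces the properly nested form shown in the quote, minus the trailing full stop.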


Changes and Bug Fixes

There are, of course, many other fixes and changes. Most are relatively minor, and if you haven't been bitten by them you won't notice. The most significant changes are in Arabica's XSLT engine, Mangle. While still not feature complete and under development, it takes, in this release, a fairly big step forward.

SAX

  • Fixed AttributesImpl.getIndex. Thanks to Isak Johnsson for that, and what on earth was I thinking to me
  • Return attribute type as "CDATA" not the empty string
  • After all this time, realised I had too many template parameters on XMLReaderInterface. It only needs the string_type and string_adaptor. Any additional parameters are only of interest to the implementing parser class

DOM

  • Output DocumentFragment properly
  • Output <elem/> for empty elements
  • Slipped a TextCoalescer filter into the DOM builder, so that consecutive bits of text get applied to a single Text or CDATA node, rather than as a series of nodes. (A series of nodes is perfectly legal, it's just slightly unexpected. Even to me, and I work with DOMs a lot :)
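
The idea behind the text coalescing can be sketched in a few lines. This is an illustration only, not the real TextCoalescer filter: the Event type and coalesce_text function are hypothetical, and the real filter sits in a SAX filter chain rather than working on a vector of events.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, much-simplified stand-in for a stream of SAX events.
struct Event {
  enum Kind { StartElement, EndElement, Characters } kind;
  std::string data;  // element name, or the character data itself
};

// Merge each run of consecutive Characters events into a single event, so
// that a builder downstream creates one text node rather than several.
std::vector<Event> coalesce_text(const std::vector<Event>& events) {
  std::vector<Event> out;
  for (const Event& e : events) {
    const bool extends_run = e.kind == Event::Characters &&
                             !out.empty() &&
                             out.back().kind == Event::Characters;
    if (extends_run)
      out.back().data += e.data;  // append to the run in progress
    else
      out.push_back(e);           // anything else ends the run
  }
  return out;
}
```

A parser is free to deliver "Hello, world" as several chunks; after coalescing, the builder sees it as one.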

XPath

  • Some time ago, it was gently suggested to me that XPathValuePtr and XPathExpressionPtr both exposed implementation details and provided an interface that was inconsistent with the DOM classes, because you accessed the member functions via -> rather than . At the time, I was just pleased to have got the XPath stuff done and wasn't really fussed, so I left it. Since then though, it's niggled and niggled away at the back of my mind and now I've done something about it. XPathValuePtr has become XPathValue and XPathExpressionPtr has become XPathExpression, with the member functions accessed through the . operator. The XPathValuePtr and XPathExpressionPtr names and -> member access are retained for the meantime, so that existing code won't be broken. Existing code using XPathValuePtr will still work, but new stuff should use XPathValue
  • Correctly implemented Namespace Nodes. The XPath data model requires that namespace nodes are associated with an element, and sort ahead of attribute nodes in document order. Until now, Arabica's namespace node had no parent, or owner document and so was failing these requirements
  • The default namespace is included when constructing namespace nodes
  • Amazingly, the XPath prefix:* didn't compile. I had no test for it, and had overlooked it. Now I do, and it does
  • Unbound namespace prefixes throw an exception
  • Corrected text() test to match CDATA nodes as well as text nodes
  • XPaths are now evaluated as if the DOM had been normalised, even if it hasn't. That is, consecutive text nodes are treated as a single node

XSLT

  • Params are not passed on through an xsl:apply-imports call
  • Template names are now QNames
  • Template mode is now QName
  • In XPath node() matches any node of any type. In an XSLT match pattern, node() matches everything except attributes and the document root node. Fixed.
  • Fixed variable scoping in xsl:for-each, xsl:if, and xsl:choose
  • Escape naughty text when outputting processing instructions and comments (eg ---)
  • Use std::stable_sort instead of std::sort. When xsl:sort specifies a numerical sort, but you've got some string data in there we need to maintain the relative positions of that string data. This is the first time I can recall actually using std::stable_sort. I will mark it down in my big book of programming accomplishments.
  • Fixed local-name for namespace nodes
  • xsl:message can contain another xsl:message - now handled properly
  • Empty comments output correctly
  • Ensure xsl:choose has at least one xsl:when
  • Make sure any xsl:template mode attribute is not empty
  • Verify xsl:sort attribute values
  • xsl:call-template now throws if it can't find a matching template
  • Duplicate variable and parameter names are rejected
  • Disallowed current() in match patterns
  • Verify xsl:for-each selects a node-set
  • Disallow pcdata ahead of an xsl:param
  • xsl:stylesheet now allows top-level elements when they are in a foreign namespace
  • Implemented position(), last() and positional predicates in match patterns
  • Throw error if transform is run with no input
  • Verify QNames at transform compile time
  • Detect circular variable references
  • Reject variables and parameters which have both a select attribute and text content
  • Top level variables and parameters handled according to import precedence
  • Fixed internal QName resolution - unprefixed names are not in the default namespace
  • Fixed xsl:element unprefixed names - when no namespace URI is supplied they are in the default namespace
  • Don't suppress output of element namespace prefixes or attributes which are in the XSL namespace
  • ensure @xmlns|@xmlns:* selects no nodes
  • direct information messages to std::cerr, not std::cout

Build and installation

  • Fix for problem installing headers on FreeBSD, where install doesn't understand -D
  • Changes to help out-of-tree builds
  • Added build files for Visual Studio 2008
  • Added configure tests for std::mbstate_t and/or mbstate_t. Some platforms don't have it (VxWorks, for example)
  • Visual Studio 2005 and 2003 project files are now munged from the Visual Studio 2008 files. (Don't try this at home, folks)

Other bits and bobs

  • Fix for base URIs with leading ../
  • Convert \ to / for relative paths as well as absolute Windows paths.



Friday 17 October, 2008
#Arabica: Cutting October 2008 release

A couple of months ago a release was, I said, impending. And it really was, but then I found a niggly thing I really wanted to fix. And went on holiday. And got really busy at work. And all that other stuff that happens when you're not programming.

There really is a release coming now, because I'm cutting it now. The source bundles will probably be up on Sourceforge this evening, and tagged in subversion. Release notes should follow later this weekend or early next week. I'll write up the niggly thing too, because it's quite a nice one.

The last release was just over a year ago. That's probably a bit too long.

Is Taggle part of this release? If so, I'll propagate this to the TagSoup community.
John Cowan [e] [w], 18th Oct 2008
Hi John, Taggle is indeed part of this release.
jez, 20th Oct 2008


Wednesday 06 August, 2008
#Arabica: impending release

Now my latest gentle stroll has concluded, there are one or two platform specific build issues to resolve. With them done, I expect to be dropping a new release around the end of August or start of September. The release will include the Taggle HTML parser and improved XSLT support, along with various little bug fixes and minor build improvements.

If you can't wait, there's always the subversion repository.

#XSLT: Variable resolution

After a bit of a break, I've spent time hacking on Arabica again, which has been lovely. It's really rather relaxing to just nurdle around in your own code, without any particular pressure or need. My normal way of working on Arabica's XSLT processor is to run some of the test suite, pick a failing case, and fix it. If I can get a few more tests passing in half an hour or an hour, and I generally can, then that's a little step further along.

In this latest little bit of activity, I've been focussing on variables and variable resolution. I've fixed various problems with circular references, scoping, namespace resolution, and what I thought was going to be a thorny problem with import precedence.

What constantly surprises me is how straightforward most of these problems are, requiring only a few lines of code. In fact this has been the story of Arabica's XSLT development. Once the initial development push was done, almost all the rest has been a few lines here, a few lines there. I've been working away on this now for coming up three years, on and off and with digressions, and have no idea when I'll be done, but that doesn't bother me at all. It's like an old pair of slippers, or favourite woolly jumper. It's a comfortable, gentle thing to slip into and go for a stroll in every now and again.



Wednesday 28 May, 2008
#Visual Studio, how I curse your useless warning C4800

'type' : forcing value to bool 'true' or 'false' (performance warning)
Performance warning, my arse.



Friday 18 April, 2008
#XSLT: Implementing position matches[2]

Revisiting position matches at the moment. I've described how position matches need to be written, and the code works and works well.

Except when it doesn't. It actually fails for the less common cases, and it took me a little while to work out why.

Here's the pattern from the test case that showed the problem

foo[@att1='c'][2]
It wants to match the second node in the set of foo elements with an att1 attribute containing 'c'.

Arabica rewriting finds the positional predicate and applies its incorrect magic. The rewritten pattern is equivalent to

foo[2][@att1='c']
which picks out the second foo node if it has an att1 attribute containing 'c'. The difference isn't immediately clear, even when you have both the incorrect output and the expected output sitting in front of you.

My small crumb of comfort is that if you do want foo[2][@att1='c'], Arabica does do the right thing. That gave me the clue. Arabica implements XSLT match patterns by rewriting them as XPath expressions.

foo[2][@att1='c']
is rewritten as an XPath along the lines of
self::foo[. = parent::*/foo[2]][@att1='c']

My faulty rewriting of

foo[@att1='c'][2]
was
self::foo[@att1='c'][. = parent::*/foo[2]]
which you should be able to see is logically identical to the above. What I need is
self::foo[. = parent::*/foo[@att1='c'][2]]
I had to work quite hard to see that this is what it should be, despite being pretty familiar with XPath and XSLT use and implementation. It's only been part of my working toolkit for the last 8 years or so, after all. When rewriting a positional match, any preceding predicates must be folded into the rewritten expression. Now I see it, it's pretty obvious.
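
The folding rule can be sketched at the string level. This is only an illustration of the rule, under some loud assumptions: Arabica works on compiled expression objects rather than strings, and is_positional and rewrite_step are names invented here. The positional test is deliberately crude (a bare number, or any mention of position() or last()).

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Crude, string-level test for a positional predicate: a bare number, or
// anything mentioning position() or last(). Assumption for this sketch only.
bool is_positional(const std::string& predicate) {
  if (predicate.empty()) return false;
  bool all_digits = true;
  for (unsigned char c : predicate)
    if (!std::isdigit(c)) { all_digits = false; break; }
  return all_digits ||
         predicate.find("position()") != std::string::npos ||
         predicate.find("last()") != std::string::npos;
}

// Rewrite a single pattern step, such as foo[@att1='c'][2], as an XPath
// test on the context node. Predicates are passed without their brackets.
std::string rewrite_step(const std::string& name,
                         const std::vector<std::string>& predicates) {
  std::string result = "self::" + name;
  std::string seen;     // every predicate so far, in original order
  std::string pending;  // non-positional predicates not yet emitted

  for (const std::string& p : predicates) {
    const std::string bracketed = "[" + p + "]";
    seen += bracketed;
    if (is_positional(p)) {
      // Fold everything that preceded the positional predicate into the
      // rewritten expression, so the position counts the filtered nodes.
      result += "[. = parent::*/" + name + seen + "]";
      pending.clear();
    } else {
      pending += bracketed;
    }
  }
  return result + pending;  // trailing plain predicates apply as written
}
```

With this, rewrite_step("foo", {"@att1='c'", "2"}) gives self::foo[. = parent::*/foo[@att1='c'][2]], while swapping the two predicates gives self::foo[. = parent::*/foo[2]][@att1='c'], so the two patterns no longer collapse into the same thing.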

Failing tests now pass, which is lovely.

Here's the change.


jez, 18th Apr 2008


Friday 08 February, 2008
#Taggle: Parameterised on string_type

The Taggle parser in subversion is now parameterised on string_type and string_adaptor, in exactly the same way as the usual Arabica XMLReader class. The two are now equivalent, which means that all the SAX filters, the DOM builder, XPath, and so on can be applied to Taggle.



Thursday 31 January, 2008
#

Moments before I'm about to go to bed, I discover Taggle fails (in a coredumpy way) for documents which have a DOCTYPE declaration. Will have a look at a fix in the morning.

And there's the fix in subversion.
jez, 1st Feb 2008


#Taggle: Building the code

If you've grabbed the code from subversion:

svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
you might be wondering how to build it.

For Visual Studio 2005 users, open up the vs8\taggle.sln project and build away. It should just work. If it doesn't, then check the project build notes for information on setting up search paths and things.

For Unixy types, you will need a mighty three steps:

  1. autoreconf - to create the configure script
  2. ./configure - to dig out where the various bits and pieces Arabica needs are, and to create the Makefiles
  3. make - to, erm, make everything

Problems, questions, issues? Get in touch.



Wednesday 30 January, 2008
#Taggle: And there it is ...

Taggle, Arabica's port of the TagSoup HTML parser, now builds and runs. It dodges pretty much every encoding issue on the planet, but as a first go it's really quite pleasing. Give it this -

This is <B>bold, <I>bold italic, </b>italic, </i> normal text

and get this

<html>
    <body>This is
        <b>bold,
            <i>bold italic, </i>
        </b>
    <i>italic, </i>
normal text
    </body>
</html>
(Ok, you have to squint a bit at the indenting, but that's a separate issue.)

If you want to have a play, check out the tagsoup-port branch from subversion:

svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port

In examples/Taggle, there's a little command line application that reads HTML documents and prints the corrected markup to the console.

I'll merge this back into the trunk in the next few days.

Why not implement an HTML5 parser instead of porting TagSoup?
zcorpan [e], 1st Feb 2008

Time and inclination. Porting TagSoup to C++ took me a few hours. It was fun, and quite an easy win. Having done it, I'm surprised that nobody's done it before.

Writing an HTML5 parser needs rather more time than I have - not only in writing the code, developing the test suite, but then tracking the standard as it emerges. Even if I had the time, I don't actually have the inclination, because it's not something that really interests me enough right now. Sorry :)


jez, 2nd Feb 2008
Thank you, this is precisely what I wanted.

I've been HTML coding since 3.2. Long after HTML8.0 has formally broken and obsoleted HTML5.3 and previous, tag-soup still works.


David Mullin [e] [w], 11th Feb 2012

It seems the SVN server is down. Would you please publish the updated download location?
Moritz [e], 9th Jul 2013


Tuesday 29 January, 2008
#Taggle: Bringing HTML Parsing to Arabica

After a rather intense return to work after Christmas, I'm taking a bit of a break from Arabica's XSLT development for something a bit lighter - porting John Cowan's excellent TagSoup package to C++ and Arabica. TagSoup, if you're not familiar with it, is

a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.

Obviously, if you have a SAX parser you can apply all your standard XML techniques - not only SAX filters, but building a DOM, applying XPaths, or XSLT transformations as well.

Cowan describes what TagSoup does as

TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

Looks simple, doesn't it? Well, that's a simple example and it's still a tricky and awkward result in practice. Cowan's patience in pursuing this and what looks like a rather elegant solution is to be applauded. Porting his code to C++ has been pretty quick and painless so far, and I expect the new piece, which I've called Taggle, to be finished pretty soon. Arabica will be stronger for it - thanks John!



Friday 21 December, 2007
#XSLT: Implementing position matches

Earlier this week, I outlined the equivalent XPath expressions for XSLT matches which use positions. I've been avoiding implementing this for a while for a couple of reasons. Firstly, I thought it would take more time than I generally have in one Arabica sitting, i.e. more than an hour. Secondly, I wasn't quite sure how I'd actually go about it. I knew the equivalent expressions were correct, I just didn't know how I was going to arrive at them in the code.

Happily, writing it out and then reading it back again later triggered the little flash I needed, and it turns out to be really quite easy. I've spent a bit of time on it this morning, having wrapped up the paying work for Christmas, and I've got the first pass working.

Within Arabica, each match pattern is represented as one or more steps, where each step in the pattern is represented as a TestStepExpression. A TestStepExpression contains some kind of node test (obviously) and one or more predicates. The predicate is the bit in square brackets. A match pattern like this

a[3]
would be compiled into a TestStepExpression containing a NameNodeTest checking for 'a', and a single predicate '3'.

Finding expressions that need rewriting is simply

foreach step in the steplist
  foreach predicate in the step predicatelist
    if predicate is type NUMBER
      predicate = rewrite(predicate)
The code is here, if you're interested.

This kind of transformation could be done earlier on, on the AST produced by Arabica's XPath parser. It's easier, however, to operate on the compiled version. For instance, in the case above the numeric predicate could be a number literal, the result of a function, or the result of an arbitrarily complex calculation. Detecting all those cases is actually quite tricky at the AST level. Once the compiled objects are generated, we can just check the predicate's return type.

Finding predicates containing calls to the last() or position() functions is going to be slightly more work, and probably slightly fiddlier. I'm off to have a crack at that now.


[An hour or so later] ... and I think that's it. Each predicate is an expression, and an expression may contain other expressions and so on. An expression might model a comparison operator, for instance, which would contain an expression each for the left and right hand operands. I put together a little walker to zoom through this expression tree looking for instances of the last or position functions. If it finds one, then I just generate the rewrite expression in exactly the same way as before.
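
The two checks, return type and walker, might look something like the sketch below. The Expr and Step types are invented for illustration and bear no relation to Arabica's real TestStepExpression classes. The walker is also naive: it doesn't distinguish a last() that belongs to a nested path inside the predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical compiled-expression node, for illustration only.
struct Expr {
  enum Type { Number, Boolean, String, NodeSet };
  Type type;                   // the expression's return type
  std::string function;        // function name, empty if not a function call
  std::vector<Expr> children;  // operands or arguments
};

// Hypothetical pattern step: a name test plus its predicates.
struct Step {
  std::string name;
  std::vector<Expr> predicates;
};

// Walk the expression tree looking for a call to position() or last().
bool uses_position(const Expr& e) {
  if (e.function == "position" || e.function == "last")
    return true;
  for (const Expr& child : e.children)
    if (uses_position(child))
      return true;
  return false;
}

// A predicate needs rewriting if it evaluates to a number, as in a[3],
// or if position() or last() appears anywhere inside it.
bool needs_rewrite(const Expr& predicate) {
  return predicate.type == Expr::Number || uses_position(predicate);
}

// The foreach/foreach loop from above, counting rather than rewriting.
std::size_t count_predicates_to_rewrite(const std::vector<Step>& steps) {
  std::size_t count = 0;
  for (const Step& step : steps)
    for (const Expr& predicate : step.predicates)
      if (needs_rewrite(predicate))
        ++count;
  return count;
}
```

Checking the return type catches a[3] however the 3 was arrived at, and the walker catches things like a[position() = 1], where the predicate as a whole is a boolean.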



Monday 17 December, 2007
#XSLT: Position matches

Almost exactly two years ago (glurk!), I wrote a little item about XSLT match patterns and gave a clue to how I implement them. A pattern like

chapter/para
is equivalent to an XPath expression like
boolean(self::para/parent::chapter)
That's how Arabica actually implements it, rewriting the match pattern as an XPath expression during compilation.

This works in every case, except those involving positional matches,

chapter/para[1]
for instance. In the simple case above, the rewriting is actually quite simple. You can push each step, here chapter then para, onto a LIFO stack and then pull them off again adding the extra bit of wrapping as you go. For positional matches, the rewriting is a little more involved. A match like
chapter/para[last()]
needs to be rewritten as something along the lines of
self::para[. = parent::*/para[last()]]/parent::chapter
although you could probably simplify it to
self::para[. = parent::chapter/para[last()]]

Arabica::XSLT doesn't do this rewriting at the moment, which is, I feel, the largest single hole in it. I'm aiming to have a go at it in the next few days. If I can crack it relatively quickly, maybe by this time next year I'll be looking for a new project :)

Positional matches must, by the way, at least as far as I can see, be evaluated in this way (even if the underlying implementation is different). This is why matches like this can be quite expensive in terms of runtime, and are generally discouraged.


[Add a comment]

Monday 05 November, 2007
#XSLT: Test case results spreadsheet

Uploaded the XSLT test results spreadsheet to Google Docs and, should you be so motivated, the latest results will always be available here.
[Add a comment]

Sunday 04 November, 2007
#XSLT: Test case results latest

Pushed ahead on the output test cases in particular this weekend. Headline figures are now (changed values shown as previous -> current)
          Run   Failures     Errors  Skips
...
output    108   78 -> 23     0       1 -> 41
...
sort      37    7            0       10 -> 2
...
Total     1421  174 -> 111   0       93 -> 124

My spreadsheet, which you'll no doubt recall I'm excited about, says that the total tests run is 1297 with 111 failures, for a success rate of 91.4%. Go statistics!


[Add a comment]

Saturday 03 November, 2007
#XSLT: A personal programming milestone

While picking off some of the easy failures, I have just committed a fix to the xsl:sort implementation. The fix was a simple change from std::sort to std::stable_sort. If the stylesheet specifies a numeric sort but there's some non-numeric data in there, then the relative order of the non-numeric bits and bobs must be maintained. I'm not sure I can see this explicitly stated in the specification text, but that's how Saxon, MSXML, and Xalan work so I would be silly not to play along.
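To show why the stable version matters, here is a small self-contained sketch. It is not the Arabica code: the helper names are invented, and placing non-numeric values ahead of the numbers is an assumption made for the example. What it demonstrates is that values which compare equivalent keep their original relative order under std::stable_sort, which std::sort does not guarantee.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// True if the whole string parses as a (non-NaN) number; the value goes to out.
inline bool parse_number(const std::string& s, double& out) {
  if (s.empty()) return false;
  char* end = nullptr;
  out = std::strtod(s.c_str(), &end);
  return end == s.c_str() + s.size() && out == out;
}

// Numeric ordering. Assumption for this sketch: non-numeric values sort
// before all numbers, and any two non-numeric values compare equivalent.
inline bool numeric_less(const std::string& a, const std::string& b) {
  double x = 0, y = 0;
  const bool a_num = parse_number(a, x);
  const bool b_num = parse_number(b, y);
  if (!a_num) return b_num;
  if (!b_num) return false;
  return x < y;
}

// stable_sort keeps equivalent elements in their original relative order.
inline void numeric_sort(std::vector<std::string>& values) {
  std::stable_sort(values.begin(), values.end(), numeric_less);
}
```

Sorting 10, pear, 2, apple, 1 this way gives pear, apple, 1, 2, 10, with pear still ahead of apple as in the input.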

This may be the first time I have ever used std::stable_sort, so felt I should mark the occasion in my big book of programming accomplishments.


[Add a comment]

Monday 29 October, 2007
#XSLT: Test case update

A couple of months ago I published some results running Arabica against part of the OASIS XSLT conformance test suite. I've done a bit of work since then, and so it's time to update the numbers

                   Run   Failures  Errors  Skips
attribvaltemplate  12    0         0       1
axes               130   0         0       2
boolean            90    0         0       1
conditional        23    0         0       0
conflictres        35    0         0       1
copy               62    0         0       0
dflt               4     0         0       0
expression         6     0         0       6
extend             4     0         0       4
impincl            29    3         0       2
lre                22    11        0       0
match              32    14        0       1
math               107   1         0       0
mdocs              18    0         0       7
message            16    2         0       2
modes              17    0         0       0
namedtemplate      19    0         0       1
namespace          133   39        0       0
node               21    0         0       0
output             108   78        0       1
position           111   7         0       15
predicate          58    0         0       0
processorinfo      1     0         0       1
reluri             11    1         0       2
select             85    0         0       6
sort               37    7         0       10
string             133   4         0       8
variable           70    7         0       0
ver                5     0         0       4
whitespace         22    0         0       10
Total              1421  174       0       93

Since the last published results, I have one more skip and 20 fewer fails. My little spreadsheet (the first I have ever constructed, career fact fans) says I'm running 1328 tests altogether, with a pass rate of 86.9%.

A failure means the test ran, but did the wrong thing. An error means it threw an exception, didn't compile the XSLT, or something similarly unexpected. A skip means the test deliberately wasn't run because of some known deficiency in my code. It might be a feature I haven't implemented, the test is just plain wrong (there are a couple of these), the test is Xalan specific, or some other thing. Skips come in three flavours - don't bother at all, shouldn't compile, or shouldn't run. If a test that's not expected to compile does, or one that shouldn't run suddenly starts working, that's actually flagged as a failure. There aren't any tests doing this in these results.

Not every failure represents a unique bug. Similarly not every skip represents a unique deficiency. The biggest set of failed tests, the 78 output failures, I haven't investigated in depth but I suspect many of those are related to either HTML output (which I don't do) or text output (which the test harness can't currently compare).

These results are from current Subversion head, built on Windows XP using Visual Studio 8 and expat.


[Add a comment]

Friday 19 October, 2007
#

Been on a little code hiatus, by the way, because I've been writing an article about Arabica for Software Developer's Journal. Not sure when it's going to see print, as SDJ is published simultaneously in English and Polish so it needs translating, but it will appear here in a few months once copyright reverts.


[Add a comment]

#XPath: Interface changes

Some time ago, it was gently suggested to me that XPathValuePtr and XPathExpressionPtr both exposed an implementation detail, because they derive from boost::shared_ptr, and provided an interface that was inconsistent with the DOM classes, because you accessed the member functions via -> rather than through the . operator.

At the time, I was just pleased to have got the XPath stuff done and wasn't really fussed, so I left it. Since then though, it's niggled and niggled away at the back of my mind and now I've decided to do something about it.

XPathValuePtr will become XPathValue, with the member functions accessed through the . operator. The XPathValuePtr name and -> member access will be retained for the meantime, so that existing code won't be broken. XPathExpressionPtr will be similarly changed.
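The general shape of that kind of change can be sketched as follows. This is not the Arabica source: ValueImpl, Value and asString are illustrative names, and std::shared_ptr stands in for boost::shared_ptr. The handle keeps the shared pointer as a private member instead of deriving from it, forwards member functions so they are reached through the . operator, and keeps operator-> so older code continues to compile.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical implementation class; not Arabica's real interface.
class ValueImpl {
public:
  explicit ValueImpl(std::string s) : text_(std::move(s)) {}
  const std::string& asString() const { return text_; }
private:
  std::string text_;
};

// Value-like handle: the shared pointer is a private detail, and member
// functions are forwarded so callers write v.asString().
class Value {
public:
  explicit Value(std::string s)
      : impl_(std::make_shared<ValueImpl>(std::move(s))) {}
  const std::string& asString() const { return impl_->asString(); }
  // Retained for the transition, so existing v->asString() code still compiles.
  const ValueImpl* operator->() const { return impl_.get(); }
private:
  std::shared_ptr<const ValueImpl> impl_;
};
```

Copies of a Value share the same underlying implementation object, which is what the pointer-based classes did too.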

First commit went in this evening, now I've satisfied myself that the changes are pretty easy so long as I pay attention.


[Add a comment]

Tuesday 02 October, 2007
#Arabica October 2007 Release

Patch release, fixing a build problem with older versions of GCC.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Oct2007.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Oct2007.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Oct2007.zip


[Add a comment]

Wednesday 26 September, 2007
#Arabica September 2007 Release 2

This is a re-release of the September 2007 release which fixes a couple of build issues, affecting some platform/parser combinations.

The September 2007 release notes were:

The "certainly-break-your-build-but-it'll-be-easily-sorted-out" release.

This is the first Arabica release ever that knowingly breaks existing code, but the changes required are all straightforward and shouldn't take more than a few minutes to recover from.

The changes are

  • All Arabica header files now have a .hpp extension. Existing references to something.h will need to be updated (or mitigated by adding a forwarding header).
  • The SAX namespace has been moved within the Arabica namespace. References to SAX::something will need to be changed to Arabica::SAX::something, or mitigated by a using declaration.
  • The DOM namespace and associated namespaces, like SimpleDOM, have been moved within the Arabica namespace. References to DOM::something will need to be changed to Arabica::DOM::something, or mitigated by a using declaration.
  • SAX classes named basic_something have been renamed something. Related typedefs along the lines of typedef basic_something<string> something; have been removed. References to SAX::something will need to be changed to SAX::something<std::string>, or mitigated by adding your own typedef.
  • All SAX and DOM classes now take both string and string adaptor template parameters. This change should be transparent and require no changes.
  • Some header files in the Utils/ subdirectory have been moved:
    Utils/uri.hpp -> io/uri.hpp
    Utils/socket_stream.hpp -> io/socket_stream.hpp
    Utils/convert_adaptor.hpp -> io/convert_adaptor.hpp
    Utils/convertstream.hpp -> io/convertstream.hpp
    Utils/*codecvt.hpp -> convert/*codecvt.hpp
    Utils/normalize_whitespace.hpp -> text/normalize_whitespace.hpp
    XML/UnicodeCharacters.hpp -> text/UnicodeCharacters.hpp
    Utils/StringAdaptor.hpp -> Arabica/StringAdaptor.hpp
    DOM/Utils/Stream.hpp -> DOM/io/Stream.hpp
  • There are some namespace changes along with these physical changes. Any class in Arabica::Utils has been moved into Arabica::io or Arabica::convert.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007-2.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007-2.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007-2.zip


[Add a comment]

Wednesday 19 September, 2007
#Arabica September 2007 Release

The "certainly-break-your-build-but-it'll-be-easily-sorted-out" release.

This is the first Arabica release ever that knowingly breaks existing code, but the changes required are all straightforward and shouldn't take more than a few minutes to recover from.

The changes are

  • All Arabica header files now have a .hpp extension. Existing references to something.h will need to be updated (or mitigated by adding a forwarding header).
  • The SAX namespace has been moved within the Arabica namespace. References to SAX::something will need to be changed to Arabica::SAX::something, or mitigated by a using declaration.
  • The DOM namespace and associated namespaces, like SimpleDOM, have been moved within the Arabica namespace. References to DOM::something will need to be changed to Arabica::DOM::something, or mitigated by a using declaration.
  • SAX classes named basic_something have been renamed something. Related typedefs along the lines of typedef basic_something<string> something; have been removed. References to SAX::something will need to be changed to SAX::something<std::string>, or mitigated by adding your own typedef.
  • All SAX and DOM classes now take both string and string adaptor template parameters. This change should be transparent and require no changes.
  • Some header files in the Utils/ subdirectory have been moved:
      Utils/uri.hpp -> io/uri.hpp
      Utils/socket_stream.hpp -> io/socket_stream.hpp
      Utils/convert_adaptor.hpp -> io/convert_adaptor.hpp
      Utils/convertstream.hpp -> io/convertstream.hpp
      Utils/*codecvt.hpp -> convert/*codecvt.hpp
      Utils/normalize_whitespace.hpp -> text/normalize_whitespace.hpp
      XML/UnicodeCharacters.hpp -> text/UnicodeCharacters.hpp
      Utils/StringAdaptor.hpp -> Arabica/StringAdaptor.hpp
      DOM/Utils/Stream.hpp -> DOM/io/Stream.hpp
  • There are some namespace changes along with these physical changes. Any class in Arabica::Utils has been moved into Arabica::io or Arabica::convert.
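
For what it's worth, the two mitigations mentioned in the list (bringing the new namespace into scope, and adding your own typedef) might look something like this. The Something template is only a stand-in declared here so the sketch compiles on its own, and LegacySomething is an invented name; real code would include the relevant Arabica header instead.

```cpp
#include <string>

// Stand-in declaration so the sketch compiles by itself. In real code this
// would come from an Arabica header; "Something" is a placeholder name.
namespace Arabica { namespace SAX {
  template<class string_type> class Something { };
} }

// Mitigation for the namespace move: with this in place, existing
// references written as SAX::Something<...> still resolve.
using namespace Arabica;

// Mitigation for the removed typedefs: supply your own.
typedef SAX::Something<std::string> LegacySomething;
```

The alternative is simply to spell out Arabica::SAX::Something<std::string> wherever the old name was used.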

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Sept2007.zip


[Add a comment]

Saturday 15 September, 2007
#Arabica: New release any time now

Worked through my list of things to do rather more quickly than I expected. Look for a new release in the next few days.


[Add a comment]

Wednesday 05 September, 2007
#Updated the August release build report.
[Add a comment]

#Arabica: What's going on next? Progress

  1. All Arabica header files will have .hpp rather than .h extensions.
  2. Anything in a directory called Util is liable to be renamed to something more descriptive.
  3. The SAX and DOM namespaces will be moved into the Arabica namespace.
  4. Those template classes which are only parameterised on string type will be extended to take a string adaptor class as well.
  5. Template classes named basic_Something will be renamed Something. This primarily affects SAX classes. (Stream classes excepted here, for consistency with Standard Library.)
  6. Typedef of templates on particular string classes will be removed. Again, this primarily affects SAX classes. (Again, stream classes excepted.)
  7. (Anything else along these lines that occurs to me.)


[Add a comment]

#Arabica: What's going on next?

It's not that often I have definite plans for Arabica, but I have one at the moment. The next Arabica release will be "tidying up".

Because Arabica's grown rather organically over the past several years, it's not especially consistent about all kinds of things. Some of those things aren't especially important, but some are and some are just niggling me a bit.

In relatively short order, I intend to do the following

  1. All Arabica header files will have .hpp rather than .h extensions. (Already committed - did it last night while watching the telly).
  2. Anything in a directory called Util is liable to be renamed to something more descriptive.
  3. The SAX and DOM namespaces will be moved into the Arabica namespace.
  4. Those template classes which are only parameterised on string type will be extended to take a string adaptor class as well.
  5. Template classes named basic_Something will be renamed Something. This primarily affects SAX classes.
  6. Typedef of templates on particular string classes will be removed. Again, this primarily affects SAX classes.
  7. (Anything else along these lines that occurs to me.)

This is the first time I've knowingly made changes to Arabica which break existing code. Hopefully these changes should be relatively easy to deal with.

  • Items 1 and 2 require a simple source change or tactical introduction of forwarding headers.
  • Item 3 can be mitigated with a using namespace declaration.
  • Item 4 should actually require no change in client code at all and, unless you're doing something really wacky that this change is intended to prevent, you shouldn't notice.
  • Items 5 and 6 require straightforward source changes, or the introduction of your own typedefs.

There will be no changes to any functionality or the addition of any new code until this is done. I don't want it to drag on, so I'm aiming to get this done in the next few weeks (work/children/dog/other commitments allowing).

After that, it's back on XSLT, which will also bring in some DOM Level 3 additions. After that, whenever that is, dunno. We'll see.


[Add a comment]

Tuesday 04 September, 2007
#

The end of August is clearly the big time for C++/XML related fun -

  • Apache have released version 2.8.0 of Xerces-C++
  • Following that, version 1.1.0 of XQilla is now available. XQilla is an XQuery and XPath 2.0 library built on Xerces.
  • Coincident with that is the release of another XQuery engine, which I didn't previously know about, called GCX from Saarland University Database Group. It sounds a bit researchy - the first streaming XQuery engine which implements active garbage collection, a novel buffer management strategy in which both static and dynamic analysis are exploited - but there it is.


[Add a comment]

Friday 31 August, 2007
#Arabica August 2007 Release

Here's the latest in what's becoming the traditional August Arabica release. It packages a number of incremental improvements, together with a major chunk of new code.

  • Code
    • This release includes the first drop of Mangle, the Arabica XSLT engine. It covers most common cases (at least in my experience), although there are omissions and misfeatures, and it shouldn't be considered production safe. Development is tracked against the OASIS XSLT Conformance test cases.
    • There are a number of new SAX filters for whitespace stripping, tracking namespace declarations, tracking xml:base, and buffering multiple character(...) callbacks into a single callback.
  • Build
    • Further improvements to the Autotools build. The test cases can now be built and run using 'make check'. Wide string detection has been further tweaked, as has finding libxml2. Thanks to Bob Wilkinson for that.
    • Solution and project files for Visual Studio 2005 are now included.
    • Visual Studio builds now produce distinct debug and release versions of the library. Thanks to Timo Geusch and David Grigsby who separately suggested that.

This release has been built on a variety of platforms. Additional build reports are very welcome, particularly from non-i386 platforms and/or non-GCC compilers.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Aug2007.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Aug2007.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Aug2007.zip


[Add a comment]

#Arabica August 2007 Build Report

The Arabica August 2007 release has been built on the following platforms, with Expat 2.0.0 and Boost 1.33.1:

  • FreeBSD 6.1, i386-unknown-freebsd6.1, using GCC 3.4.4
  • DragonflyBSD 1.6.0, i386-unknown-dragonfly1.6.0, using GCC 3.4.5 (no Boost)
  • Cygwin on XP Professional, i686-pc-cygwin, using GCC 4.1.0
  • Mingw MSys on XP Professional, i686-pc-mingw32, using GCC 3.4.2
  • Ubuntu Linux 6.1.0, i686-pc-linux-gnu, using GCC 4.0.3
  • Intel Mac Tiger, i686-apple-darwin8, using GCC 4.0.1 and boost HEAD. (Thanks Alex Ott)
  • Windows XP Professional using Visual Studio 7.1/2003, with both Expat and MSXML
  • Windows XP Professional using Visual Studio 2005, with both Expat and MSXML

Additional build reports are very welcome, particularly on non-i386 platforms and/or non-GCC compilers.


[Add a comment]

Tuesday 28 August, 2007
#XSLT: Test case state of play

My ongoing XSLT development uses the OASIS XSLT Conformance test cases. I found creating meaningful XPath tests hard enough, but XSLT is a magnitude or two above that and I just didn't fancy it. There's also the danger, of course, that you construct the test case according to your reading of the spec, rather than to the meaning of the spec. Whichever way round, I couldn't have done without them.

The test cases are in two large chunks, one provided by Xalan/IBM/Lotus, the other by Microsoft. Those chunks are then further subdivided. I'm currently running the Xalan tests, for no particular reason other than they're listed first in the catalogue. The results, using this morning's Subversion head, are as follows:
attribvaltemplate Run: 12 Failures: 0 Errors: 0 Skips: 1
axes Run: 130 Failures: 0 Errors: 0 Skips: 2
boolean Run: 90 Failures: 0 Errors: 0 Skips: 1
conditional Run: 23 Failures: 0 Errors: 0 Skips: 0
conflictres Run: 35 Failures: 0 Errors: 0 Skips: 1
copy Run: 62 Failures: 0 Errors: 0 Skips: 8
dflt Run: 4 Failures: 0 Errors: 0 Skips: 0
expression Run: 6 Failures: 0 Errors: 0 Skips: 6
extend Run: 4 Failures: 0 Errors: 0 Skips: 4
impincl Run: 29 Failures: 6 Errors: 0 Skips: 2
lre Run: 22 Failures: 11 Errors: 0 Skips: 0
match Run: 32 Failures: 14 Errors: 0 Skips: 1
math Run: 107 Failures: 1 Errors: 0 Skips: 0
mdocs Run: 18 Failures: 0 Errors: 0 Skips: 7
message Run: 16 Failures: 2 Errors: 0 Skips: 2
modes Run: 17 Failures: 1 Errors: 0 Skips: 0
namedtemplate Run: 19 Failures: 2 Errors: 0 Skips: 1
namespace Run: 133 Failures: 39 Errors: 0 Skips: 0
node Run: 21 Failures: 2 Errors: 0 Skips: 0
output Run: 108 Failures: 78 Errors: 0 Skips: 1
position Run: 111 Failures: 7 Errors: 0 Skips: 15
predicate Run: 58 Failures: 0 Errors: 0 Skips: 0
processorinfo Run: 1 Failures: 1 Errors: 0 Skips: 0
reluri Run: 11 Failures: 9 Errors: 0 Skips: 2
select Run: 85 Failures: 2 Errors: 0 Skips: 6
sort Run: 37 Failures: 7 Errors: 0 Skips: 10
string Run: 133 Failures: 4 Errors: 0 Skips: 8
variable Run: 70 Failures: 7 Errors: 0 Skips: 0
ver Run: 5 Failures: 0 Errors: 0 Skips: 4
whitespace Run: 22 Failures: 1 Errors: 0 Skips: 10
Total Run: 1421 Failures: 194 Errors: 0 Skips: 92

In total, 1329 tests run with 194 failures, for a success rate of 85.4% (or fail rate of 14.6% if you prefer).
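
For anyone wanting to check the sums: 1421 tests in the catalogue less 92 skips leaves 1329 actually run, and 1329 less 194 failures leaves 1135 passes. The little function below (pass_rate is just a name picked for this sketch) does the same calculation.

```cpp
// Percentage of executed tests that passed, where executed = total - skips.
// Returns 0 when nothing was executed, to avoid dividing by zero.
inline double pass_rate(int total, int skips, int failures, int errors) {
  const int executed = total - skips;
  if (executed <= 0) return 0.0;
  return 100.0 * (executed - failures - errors) / executed;
}
```

With the figures above, pass_rate(1421, 92, 194, 0) comes out at roughly 85.4.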

A fail means the output Arabica::XSLT generated didn't match the expected output. An error means Arabica::XSLT threw an exception somewhere, didn't compile the stylesheet, or didn't produce any output for some other reason. A skip means that the test wasn't run for some reason - either it was marked not compilable, as compilable but not executable, or just not even touched. A test that's expected not to compile but does, or not to run but does is actually flagged as a fail. (It makes sense if you think about it :)).

The majority of the skips are because of unimplemented XSLT elements or functions (xsl:number, xsl:strip-space, xsl:preserve-space, xsl:key, key(), id()), or because the test output is HTML which Arabica::XSLT doesn't support. The biggest set of fails is the output tests (78 out of 108 tests fail), mainly because the output is text rather than XML. The test harness can't actually handle text output cases yet, so tests which generate text output will always appear as failures.

I haven't added these numbers up for a while. I've got to say I'm pleased.


[Add a comment]

Friday 10 August, 2007
#Arabica: Visual Studio 2005 Builds

Recently checked in a set of solution and project files for Visual Studio 2005/VS8/whatever-you-want-to-call-it. Apart from a bump in version number, the actual text itself looks identical to the VS7 ones which makes me wonder if I can generate one from the other.

I've built everything through using libXML2 and MSXML, and one compiler bug workaround aside, everything seems to be in order.


[Add a comment]

Tuesday 07 August, 2007
#Always comes back to bite you in the end ...
  ++node;
  if(getNodeId<string_adaptor>(node) != impl::Literal_id) // not sure if this is always safe

I'd had a feeling that this line might be dereferencing an off-the-end iterator but in the absence of anything blowing up never went back to check all the corner cases.

Started building Arabica with Visual Studio 2005 this morning. Its shiny new debugging Standard Library immediately caught that it absolutely is being naughty.

Oops.
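
The general cure for this kind of thing is to compare against the end iterator after incrementing and before dereferencing. The sketch below is not the actual fix: it uses a plain vector of integer ids as a stand-in for the parse tree nodes, and next_is_literal is a made-up name.

```cpp
#include <vector>

// Hypothetical: step to the next node and report whether it is a literal.
// Node ids are plain ints here, standing in for the real node types.
inline bool next_is_literal(std::vector<int>::const_iterator node,
                            std::vector<int>::const_iterator end,
                            int literal_id) {
  if (node == end) return false;
  ++node;
  if (node == end) return false;  // nothing follows, so nothing to inspect
  return *node == literal_id;     // safe: node is known to be dereferenceable
}
```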

Fix hitting svn now ...
jez, 7th Aug 2007

[Add a comment]

Thursday 19 July, 2007
#XSLT: The Big Merge

Since I started on my XSLT development it's hung out on its own Subversion branch, so as not to clutter up the trunk with something that was half-cocked. In the meantime, I've committed various little changes and fixes to the trunk, and so the two have started to diverge.

For the next piece of XSLT development, I'm looking to implement it using some DOM Level 3 facilities. Aha! They don't exist yet, but would clearly be useful in their own right and so I'd like to have that stuff on the trunk too.

After some humming and harring, I've merged the dev branch into the mainline. The XSLT isn't finished, but there's enough there to be useful in certain circumstances, so I don't feel it pollutes the place too much. I've committed a big set of changes to svn this evening. It all builds ok, although I'm seeing one or two regressions in the tests. The XSLT tests unsurprisingly have loads of fails and may, depending on your platform, core dump.


[Add a comment]

Thursday 14 June, 2007
#XSLT: tests

One of the difficulties in running the OASIS test suite was picking out real test fails from noise. Because I haven't yet implemented some XSLT elements and functions, there are many tests which will fail but which don't represent an actual bug. There are some other tests which have HTML output which I also don't do, want to use alternative text encodings (which I think is outside spec), where there's some implementor discretion, and there are a few where the test itself is wrong.

I've extended the XSLT test runner to read a list of expected fails, and adjust the test results accordingly. Individual tests can be marked as expected compile or runtime fails, and the summary output is annotated accordingly. It's a little thing, but it helps :)
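
A rough sketch of how an expected-fails list can feed into the results follows. It is not the real test runner: the enum and function names are invented, and the rule that an expected failure which unexpectedly succeeds counts as a fail is taken from the test result write-ups elsewhere on this page. Treating an expected runtime fail that doesn't even compile as a skip is a guess made for the example.

```cpp
#include <map>
#include <string>

enum class Expected { Pass, CompileFail, RunFail, Skip };
enum class Outcome { Pass, Fail, Skipped };

// Hypothetical lookup: tests not named in the list are expected to pass.
inline Expected expectation(const std::map<std::string, Expected>& list,
                            const std::string& test) {
  const auto it = list.find(test);
  return it == list.end() ? Expected::Pass : it->second;
}

// Combine what was expected with what actually happened.
inline Outcome classify(Expected expected, bool compiled, bool ran_ok) {
  switch (expected) {
    case Expected::Skip:        return Outcome::Skipped;
    // An expected failure that unexpectedly succeeds is reported as a fail.
    case Expected::CompileFail: return compiled ? Outcome::Fail : Outcome::Skipped;
    case Expected::RunFail:     return (compiled && ran_ok) ? Outcome::Fail : Outcome::Skipped;
    case Expected::Pass:        break;
  }
  return (compiled && ran_ok) ? Outcome::Pass : Outcome::Fail;
}
```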


[Add a comment]

Friday 11 May, 2007
#XSLT: axes tests

Back on Arabica again after what seems like a very long time. Working on the axes tests of the OASIS suite. There are 130 tests in this bit of the suite, and as I type there are 3 fails - one each because I haven't implemented xsl:number or xsl:strip-space yet, and the last one because something's broken. Let's see if I can fix it :)

Looks like a problem with my implementation of current().
jez, 11th May 2007
XPath::ExecutionContext copy constructor wasn't setting the current node. Fix committed. Axes tests now pass 128 tests, fail 2.
jez, 11th May 2007
2742/1403/29.
jez, 11th May 2007

[Add a comment]

Thursday 22 March, 2007
#XSLT: Tests again

Working away on the OASIS test suite again. My numbers are now 2742 run, 1549 failures, 53 errors. Progress, I'd call it.

Getting better all the time :) 2742/1492/30
jez, 23rd Mar 2007
2742/1486/30. Yay
jez, 23rd Mar 2007
2742/1483/30
jez, 23rd Mar 2007
2742/1477/30. I really should get on with my conference slides.
jez, 23rd Mar 2007
2742/1421/30. Should really be in bed now.
jez, 24th Mar 2007
2742/1409/30. Ok, that's it.
jez, 24th Mar 2007

[Add a comment]

Saturday 24 February, 2007
#XSLT: Tests

I'm now running the entire OASIS XSLT test suite. The raw numbers are 2742 tests run, 1367 fails, and 277 exceptions.

At first glance those numbers don't look great, I'll grant you, so let me gloss them a bit. Of the 277 exceptions, 93 are due to XSLT functions I haven't implemented yet - format-number(), generate-id(), id(), key() and lang(). I don't think I've ever used those functions in my life, so I'm not losing sleep over those at the moment. That does still leave nearly 200 other exceptions, which clearly I need to look at.

Of the failures, 352 are due to XSLT elements I haven't implemented - xsl:attribute-set, xsl:strip-space, xsl:preserve-space, xsl:key. Again, not overly worried about those yet. (Why are missing functions exceptions and missing elements fails? Implementation detail - I'll sort it out). A whopping 536 failures are because the reference output I'm comparing against isn't XML, it's either plain text or HTML so my simple minded test driver can't currently deal with it.

If we take those numbers out we get 1761 tests run, 479 fails, and 184 exceptions, which looks a bit better. A few of those fails are false negatives due to whitespace differences, attribute order differences, or non-significant differences in namespace declarations. The remaining failures don't all represent unique bugs (ie a mistake I've made) or misfeatures (ie something that, while the code is correct, does the wrong thing), so it doesn't feel overly daunting. Obviously, I'll keep chipping away at them.

Made a two line change, now 2742/1354/276. Yay!
jez, 26th Feb 2007
Unimplemented functions now marked as fails not exceptions. Numbers are now 2742/1535/95.
jez, 26th Feb 2007
And now 2742/1533/95. Rock.
jez, 26th Feb 2007

[Add a comment]

Wednesday 21 February, 2007
#XSLT: Tests

Started working with the OASIS XSLT Conformance test suite, which is a whole pile of XML + XSLT with corresponding reference output. I've hacked up a basic driver (it does the styling ok, but the output comparison is extremely simple), but it's already starting to produce useful results.

At the moment, I'm working on the 130 tests exercising various XPath axes. Started off with 36 fails and 11 errors. A fail means there was some output produced but it didn't match the reference, an error means an exception was thrown. I'm now at 41 fails and 1 error.

That's going in the right direction I think.


[Add a comment]

Tuesday 23 January, 2007
#Arabica Builds
In the last couple of days I have built Arabica on the following platforms. In each case, I used Expat 2.0.0 and Boost 1.33.1.
  • Windows XP Professional, using Visual Studio 7.1

Reports on other platforms welcome.


[Add a comment]

#Arabica January 2007 Release

Happy New Year

Here's Arabica's first release of 2007. It contains a number of incremental improvements, but nothing you might describe as startling.

  • Build
    • Further improvements to the configure system. The build can now be configured without the Boost libraries in which case the XPath components are skipped. Parser detection is improved, as is detecting the correct libraries for sockets.
    • Added a Visual Studio solution file to build with Boost.
  • Code
    • Added TreeWalker and NodeFilter implementation, part of DOM Traversal. Thanks to craigp for that.
    • Beefed up the MSXML version checking. Thanks to Sten Darre for that.
    • Reworked the buffering in convertstream to reduce the number of dynamic allocations. This should make it quicker, and that will be particularly noticeable on large documents. Thanks to Timo Geusch for profiling and suggesting the change.
    • LexicalHandler and DeclHandler are now part of the XMLReader interface. They can now be set directly, rather than fiddling around with setProperty and the strange casting that involved. XMLFilterImpl has been extended to support them, as has the DefaultHandler. DefaultHandler2 is now redundant and has been deprecated.

Build reports are very welcome, particularly from non-i386 platforms and/or non-GCC compilers.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Jan2007.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Jan2007.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Jan2007.zip

See you next time with some XSLT.


[Add a comment]

Friday 05 January, 2007
#XSLT: extension functions

Started work implementing the handful of functions XSLT adds to XPath. Some of them are easy-peasy

The current function returns a node-set that has the current node as its only member.
while others are not quite so ...
The format-number function converts its first argument to a string using the format pattern string specified by the second argument and the decimal-format named by the third argument, or the default decimal-format, if there is no third argument. The format pattern string is in the syntax specified by the JDK 1.1 DecimalFormat class. The format pattern string is in a localized notation: the decimal-format determines what characters have a special meaning in the pattern (with the exception of the quote character, which is not localized). The format pattern must not contain the currency sign (#x00A4); support for this feature was added after the initial release of JDK 1.1. The decimal-format name must be a QName, which is expanded as described in [2.4 Qualified Names]. It is an error if the stylesheet does not contain a declaration of the decimal-format with the specified expanded-name.
That's clear then. So make it work just like the first version of JDK 1.1. Righto.
NOTE:Implementations are not required to use the JDK 1.1 implementation, nor are implementations required to be implemented in Java.
Well, thanks for that.



[Add a comment]

Tuesday 02 January, 2007
#XSLT: Template priority

Latest commits sort xsl:templates by priority, which is jolly super really. One last little bit to do on that

It is an error if this leaves more than one matching template rule. An XSLT processor may signal the error; if it does not signal the error, it must recover by choosing, from amongst the matching template rules that are left, the one that occurs last in the stylesheet.
Not quite sure which tack to take yet. Signalling an error would be easier, but slower at stylesheet runtime. Doing the one that occurs last will be slightly more work, but more useful and slightly quicker when running the stylesheet. I'll see how the mood takes me, but Saxon, MS-XSLT and Xalan all recover. Saxon issues a warning too, so it's taking the runtime hit too.

I was wrong about how much extra work it would be to resolve a conflict in favour of the later template. It took exactly one line :)
jez, 3rd Jan 2007
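
The recovery rule quoted above can be sketched like this. It is an illustration rather than the Arabica implementation: MatchedTemplate and choose_template are invented names, and import precedence is left out entirely. Among the templates that matched, the highest priority wins, and a tie goes to the one that occurs last in the stylesheet.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record of a template rule that matched the current node.
struct MatchedTemplate {
  double priority;     // explicit or default priority
  int document_order;  // position in the stylesheet, larger means later
};

// Highest priority wins; a tie is resolved in favour of the later template.
// Returns nullptr when nothing matched.
inline const MatchedTemplate* choose_template(
    const std::vector<MatchedTemplate>& matches) {
  if (matches.empty()) return nullptr;
  return &*std::max_element(
      matches.begin(), matches.end(),
      [](const MatchedTemplate& a, const MatchedTemplate& b) {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.document_order < b.document_order;
      });
}
```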

[Add a comment]

Friday 22 December, 2006
#XSLT: Template matches

Just committed some changes to take me another step closer to calculating template match priorities. I'm really quite pleased with how I wiggled through that one :)

Once this is finished, I hope to be able to cut an initial release early in the new year. That would be super.


[Add a comment]

Thursday 14 December, 2006
#DOM: TreeWalker

Something I started but never finished years ago was DOM Traversal. I wrote a load of header files, then never did anything else. Never even looked at it. Never even thought about it, nor used an implementation in another language. Looks like I was a bit silly though, because I've had a TreeWalker implementation donated by a chap called craigp. I wrote a few basic tests for it, and it actually looks rather good. I can think of several places over the past few years where it would have been jolly useful.

In short - if you're walking a DOM, have a look at TreeWalker.


[Add a comment]

Wednesday 13 December, 2006
#Mingw MSYS builds

Spent a little while playing with Mingw's MSYS, which I'd not heard of before and which turns out to be rather cool. Committed some configure patches so that Arabica builds on it properly.


[Add a comment]

Monday 04 December, 2006
#XSLT: xsl:namespace-alias

Watching December to Dismember. Team Extreme have just won the overlong curtain jerker against MNM. I just committed initial support for xsl:namespace-alias. It doesn't work properly in the face of xsl:imports yet, nor does it do anything with attributes in namespaces. The main bits work though. Now it's time for Matt Striker vs. Balls Mahoney. Yay.

Wrestling wasn't quite what I was hoping for - ECW really is being mishandled. Attributes in namespaces are now remapped correctly. Had a crack at #default too, although it's not quite there yet.
jez, 4th Dec 2006

[Add a comment]

Friday 24 November, 2006
#

Not written up anything here for a while, but I've been nurdling away in little half-an-hour to an hour episodes. I've just committed some changes for the xsl:variable implementation. I had the select working, but now I've got child templates working too, complete with delayed execution and all that stuff. Parameters and parameter passing work too. I've also done a load of work on serialisation. It's really starting to come together.

Still some bits and bobs to do - not quite sure about how to implement namespace aliases yet, and I need to start calculating template match priorities. If things tick on, I'm expecting to do a release around Christmas. Any takers for doing a bit of testing on a command line executable?


[Add a comment]

Sunday 29 October, 2006
#XSLT: xsl:import and xsl:apply-imports

Last week I wondered aloud whether my current SAX based approach to building an XSLT engine was going to work. Specifically the piece of spec text I was worried about was section 2.6.2 which says

The xsl:import element is only allowed as a top-level element. The xsl:import element children must precede all other element children of an xsl:stylesheet element, including any xsl:include element children. When xsl:include is used to include a stylesheet, any xsl:import elements in the included document are moved up in the including document to after any existing xsl:import elements in the including document.
It was the "moving up" part that I was worried about. Moving up - you can't do that when you're working in a stream mode.

Section 2.6.2 continues with a discussion of import precedence, which gave me the clue I needed. It says

An xsl:stylesheet element in the import tree is defined to have lower import precedence than another xsl:stylesheet element in the import tree if it would be visited before that xsl:stylesheet element in a post-order traversal of the import tree.

...
Each definition and template rule has import precedence determined by the xsl:stylesheet element that contains it.

For example, suppose
  • stylesheet A imports stylesheets B and C in that order;
  • stylesheet B imports stylesheet D;
  • stylesheet C imports stylesheet E.
Then the order of import precedence (lowest first) is D, B, E, C, A.
The first sentence there isn't the world's clearest, but the important bit is "post-order traversal". In English, that means backwards. I'll come back to this in a minute.

While the spec says imports are "moved up", that doesn't mean they have to be actually moved around and processed before the remainder of the stylesheet. In the example above, consider how you'd work out the import precedence of B. You need to know not only about B, but also C and E. However, you need to have looked into C to know about E, to work out the precedence of B. Makes my head hurt anyway. But you don't need to do it in the way the spec seems to be leading you. In fact, you can do anything you like so long as the result achieved is equivalent.

Turns out, processing imports can be delayed until the very end of the stylesheet. All you need to do is keep a list of the stylesheets to be imported as you go. Once you hit the end of the initial stylesheet, you work through the list you've made in reverse order (post-order traversal, remember) and load each extra stylesheet. Should one of those stylesheets import a further stylesheet, push it on to the end of the list, which automatically makes it the next stylesheet you pull in. Keep going until your list is empty. Magically, you've pulled in the imported stylesheets in exactly the order required by the spec. Run through the example above, and you'll see it just drops out. Fab.
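To make that concrete, here's a small sketch of the list-of-pending-imports idea. It's an illustration under assumptions rather than the code that went into svn: ImportGraph and import_precedence are invented names, the "stylesheets" are just a map from a name to the names it imports, and there's no attempt to detect circular imports.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical import graph: stylesheet name -> the stylesheets it imports,
// in document order. A real compiler would discover these while parsing.
typedef std::map<std::string, std::vector<std::string> > ImportGraph;

// Returns the stylesheet names in order of import precedence, lowest first.
std::vector<std::string> import_precedence(const ImportGraph& graph,
                                           const std::string& root) {
  std::vector<std::string> loaded;   // order in which stylesheets are loaded
  std::vector<std::string> pending;  // imports noted but not yet loaded
  pending.push_back(root);

  while (!pending.empty()) {
    // Always take from the end of the list, so the most recently noted
    // import is the next stylesheet pulled in.
    std::string current = pending.back();
    pending.pop_back();
    loaded.push_back(current);

    // "Parsing" the stylesheet: note its imports on the end of the list.
    ImportGraph::const_iterator found = graph.find(current);
    if (found != graph.end())
      pending.insert(pending.end(), found->second.begin(), found->second.end());
  }

  // Stylesheets were loaded highest precedence first; flip for lowest first.
  return std::vector<std::string>(loaded.rbegin(), loaded.rend());
}
```

Feeding in the example from the spec (A imports B and C, B imports D, C imports E) loads the stylesheets in the order A, C, E, B, D, so the function returns D, B, E, C, A.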

I was quite startled when I realised how straightforward it was. Initial implementation only took an hour or so, and I've just committed it to svn.


[Add a comment]

Wednesday 25 October, 2006
#XSLT: xsl:include

Just committed an initial run at xsl:include. It's incomplete in approximately 19 ways, but for a super-trivial example

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:include href="identity.xsl"/>
</xsl:stylesheet>
where identity.xsl is pretty much what you'd expect, it all works. Yay.


[Add a comment]

Tuesday 24 October, 2006
#SAX: LexicalHandler and DeclHandler

Promoted LexicalHandler and DeclHandler to be full members of XMLReader, rather than properties set via the rather tortuous setProperty call. XMLFilter and XMLFilterImpl have been extended to provide support for Lexical~ and DeclHandler. DefaultHandler now provides default, do nothing, implementations of Lexical~ and DeclHandler. Consequently DefaultHandler2 is now deprecated and will be removed in due course.


[Add a comment]

Monday 23 October, 2006
#XSLT: Run away from the hills! If you see hills, run the other way!

An XSLT processor has two distinct pieces: a compiler, which reads the stylesheets and builds an executable model of some sort (the transformer); and the compiled transformer which you run against a target document. Obviously the compiler needs to know about the transformer and how to build it, but the transformer need know nothing about how it sprang into being.

This is good for me, because it means I might only have half the job to redo. Up to now, I've been compiling stylesheets in a streaming mode using a big pile of SAX content handlers. As I encounter an xsl:element, say, I connect the SAX event stream to an xsl:element handler which creates the xsl:element object, populates it, validates it, adds it to the containing object, and finally creates the next handler and connects that. It's a valid (and I think under-utilised) approach to building object graphs from XML documents - in each handler the context is well defined, the housekeeping is straightforward, the memory requirements are low. I knew there was a possibility I'd have to build the stylesheet as a DOM, if only to satisfy the document('') function. I'd hoped that maybe I could be a bit clever and only do that if that function was actually used.
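As a rough picture of that handler-per-element style, here's a toy version. Everything in it is hypothetical (Item, ItemHandler and StylesheetBuilder are invented for the example, and none of Arabica's SAX interfaces appear). Each handler's whole context is the one object it's populating, and starting a child element hands the event stream on to a fresh handler.

```cpp
#include <memory>
#include <string>
#include <vector>

// A node in the object graph being built (hypothetical, for illustration).
struct Item {
  std::string name;
  std::vector<std::unique_ptr<Item>> children;
};

// One handler per element being compiled. Its context is just the item
// it is responsible for populating.
class ItemHandler {
public:
  explicit ItemHandler(Item& item) : item_(item) {}

  // A child element starts: create its object, add it to the containing
  // object, and return the handler that takes over the event stream.
  ItemHandler start_child(const std::string& name) {
    item_.children.push_back(std::unique_ptr<Item>(new Item{name, {}}));
    return ItemHandler(*item_.children.back());
  }

private:
  Item& item_;
};

// Routes start/end events to whichever handler is currently connected.
class StylesheetBuilder {
public:
  StylesheetBuilder() {
    root_.name = "stylesheet";
    handlers_.push_back(ItemHandler(root_));
  }
  StylesheetBuilder(const StylesheetBuilder&) = delete;
  StylesheetBuilder& operator=(const StylesheetBuilder&) = delete;

  void start_element(const std::string& name) {
    handlers_.push_back(handlers_.back().start_child(name));
  }
  // The element is complete, so reconnect the parent's handler.
  void end_element() { handlers_.pop_back(); }

  const Item& root() const { return root_; }

private:
  Item root_;
  std::vector<ItemHandler> handlers_;
};
```

The only state kept while parsing is the chain of open handlers, which is where the low memory footprint comes from. A real handler would also validate attributes and content as the events arrive.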

Re-reading the spec last night (reading a spec is always a good idea when trying to implement it) I realised I'd coded myself into a dead end, and it was time to turn around. Two paragraphs in particular changed my mind. In section 2.6.2 it says

The xsl:import element is only allowed as a top-level element. The xsl:import element children must precede all other element children of an xsl:stylesheet element, including any xsl:include element children. When xsl:include is used to include a stylesheet, any xsl:import elements in the included document are moved up in the including document to after any existing xsl:import elements in the including document.
Section 11.4 says
Both xsl:variable and xsl:param are allowed as top-level elements. A top-level variable-binding element declares a global variable that is visible everywhere.

Do you see the problem? To implement these requirements correctly requires out of order processing. For a top-level variable to be visible everywhere, all the top-level variables must be processed before anything that might reference them. For imports to be moved up, you need to know the surrounding context.

You might still be able to deal with this using streaming processing, but it becomes much more complicated. You'd have to make one pass to build the object model, then make a pass over the model itself to validate it, perhaps defer some processing, and it all starts to look a little hairy.

Using a DOM, this is all much more straightforward. You'd parse the stylesheet into a DOM, and walk over that using XPath. It would make other things, like include handling, more straightforward too.

When I started writing this, I'd decided to rewrite what I'd done using a DOM, but now I'm not so sure. I think maybe I could get the SAX handling to work after all. The rearranging and reordering only needs to happen at the top level of the document. Hmm, perhaps I need to reread the spec again :)


[Add a comment]

Sunday 22 October, 2006
#XSLT: apply-templates mode

Modes now work. Yay.


[Add a comment]

Thursday 19 October, 2006
#XSLT: namespaces

Hooked up namespaces declared in the stylesheet to the XPath compiler, so template matches and XPath expressions can now use namespaces as normal.


[Add a comment]

Wednesday 18 October, 2006
#XSLT: inline elements

Committed first cut of inline element generation - where you have XML literals within the stylesheet. It's ok, although namespaces are not hooked up yet and attributes get dropped on the floor.

Update: attributes now handled properly
jez, 18th Oct 2006

[Add a comment]

#XSLT: call-template

Just committed an initial implementation of xsl:call-template. It does the right thing for correct XSLT. At the moment, though, it can do the wrong thing for incorrect XSLT - if you give an unknown name it raises a run-time error not a compile-time error.

Things are actually getting quite useful now. If you'd like to have a play, grab the mangle branch from Subversion

svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/mangle
and have a look in include/XSLT for the bulk of the code and in the examples/XSLT directory for a little sample command line XSLT transformer to play with. Comments, bug reports and patches are all very welcome.


[Add a comment]

Tuesday 17 October, 2006
#XSLT: variable, param and with-param

Spent two or three hours over the last couple of days implementing xsl:variable, xsl:param and xsl:with-param. In that order. Doing xsl:variable is easy, xsl:param is essentially the same except you're looking to see if something of the same name already exists, while xsl:with-param is putting that thing there beforehand (if you see what I mean).
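If that's hard to picture, here's a stripped-down sketch of how the three relate. It's an assumption-laden toy, not the real implementation: VariableScope and its member functions are invented names, values are plain strings rather than XPath results, and there's no nesting of scopes.

```cpp
#include <map>
#include <string>

// Hypothetical execution context; illustrative only.
class VariableScope {
public:
  // xsl:with-param: the caller puts a value in place beforehand.
  void pass_param(const std::string& name, const std::string& value) {
    passed_[name] = value;
  }

  // xsl:variable: always binds the value produced by its select.
  void declare_variable(const std::string& name, const std::string& value) {
    bound_[name] = value;
  }

  // xsl:param: the same as xsl:variable, unless something of the same
  // name was already passed in, in which case that value is used.
  void declare_param(const std::string& name,
                     const std::string& default_value) {
    std::map<std::string, std::string>::const_iterator passed =
        passed_.find(name);
    bound_[name] = (passed != passed_.end()) ? passed->second : default_value;
  }

  // Resolve $name; returns an empty string if nothing is bound.
  std::string lookup(const std::string& name) const {
    std::map<std::string, std::string>::const_iterator found =
        bound_.find(name);
    return (found != bound_.end()) ? found->second : std::string();
  }

private:
  std::map<std::string, std::string> passed_;  // set by with-param
  std::map<std::string, std::string> bound_;   // visible as $name
};
```

So a template called with pass_param("colour", "red") and then declaring declare_param("colour", "blue") sees red, while a param nobody passed falls back to its own default.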

Currently they all implement the @select behaviour. I haven't done child content yet, and that'll probably wait a while. There are more interesting things to do :)

[Anna - it's ok that this doesn't make any sense to you.]

Thank you for the reassurance. I'm still hoping for a post about cooking or ducks, though
Anna, 18th Oct 2006

[Add a comment]

Thursday 28 September, 2006
#

Spent a few minutes playing with the new Turbo C++. It looks pretty lovely, although I haven't spent much time with it yet. I have it building Arabica as a static library against expat, libxml2, and, rather surprisingly, MSXML. Minor issue with Xerces, which just requires an extra header included somewhere (memcpy not declared). No luck so far compiling with Boost. Not sure why.

Initial build files are in svn here.


[Add a comment]

Wednesday 27 September, 2006
#

Had a nice email reporting a successful build and test on Mac OS X 10.4 PPC with g++ 4.0.0 and Boost 1.33.1. Marvellous.


[Add a comment]

Saturday 23 September, 2006
#XSLT Development Branch

Back at the end of last year, I started work on an XSLT engine. I had a DOM, I had XPath, it seemed the obvious thing to do. Yeah, I know.

Anyway, now the Autotools excitement is over and done with, I've kicked off a new branch for the XSLT development, and checked in the code so far. Right now, I have the easy bits done (skeleton implementations of xsl:stylesheet and xsl:template, partial implementation of xsl:apply-templates, and finished up xsl:choose, xsl:comment, xsl:copy, xsl:copy-of, xsl:for-each, xsl:if, xsl:processing-instruction, xsl:text, xsl:value-of and xsl:sort). There's a smidgen of test code and a little command line executable to play with. I know the shape of what's to come, it's just SMOP from now on :)


[Add a comment]

Thursday 21 September, 2006
#Arabica September 2006 Release 3

Another release. I feel giddy.

This release further improves the configuration options, allowing easier selection of the underlying XML parser library.

The Visual Studio 2003 solution and project files have been properly updated (something I unaccountably forgot to do), so that all the tests and examples are built. Building Arabica against MSXML also now works correctly again.

This release introduces no other new functionality.

Build reports are welcome. Do please drop me a line about how it goes on your particular OS/compiler/library combination to jez@jezuk.co.uk

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-3.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-3.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-3.zip


[Add a comment]

Monday 18 September, 2006
#Arabica September 2006 Release 2

Yet another release. Alarming, I know.

This release follows hot on the heels of the previous one, adding incremental tweaks and improvements to the new GNU Autotools based build.

Most significantly, Arabica can now be configured to build without the Boost libraries. If Boost is not available, or is specifically turned off using configure's --without-boost option, the XPath engine is excluded from the build, as is the ability to provide XMLReader template parameters in any order. Chances are you aren't using that second one anyway.

This release introduces no other new functionality.

Build reports are very welcome.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-2.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-2.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006-2.zip


[Add a comment]

Thursday 07 September, 2006
#Arabica September 2006 Release

Frighteningly, I've just cut a new Arabica release, a mere month after the previous one. On the one hand I'm happy to announce this, because it's quite significant. On the other I'm not, because the reason I've had time to do it is that I haven't had any work. Don't be afraid to get in touch if you need an extra pair of hands.

This release significantly simplifies the build procedure, as I've finally abandoned my increasingly unwieldy collection of Makefile variables in favour of GNU Autotools. Arabica has been autoconfiscated, so on any reasonable Unix box the familiar incantation of ./configure - make - make install will do the business.

The configure script will check for Expat, Libxml2 or Xerces, in that order, and use whichever it finds first. It also confirms the Boost libraries are available. Finally, it checks for std::wstring support. If std::wstring isn't available, then the appropriate bits of the build are turned off.

As part of autoconfiscating the build, the source tree has been rearranged slightly. The library source files have been split out from the header files, which makes it much easier to sort out installing the built library. The Visual Studio 7 solution files have also been moved out of the main tree into their own subdirectory.

This release introduces no new functionality or bug fixes.

The release has been successfully built on Linux, Cygwin, FreeBSD and DragonFly BSD. Build reports on other platforms, particularly those not using GCC, would be welcome.

Source tar.bz2
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006.tar.bz2

Source tar.gz
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006.tar.gz

Source zip
http://prdownloads.sourceforge.net/arabica/arabica-Sept2006.zip


[Add a comment]

Monday 21 August, 2006
#Cafe con Leche announcement - that should boost the download stats :)
[Add a comment]

Monday 07 August, 2006
#New Subversion repository

Using the miracle of cvs2svn, I have migrated the Arabica source from the Sourceforge CVS server to my Subversion server chugging away under my desk. I've only used Subversion seriously for a little while, but I have to say I rather like it. The CVS repository on SourceForge will remain available, but new development will now be committed to Subversion.

Anonymous access to the repository is available from

svn co svn://jezuk.dnsalias.net:/jezuk/arabica/trunk/

and the repository can be browsed at http://jezuk.dnsalias.net:/viewvc/.


[Add a comment]

SourceForge Project Page

Jez Higgins