The "Probably long overdue release" bringing a big chunk of new functionality.
TagSoup, if you're not familiar with it, is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.

Obviously, if you have a SAX parser you can apply all your standard XML techniques: not only SAX filters, but building a DOM, applying XPaths, or XSLT transformations as well.
Cowan describes what TagSoup does as
TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.
The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
Looks straightforward, doesn't it? Well, that's a simple example and it's still a tricky and awkward result in practice. Cowan's patience in pursuing this, and what looks like a rather elegant solution, is to be applauded. Porting his code to C++ was quick and painless, and Taggle is a useful addition to Arabica. Thanks, John.
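To make the restart behaviour in the bold/italic example concrete, here is a minimal sketch of one way to do it with a stack of open element names. It is not TagSoup's or Taggle's actual algorithm; the function name `restart_overlapping_tags` is made up for illustration, and attributes, comments, void elements and the implied HTML/BODY/UL wrappers are all ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of TagSoup or Arabica: rewrites overlapping
// tags so that they nest properly. Tag names are lower-cased on output.
std::string restart_overlapping_tags(const std::string& html)
{
  std::string out;
  std::vector<std::string> open;  // names of the elements currently open

  std::size_t i = 0;
  while (i < html.size())
  {
    if (html[i] != '<')
    {
      out += html[i++];  // ordinary text is copied straight through
      continue;
    }

    const std::size_t end = html.find('>', i);
    if (end == std::string::npos)
    {
      out += html.substr(i);  // stray '<': treat the rest as text
      break;
    }

    const bool closing = (end > i + 1) && (html[i + 1] == '/');
    const std::size_t name_start = i + (closing ? 2 : 1);
    std::string name = html.substr(name_start, end - name_start);
    for (char& c : name)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    i = end + 1;

    if (!closing)
    {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }

    // Search down from the top of the stack for the element being closed.
    std::size_t depth = open.size();
    while (depth > 0 && open[depth - 1] != name)
      --depth;
    if (depth == 0)
      continue;  // no such element is open: drop the stray end tag

    // open[depth - 1] is the match. Everything above it is closed first,
    // then the match itself, then the interrupted elements are restarted.
    const std::vector<std::string> restart(open.begin() + depth, open.end());
    while (open.size() >= depth)
    {
      out += "</" + open.back() + ">";
      open.pop_back();
    }
    for (const std::string& r : restart)
    {
      open.push_back(r);
      out += "<" + r + ">";
    }
  }

  // Close anything still open at end of input, innermost first.
  while (!open.empty())
  {
    out += "</" + open.back() + ">";
    open.pop_back();
  }
  return out;
}
```

Fed the overlapping input from the example, this sketch produces the properly nested form shown in the rewritten line, minus the sentence-ending full stop. A real parser would emit SAX events rather than build a string, and has many more cases to deal with.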
Arabica Taggle chews through HTML, providing the same SAX XMLReader interface as the XML parser, and can be used in exactly the same way. HTML source can be fed through SAX filter stacks, used to build DOM trees, queried with XPath, or transformed using XSLT.
There are, of course, many other fixes and changes. Most are relatively minor, and if you haven't been bitten by them you won't notice. The most significant changes are in Arabica's XSLT engine, Mangle. While still not feature complete and under development, it takes, in this release, a fairly big step forward.
AttributesImpl.getIndex. Thanks to Isak Johnsson for that, and what on earth was I thinking to me
XMLReaderInterface. It only needs the string_adaptor. Any additional parameters are only of interest to the implementing parser class.
TextCoalescer filter into the DOM builder, so that consecutive bits of text get applied to a single Text or CDATA node, rather than as a series of nodes. (A series of nodes is perfectly legal, it's just slightly unexpected. Even to me, and I work with DOMs a lot :)
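The idea of coalescing text can be sketched without any Arabica code at all. The `Event` struct and `coalesce_text` function below are hypothetical stand-ins for a stream of SAX callbacks and are not the real TextCoalescer interface; the sketch only shows consecutive character events being merged into one.

```cpp
#include <string>
#include <vector>

// Hypothetical event model standing in for SAX callbacks.
struct Event
{
  enum Type { StartElement, EndElement, Characters };
  Type type;
  std::string data;  // element name, or character data for Characters
};

// Merges each run of consecutive Characters events into a single event,
// leaving every other event untouched and in its original order.
std::vector<Event> coalesce_text(const std::vector<Event>& events)
{
  std::vector<Event> out;
  for (const Event& e : events)
  {
    if (e.type == Event::Characters &&
        !out.empty() && out.back().type == Event::Characters)
      out.back().data += e.data;  // extend the current run of text
    else
      out.push_back(e);
  }
  return out;
}
```

A streaming filter would more likely buffer the text and flush it when the next non-text event arrives, but the effect on whatever is building the tree is the same: one chunk of text, so one node.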
XPathExpressionPtr both exposed implementation details and provided an interface that was inconsistent with the DOM classes, because you accessed the member functions via ->. At the time, I was just pleased to have got the XPath stuff done and wasn't really fussed, so I left it. Since then, though, it's niggled and niggled away at the back of my mind and now I've done something about it.
XPathExpression, with the member functions accessed through the
-> member access are retained for the meantime, so that existing code won't be broken. Existing code using XPathValuePtr will still work, but new stuff should use XPathValue.
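The general shape of that kind of change, a value class wrapping the smart pointer that used to be the public type, might look something like the sketch below. All the names here (`ExpressionImpl`, `ExpressionPtr`, `Expression`, `text()`) are invented for illustration and are not Arabica's classes or member functions; keeping `operator->` on the wrapper is just one assumed way of leaving old call sites compiling.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical implementation class that lives behind the handle.
class ExpressionImpl
{
public:
  explicit ExpressionImpl(std::string text) : text_(std::move(text)) { }
  const std::string& text() const { return text_; }

private:
  std::string text_;
};

// Old style: the smart pointer itself is the public type, so callers
// write expr->text() and can see that a pointer is involved.
typedef std::shared_ptr<ExpressionImpl> ExpressionPtr;

// New style: a small value class owns the pointer and forwards calls,
// so callers write expr.text(), much as they would with a DOM class.
class Expression
{
public:
  explicit Expression(std::string text)
    : impl_(std::make_shared<ExpressionImpl>(std::move(text))) { }

  const std::string& text() const { return impl_->text(); }

  // Assumed compatibility shim so that code written against the
  // pointer-style interface still compiles.
  const ExpressionImpl* operator->() const { return impl_.get(); }

private:
  ExpressionPtr impl_;
};
```

Copies of an `Expression` share the same underlying implementation object, which keeps the handle cheap to pass around by value.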
prefix:* didn't compile. I had no test for it, and had overlooked it. Now I do, and it isn't
text() test to match CDATA nodes as well as text nodes
node() matches any node of any type. In an XSLT match pattern, node() matches everything except attributes and the document root node. Fixed.
xsl:sort specifies a numerical sort, but you've got some string data in there, we need to maintain the relative positions of that string data. This is the first time I can recall actually using std::stable_sort. I will mark it down in my big book of programming accomplishments.
xsl:message can contain another xsl:message - now handled properly
xsl:choose has at least one
mode attribute is not empty
xsl:call-template now throws if it can't find a matching template
current() in match patterns
xsl:for-each selects a node-set
xsl:stylesheet now allows top-level elements when they are in a foreign namespace
last() and positional predicates in match patterns
select attribute and text content
xsl:element unprefixed names, when no namespace URI is supplied, are in the default namespace
@xmlns|@xmlns:* selects no nodes
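The xsl:sort item above, the one that called for std::stable_sort, is worth a small illustration. The sketch below is not Mangle's code: `to_number` and `numeric_sort` are hypothetical names, and it assumes that keys which aren't numbers compare equal to one another and sort ahead of every real number. Because std::stable_sort preserves the order of equivalent elements, the string data keeps its relative positions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a sort key to a number, giving NaN for
// anything that does not parse completely as a number.
double to_number(const std::string& s)
{
  if (s.empty())
    return std::nan("");
  char* end = nullptr;
  const double d = std::strtod(s.c_str(), &end);
  return (*end == '\0') ? d : std::nan("");
}

// Sorts keys numerically. Non-numeric keys are treated as equivalent to
// each other and (an assumption of this sketch) as less than any number.
void numeric_sort(std::vector<std::string>& keys)
{
  std::stable_sort(keys.begin(), keys.end(),
    [](const std::string& lhs, const std::string& rhs)
    {
      const double l = to_number(lhs);
      const double r = to_number(rhs);
      if (std::isnan(l))
        return !std::isnan(r);  // NaN goes before numbers, never before NaN
      if (std::isnan(r))
        return false;
      return l < r;
    });
}
```

With plain std::sort the comparator would still be valid, but the two non-numeric keys could legally come out in either order, which is exactly the problem being described.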
Build and installation
mbstate_t. Some platforms don't have it (VxWorks, for example)
Other bits and bobs