Saturday 18 October, 2008
Arabica October 2008 Release

This is the "probably long overdue" release, bringing a big chunk of new functionality.

Source tar.bz2
http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.bz2

Source tar.gz
http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.gz

Source zip
http://downloads.sourceforge.net/arabica/arabica-2008-october.zip

Exciting New Stuff

The exciting new stuff is Taggle, a port of John Cowan's rather super TagSoup package.

TagSoup, if you're not familiar with it, is

a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
Obviously, once you have a SAX parser you can apply all your standard XML techniques: not only SAX filters, but also building a DOM, evaluating XPath expressions, or running XSLT transformations.

Cowan describes what TagSoup does:

TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
Looks straightforward, doesn't it? Well, that's a simple example, and it's still tricky and awkward to get right in practice. Cowan's patience in pursuing this, and what looks like a rather elegant solution, are to be applauded. Porting his code to C++ was quick and painless, and Taggle is a useful addition to Arabica. Thanks, John.
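To give a feel for the overlapping-tag case, here is a toy sketch of the restart idea in C++. It is emphatically not TagSoup's or Taggle's actual algorithm, and the function name restart_overlapping is made up for illustration. It assumes bare <name> and </name> tags only (no attributes, comments, or entities), keeps a stack of open elements, and, when a close tag arrives out of order, closes everything opened since the matching tag and then reopens it straight away.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Toy illustration only (hypothetical helper, not part of Arabica or TagSoup).
// Handles bare <name> and </name> tags; no attributes, comments or entities.
std::string restart_overlapping(const std::string& html)
{
  std::vector<std::string> open;  // stack of currently open element names
  std::string out;

  std::string::size_type pos = 0;
  while (pos < html.size())
  {
    if (html[pos] != '<')
    {
      // Plain text: copy everything up to the next tag.
      std::string::size_type next = html.find('<', pos);
      if (next == std::string::npos)
        next = html.size();
      out.append(html, pos, next - pos);
      pos = next;
      continue;
    }

    const std::string::size_type end = html.find('>', pos);
    if (end == std::string::npos)
      break;  // unterminated tag: this sketch simply drops it

    const bool closing = (pos + 1 < end && html[pos + 1] == '/');
    const std::string::size_type skip = closing ? 2 : 1;
    std::string name = html.substr(pos + skip, end - pos - skip);
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    pos = end + 1;

    if (name.empty())
      continue;  // "<>" or "</>": nothing sensible to do

    if (!closing)
    {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }

    // Close tag: search the stack from the top for the matching element.
    const auto match = std::find(open.rbegin(), open.rend(), name);
    if (match == open.rend())
      continue;  // stray close tag, ignore it

    const std::ptrdiff_t index =
        static_cast<std::ptrdiff_t>(open.size()) - 1 - std::distance(open.rbegin(), match);

    // Elements opened after the match are still open: close them (innermost
    // first), close the match, then reopen them in their original order.
    const std::vector<std::string> restart(open.begin() + index + 1, open.end());
    for (auto it = restart.rbegin(); it != restart.rend(); ++it)
      out += "</" + *it + ">";
    out += "</" + name + ">";

    open.erase(open.begin() + index, open.end());
    for (const std::string& r : restart)
    {
      open.push_back(r);
      out += "<" + r + ">";
    }
  }

  // Close anything still open at end of input, innermost first.
  for (auto it = open.rbegin(); it != open.rend(); ++it)
    out += "</" + *it + ">";

  return out;
}
```

Fed the sample text from the TagSoup documentation above, this sketch produces the same rewritten markup (minus the trailing full stop). The real thing has far more to worry about: which elements are allowed to restart, which parents to supply when they are missing, attributes, and so on.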

Arabica Taggle chews through HTML, providing the same SAX XMLReader interface as the XML parser, and can be used in exactly the same way. HTML source can be fed through SAX filter stacks, used to build DOM trees, queried with XPath, or transformed using XSLT.


Changes and Bug Fixes

There are, of course, many other fixes and changes. Most are relatively minor, and if you haven't been bitten by them you won't notice. The most significant changes are in Arabica's XSLT engine, Mangle. While it is still under development and not yet feature complete, it takes a fairly big step forward in this release.

SAX

DOM

XPath

XSLT

Build and installation

Other bits and bobs




Jez Higgins