Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and XSLT implementations, written in Standard C++.
- SAX is an event-based XML processing API. Arabica is a full SAX2 implementation, including the optional interfaces and helper classes. It provides uniform SAX2 wrappers for the Expat parser, Xerces, Libxml2 and, on Windows, for the Microsoft XML parser.
- The DOM is a platform- and language-neutral interface which models an XML document as a tree of nodes, defined by the W3C. Arabica implements the DOM Level 2 Core on top of the SAX layer.
- XPath is a language for addressing parts of an XML document. Arabica implements XPath 1.0 over its DOM implementation.
- XSLT is a language for transforming XML documents into other XML documents. Arabica builds XSLT over its XPath engine.
- In addition to the XML parser, Arabica includes Taggle, an HTML parser derived from TagSoup.
Arabica is written in Standard C++ and should be portable to most platforms. It is parameterised on string type. Out of the box, it can provide UTF-8 encoded std::strings or UTF-16 encoded std::wstrings, but can easily be customised for arbitrary string types.
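To give a feel for what "parameterised on string type" means in practice, here is a minimal sketch of a SAX-style handler written once as a template and then used with either string type. The class name ElementCounter and its member functions are invented for illustration; they are not Arabica's real classes or signatures.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical handler, templated on string type in the way the text
// describes. The names here are illustrative, not part of Arabica.
template<class string_type>
class ElementCounter
{
public:
  // Called once per start tag, in the manner of a SAX startElement event.
  void startElement(const string_type& name) { ++counts_[name]; }

  // How many start tags with this name have been seen so far.
  std::size_t count(const string_type& name) const
  {
    typename std::map<string_type, std::size_t>::const_iterator i =
        counts_.find(name);
    return i == counts_.end() ? 0 : i->second;
  }

private:
  std::map<string_type, std::size_t> counts_;
};
```

The same code then serves both flavours: ElementCounter<std::string> takes narrow names such as "p", while ElementCounter<std::wstring> takes wide names such as L"p". Supporting another string type would, under this sketch's assumptions, only need that type to be usable as a std::map key.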
Wednesday 30 January, 2008
Taggle: And there it is ...
Taggle, Arabica's port of the TagSoup HTML parser, now builds and runs. It dodges pretty much every encoding issue on the planet, but as a first go it's really quite pleasing. Give it this -
This is <B>bold, <I>bold italic, </b>italic, </i> normal text
and get this
<html>
<i>bold italic, </i>
(Ok, you have to squint a bit at the indenting, but that's a separate issue.)
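For anyone wondering how mis-nested inline tags like these can be untangled at all, here is a toy sketch of one common approach: keep a stack of open elements, and when an end tag arrives for something deeper in the stack, close everything above it and then reopen those elements afterwards. This is an illustration only, not Taggle's or TagSoup's actual algorithm. The function name repair_nesting is made up, and it assumes bare tags with no attributes, comments or entities.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Toy illustration only: NOT the algorithm Taggle or TagSoup uses.
// Assumes bare tags such as <b> and </b>, with no attributes.
std::string repair_nesting(const std::string& soup)
{
  std::vector<std::string> open;  // stack of currently open element names
  std::string out;
  std::string::size_type pos = 0;

  while (pos < soup.size())
  {
    if (soup[pos] != '<')
    {
      // Copy character data through unchanged.
      std::string::size_type next = soup.find('<', pos);
      if (next == std::string::npos)
        next = soup.size();
      out.append(soup, pos, next - pos);
      pos = next;
      continue;
    }

    std::string::size_type end = soup.find('>', pos);
    if (end == std::string::npos)
    {
      // Unterminated tag: treat the remainder as text.
      out.append(soup, pos, std::string::npos);
      break;
    }

    // Extract the tag name and fold it to lower case.
    std::string name = soup.substr(pos + 1, end - pos - 1);
    pos = end + 1;
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if (name.empty())
      continue;  // "<>" carries no information, so drop it

    if (name[0] != '/')
    {
      // Start tag: remember it and emit it.
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }

    // End tag: look for the nearest matching open element.
    name.erase(0, 1);
    std::vector<std::string>::reverse_iterator match =
        std::find(open.rbegin(), open.rend(), name);
    if (match == open.rend())
      continue;  // stray end tag with nothing to close, so drop it

    // Elements opened after the match must be closed first, then reopened.
    std::vector<std::string>::iterator first = match.base() - 1;
    std::vector<std::string> reopen(first + 1, open.end());
    open.erase(first, open.end());

    for (std::vector<std::string>::reverse_iterator r = reopen.rbegin();
         r != reopen.rend(); ++r)
      out += "</" + *r + ">";
    out += "</" + name + ">";
    for (std::vector<std::string>::iterator r = reopen.begin();
         r != reopen.end(); ++r)
    {
      open.push_back(*r);
      out += "<" + *r + ">";
    }
  }

  // Close anything still open at end of input, innermost first.
  for (std::vector<std::string>::reverse_iterator r = open.rbegin();
       r != open.rend(); ++r)
    out += "</" + *r + ">";

  return out;
}
```

Fed the sample line above, this toy returns "This is <b>bold, <i>bold italic, </i></b><i>italic, </i> normal text". A real parser such as Taggle also wraps the result in the document structure and deals with attributes and encodings, so its output will differ.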
If you want to have a play, check out the tagsoup-port branch from subversion:
svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
In examples/Taggle, there's a little command line application that reads HTML documents and prints the corrected markup to the console.
I'll merge this back into the trunk in the next few days.
Why not implement an HTML5 parser instead of porting TagSoup?
zcorpan [e], 1st Feb 2008
Time and inclination. Porting TagSoup to C++ took me a few hours. It was fun, and quite an easy win. Having done it, I'm surprised that nobody's done it before.
Writing an HTML5 parser needs rather more time than I have - not only in writing the code, developing the test suite, but then tracking the standard as it emerges. Even if I had the time, I don't actually have the inclination, because it's not something that really interests me enough right now. Sorry :)
jez, 2nd Feb 2008
Thank you, this is precisely what I wanted.
I've been HTML coding since 3.2. Long after HTML8.0 has formally broken and obsoleted HTML5.3 and previous, tag-soup still works.
Get in touch
Your questions, requests, updates and patches are all welcome. I can be contacted at firstname.lastname@example.org.