Taggle, Arabica's port of the TagSoup HTML parser, now builds and runs. It dodges pretty much every encoding issue on the planet, but as a first go it's really quite pleasing. Give it this -
This is <B>bold, <I>bold italic, </b>italic, </i> normal text
and get this
<html> <body>This is <b>bold, <i>bold italic, </i> </b> <i>italic, </i> normal text </body> </html>
(Ok, you have to squint a bit at the indenting, but that's a separate issue.)
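For anyone wondering how overlapping tags like these can be untangled, here's a toy sketch of the general close-and-restart idea: keep a stack of open elements, and when an end tag arrives for something that isn't on top, close everything above it and then reopen those elements afterwards. To be clear, this is not Taggle's or TagSoup's code. The function name repair_nesting is made up for illustration, it ignores attributes, comments and entities, it treats every element as restartable, and it doesn't add the html and body wrappers.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Toy illustration only: NOT Taggle's or TagSoup's actual implementation.
// Handles bare tags such as <b> and </b>; attributes are not parsed.
std::string repair_nesting(const std::string& in)
{
  std::vector<std::string> open;  // currently open elements, innermost last
  std::string out;
  std::size_t pos = 0;

  while (pos < in.size())
  {
    if (in[pos] != '<')
    {
      // Character data: copy it through up to the next tag.
      std::size_t next = in.find('<', pos);
      if (next == std::string::npos)
        next = in.size();
      out.append(in, pos, next - pos);
      pos = next;
      continue;
    }

    std::size_t end = in.find('>', pos);
    if (end == std::string::npos)
    {
      // Unterminated tag: pass the remainder through as text.
      out.append(in, pos, std::string::npos);
      break;
    }

    std::string name = in.substr(pos + 1, end - pos - 1);
    pos = end + 1;

    const bool closing = !name.empty() && name[0] == '/';
    if (closing)
      name.erase(0, 1);
    // Normalise case so <B> and </b> match.
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if (!closing)
    {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }

    // Stray end tag with no matching open element: drop it.
    if (std::find(open.begin(), open.end(), name) == open.end())
      continue;

    // Close everything above the matching element, remembering what we closed.
    std::vector<std::string> restart;
    while (open.back() != name)
    {
      out += "</" + open.back() + ">";
      restart.push_back(open.back());
      open.pop_back();
    }
    out += "</" + name + ">";
    open.pop_back();

    // Reopen the interrupted elements, outermost first.
    for (auto it = restart.rbegin(); it != restart.rend(); ++it)
    {
      out += "<" + *it + ">";
      open.push_back(*it);
    }
  }

  // End of input: close anything still open.
  while (!open.empty())
  {
    out += "</" + open.back() + ">";
    open.pop_back();
  }
  return out;
}
```

Feeding the sample input above to repair_nesting gives the same nesting as the corrected markup, minus the whitespace and the html and body wrappers. One difference worth noting: this sketch reopens interrupted elements immediately, so it can emit an empty element if nothing follows, which a real parser would want to avoid.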
If you want to have a play, check out the tagsoup-port branch from Subversion:
svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
In examples/Taggle, there's a little command-line application that reads HTML documents and prints the corrected markup to the console.
I'll merge this back into the trunk in the next few days.