<?xml version="1.0"?><!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd"><rss version="0.91"><channel>  <title>Arabica XML Toolkit in C++</title>  <description>Arabica Development Log</description>  <link>/arabica/log</link>  <language>en-gb</language>  <webMaster>jez@jezuk.co.uk</webMaster>  
<item><title><a href='http://jezuk.dnsalias.net/viewvc?view=rev&amp;revision=1221'>Visual Studio, how I curse your useless warning C4800</a></title><link>http://www.jezuk.co.uk/arabica/log?id=3753</link><description><![CDATA[ <p><a href='http://msdn.microsoft.com/en-us/library/b6801kcy.aspx'>'<em>type</em>' : forcing value to bool 'true' or 'false' (performance warning)</a><br/>
Performance warning, my arse.</p> ]]></description></item>
<item><title>XSLT: Implementing position matches[2]</title><link>http://www.jezuk.co.uk/arabica/log?id=3711</link><description><![CDATA[ <p>Revisiting position matches at the moment.  I've <a href='/jez/2007December#3550'>described how position matches need to be written</a>, and the code works and works well.</p>
<p>Except when it doesn't.  It actually fails for <a href='http://wordaligned.org/articles/stop-the-clock-squash-the-bug' title='Stop the clock, squash the bug'>the less common cases</a>, and it took me a little while to work out why.</p>
<p>Here's the pattern from the test case that showed the problem
<blockquote>
<code>foo[@att1='c'][2]</code>
</blockquote>
It wants to match the second node in the set of foo elements with an att1 attribute containing 'c'.</p>
<p>Arabica rewriting finds the positional predicate and applies its incorrect magic.  The rewritten pattern is equivalent to 
<blockquote>
<code>foo[2][@att1='c']</code>
</blockquote> 
which picks out the second foo node if it has an att1 attribute containing 'c'.  The difference isn't immediately clear, even when you have both the incorrect output and the expected output sitting in front of you. </p>
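<p>To make the difference concrete, here is a minimal C++ sketch.  It is not Arabica code: the sibling <code>foo</code> elements are modelled as a plain list of their <code>att1</code> values, and the function names <code>filter_then_position</code> and <code>position_then_filter</code> are made up for illustration.  Each returns the zero-based index of the matching sibling, or -1 if nothing matches.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: one entry per sibling <foo>, in document order,
// holding that element's att1 value.

// foo[@att1='c'][2]: filter on the attribute first, then take the
// element at the given 1-based position among the survivors.
int filter_then_position(const std::vector<std::string>& foos,
                         const std::string& value, int position) {
  assert(position >= 1);  // XPath positions start at 1
  int seen = 0;
  for (std::size_t i = 0; i < foos.size(); ++i) {
    if (foos[i] == value && ++seen == position)
      return static_cast<int>(i);
  }
  return -1;
}

// foo[2][@att1='c']: take the element at the given 1-based position
// first, then check whether it has the wanted attribute value.
int position_then_filter(const std::vector<std::string>& foos,
                         const std::string& value, int position) {
  assert(position >= 1);
  const std::size_t index = static_cast<std::size_t>(position) - 1;
  if (index < foos.size() && foos[index] == value)
    return static_cast<int>(index);
  return -1;
}
```

<p>For siblings with <code>att1</code> values a, c, b, c and a position of 2, <code>filter_then_position</code> selects the fourth <code>foo</code> while <code>position_then_filter</code> selects the second, so the two patterns can both match and still pick different nodes.</p>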
<p> My small crumb of comfort is that if you do want <code>foo[2][@att1='c']</code>, Arabica does do the right thing.  That gave me the clue.  Arabica implements XSLT match patterns by rewriting them as XPath expressions.
<blockquote>
<code>foo[2][@att1='c']</code>
</blockquote>
is rewritten as an XPath along the lines of 
<blockquote>
<code>self::foo[. = parent::*/foo[2]][@att1='c']</code>
</blockquote>
</p>
<p>My faulty rewriting of 
<blockquote>
<code>foo[@att1='c'][2]</code>
</blockquote>
was 
<blockquote>
<code>self::foo[@att1='c'][. = parent::*/foo[2]]</code>
</blockquote>
which you should be able to see is logically identical to the above.  What I need is 
<blockquote>
<code>self::foo[. = parent::*/foo[@att1='c'][2]]</code>
</blockquote>
I had to work quite hard to see that this is what it should be, despite being pretty familiar with XPath and XSLT use and implementation.  It's only been part of my working toolkit for the last 8 years or so, after all.  When rewriting a positional match, any preceding predicates must be folded into the rewritten expression.  Now I see it, it's pretty obvious.  </p> ]]></description></item>
<item><title>Taggle: Parameterised on string_type</title><link>http://www.jezuk.co.uk/arabica/log?id=3609</link><description><![CDATA[ <p>The Taggle parser <a href='http://jezuk.dnsalias.net/viewvc/branches/tagsoup-port/examples/Taggle/'>in subversion</a> is now parameterised on string_type and string_adaptor, in exactly the same way as the usual Arabica XMLReader class.  The two are now equivalent, which means that all the SAX filters, the DOM builder, XPath, and so on can be applied to Taggle.</p>

 ]]></description></item>
<item><title>Moments before I'm about to go to bed, I discover Taggle fails (in a coredumpy way) for documents which have a DOCTYPE declaration</title><link>http://www.jezuk.co.uk/arabica/log?id=3596</link><description><![CDATA[ <p>Moments before I'm about to go to bed, I discover Taggle fails (in a coredumpy way) for documents which have a DOCTYPE declaration.  Will have a look at a fix in the morning.</p> ]]></description></item>
<item><title>Taggle: Building the code</title><link>http://www.jezuk.co.uk/arabica/log?id=3594</link><description><![CDATA[ <p>If you've grabbed the code from subversion:<br/>
<pre>svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port</pre>
you might be wondering how to build it.
</p>
<p>For Visual Studio 2005 users, open up the <code>vs8\taggle.sln</code> project and build away.  It should just work.  If it doesn't, then check the <a href='/arabica/howtobuild'>project build notes</a> for information on setting up search paths and things.</p>
<p>For Unixy types, you will need a mighty three steps:<br/>
<ol>
  <li><code>autoreconf</code> - to create the configure script</li>
  <li><code>./configure</code> - to dig out where the various bits and pieces Arabica needs are, and to create the <code>Makefile</code>s</li>
  <li><code>make</code> - to, erm, make everything</li>
</ol> 
</p>
<p>Problems, questions, issues?  <a href='mailto:jez@jezuk.co.uk'>Get in touch</a>.</p> ]]></description></item>
<item><title>Taggle: And there it is ...</title><link>http://www.jezuk.co.uk/arabica/log?id=3591</link><description><![CDATA[ <p>Taggle, Arabica's port of the <a href='http://tagsoup.info'>TagSoup</a> HTML parser, now builds and runs.  It dodges pretty much every encoding issue on the planet, but as a first go it's really quite pleasing.  Give it this -<br/>
<br/>
<code>This is &lt;B&gt;bold, &lt;I&gt;bold italic, &lt;/b&gt;italic, &lt;/i&gt; normal text
</code><br/>
<br/>
and get this<br/>
<br/>
<code>
&lt;html&gt;<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;body&gt;This is<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;b&gt;bold,<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;i&gt;bold italic, &lt;/i&gt;<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/b&gt;<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;i&gt;italic, &lt;/i&gt;<br/>
normal text<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&lt;/body&gt;<br/>
&lt;/html&gt;<br/>
</code>
(Ok, you have to squint a bit at the indenting, but that's a separate issue.)
</p>

<p>If you want to have a play, check out the tagsoup-port branch from subversion:  <pre>svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port</pre></p>

<p>In <code>examples/Taggle</code>, there's a little command line application that reads HTML documents and prints the corrected markup to the console.</p>
<p>I'll merge this back into the trunk in the next few days.</p>
 ]]></description></item>
<item><title>Taggle: Bringing HTML Parsing to Arabica</title><link>http://www.jezuk.co.uk/arabica/log?id=3589</link><description><![CDATA[ <p>After a rather intense return to work after Christmas, I'm taking a bit of a break from Arabica's XSLT development for something a bit lighter - porting John Cowan's excellent TagSoup package to C++ and Arabica.  <a href='http://tagsoup.info/'>TagSoup</a>, if you're not familiar with it, is 
<blockquote>
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: <a href='http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html'>poor, nasty and brutish</a>, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. 
</blockquote>
</p>
<p>Obviously, if you have a SAX parser you can apply all your standard XML techniques - not only SAX filters, but building a DOM, <a href='http://www.jezuk.co.uk/cgi-bin/view/jez/2005October#2643'>applying XPaths</a>, or XSLT transformations as well.</p>
<p>Cowan describes what TagSoup does as 
<blockquote>
<p>TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.</p>

<p>The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:<br/>
<br/>
<code>This is &lt;B&gt;bold, &lt;I&gt;bold italic, &lt;/b&gt;italic, &lt;/i&gt;normal text</code><br/>
<br/>
gets correctly rewritten as:<br/>
<br/>
<code>This is &lt;b&gt;bold, &lt;i&gt;bold italic, &lt;/i&gt;&lt;/b&gt;&lt;i&gt;italic, &lt;/i&gt;normal text.</code>
</p>
</blockquote>
</p>
<p>Looks simple, doesn't it?  Well, that's a simple example and it's still a tricky and awkward result to get in practice.  Cowan's patience in pursuing this and what looks like a rather elegant solution are to be applauded.  Porting his code to C++ has been pretty quick and painless so far, and I expect the new piece, which I've called Taggle, to be finished pretty soon.  Arabica will be stronger for it - thanks John!</p>
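<p>The overlapping-tag example quoted above can be reproduced with a very small sketch.  This is neither TagSoup's nor Taggle's actual algorithm: it assumes the input has already been split into tokens, treats every element as restartable, and ignores attributes.  The <code>Token</code> type and the <code>restart_overlapping</code> function are hypothetical names used only here.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical pre-tokenised input: character data, or a tag name.
struct Token {
  enum Kind { Text, Open, Close } kind;
  std::string text;
};

// Element names are compared and emitted in lower case.
std::string lower(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return name;
}

// Emits properly nested markup. When a close tag names an element that is
// not innermost, the elements nested inside it are closed first and then
// reopened ("restarted") after it.
std::string restart_overlapping(const std::vector<Token>& tokens) {
  std::vector<std::string> open;  // open elements, innermost last
  std::string out;
  for (const Token& token : tokens) {
    if (token.kind == Token::Text) {
      out += token.text;
      continue;
    }
    const std::string name = lower(token.text);
    assert(!name.empty());  // tags must carry an element name
    if (token.kind == Token::Open) {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }
    // Close tag: look for the nearest open element with this name.
    auto match = std::find(open.rbegin(), open.rend(), name);
    if (match == open.rend())
      continue;  // stray close tag, dropped in this sketch
    // match.base() is the first element nested inside the matched one.
    std::vector<std::string> interrupted(match.base(), open.end());
    open.erase(match.base() - 1, open.end());
    for (auto it = interrupted.rbegin(); it != interrupted.rend(); ++it)
      out += "</" + *it + ">";
    out += "</" + name + ">";
    for (const std::string& restarted : interrupted) {
      open.push_back(restarted);
      out += "<" + restarted + ">";
    }
  }
  // Close anything the input left open.
  for (auto it = open.rbegin(); it != open.rend(); ++it)
    out += "</" + *it + ">";
  return out;
}
```

<p>Fed the tokens of the example, the sketch closes <code>i</code> and <code>b</code> at the <code>&lt;/b&gt;</code>, then reopens <code>i</code> so that the later <code>&lt;/i&gt;</code> has something to close.  A real parser has far more to deal with, such as implied elements, attributes and deciding which elements it is sensible to restart.</p>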
 ]]></description></item>
<item><title>XSLT: Implementing position matches</title><link>http://www.jezuk.co.uk/arabica/log?id=3550</link><description><![CDATA[ <p>Earlier this week, I outlined the equivalent XPath expressions for <a href='/jez/2007December#3546'>XSLT matches which use positions</a>.  I've been avoiding implementing this for a while for a couple of reasons.  Firstly, I thought it would take more time than I generally have in one Arabica sitting, i.e. more than an hour.  Secondly, I wasn't quite sure how I'd actually go about it.  I knew the equivalent expressions were correct, I just didn't know how I was going to arrive at them in the code. </p>
<p>Happily, writing it out and then reading it back again later triggered the little flash I needed, and it turns out to be really quite easy.  I've spent a bit of time on it this morning, having wrapped up the paying work for Christmas, and I've got the first pass working.</p>
<p>Within Arabica, each match pattern is represented as one or more steps, where each step in the pattern is represented as a TestStepExpression.  A TestStepExpression contains some kind of node test (obviously) and one or more predicates.  The predicate is the bit in square brackets.  A match pattern like this 
<blockquote>
  <code>a[3]</code>
</blockquote>
would be compiled into a TestStepExpression containing a NameNodeTest checking for 'a', and a single predicate '3'.</p>
<p>Finding the expressions that need rewriting is simply
<blockquote>
  <code>foreach step in the steplist<br/>
  &nbsp;&nbsp;foreach predicate in the step predicatelist<br/>
  &nbsp;&nbsp;&nbsp;&nbsp;if predicate is type NUMBER<br/>
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;predicate = rewrite(predicate)
</code>
</blockquote>
The code is <a href='http://jezuk.dnsalias.net/viewvc/trunk/include/XPath/impl/xpath_match_rewrite.hpp?view=markup' title='xpath_match_rewrite.hpp in Subversion repository'>here</a>, if you're interested.</p>
<p>This kind of transformation could be done earlier on, on the <a href='http://en.wikipedia.org/wiki/Abstract_syntax_tree' title='Abstract syntax tree'>AST</a> produced by Arabica's XPath parser.  It's easier, however, to operate on the compiled version.  For instance, in the case above the numeric predicate could be a number literal, the result of a function, or the result of an arbitrarily complex calculation.  Detecting all those cases is actually quite tricky at the AST level.  Once the compiled objects are generated, we can just check the predicate's return type.</p>
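<p>The loop above might look something like the following in C++.  This is a sketch under stated assumptions, not the code in <code>xpath_match_rewrite.hpp</code>: <code>Step</code>, <code>Predicate</code> and <code>ValueType</code> are simplified, hypothetical stand-ins for the compiled objects, predicates are kept as source text, and only name tests are handled.  It also folds any preceding predicates into the rewritten one, which the follow-up entry on position matches shows is necessary.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the compiled step and predicate objects.
enum class ValueType { Bool, Number, String, NodeSet };

struct Predicate {
  std::string text;  // predicate source, without the square brackets
  ValueType type;    // return type reported by the compiled expression
};

struct Step {
  std::string node_test;  // element name, e.g. "foo"
  std::vector<Predicate> predicates;
};

// Rewrites every numeric predicate and returns how many were rewritten.
// A numeric predicate, together with the predicates before it, becomes
// a single test of the form ". = parent::*/name[earlier][number]".
int rewrite_numeric_predicates(std::vector<Step>& steps) {
  int rewritten = 0;
  for (Step& step : steps) {
    assert(!step.node_test.empty());
    std::vector<Predicate> result;
    std::string seen;  // bracketed source of the predicates so far
    for (const Predicate& predicate : step.predicates) {
      seen += "[" + predicate.text + "]";
      if (predicate.type == ValueType::Number) {
        result.clear();  // everything so far is folded into one predicate
        result.push_back(Predicate{
            ". = parent::*/" + step.node_test + seen, ValueType::Bool});
        ++rewritten;
      } else {
        result.push_back(predicate);
      }
    }
    step.predicates = result;
  }
  return rewritten;
}
```

<p>With this sketch, the single predicate of <code>a[3]</code> becomes <code>. = parent::*/a[3]</code>, and the two predicates of <code>foo[@att1='c'][2]</code> collapse into <code>. = parent::*/foo[@att1='c'][2]</code>.</p>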
<p>Finding predicates containing calls to the last() or position() functions is going to be slightly more work, and probably slightly fiddlier.  I'm off to have a crack at that now.</p>
<hr width='75%'/>
<p>[An hour or so later] ... and I think that's it.  Each predicate is an expression, and an expression may contain other expressions and so on.  An expression might model a comparison operator, for instance, which would contain an expression each for the left- and right-hand operands.  I put together <a href='http://en.wikipedia.org/wiki/Visitor_pattern' title="Visitor pattern.  I struggle to match the pattern description to the action, but that's what I did.">a little walker</a> to zoom through this expression tree looking for instances of the last or position functions.  If it finds one, then I just generate the rewrite expression in exactly the same way as before.</p>
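<p>A walker of this kind could be sketched as a plain recursive search.  The <code>Expr</code> type, the <code>node</code> helper and <code>uses_position_function</code> are hypothetical names, not Arabica's classes, and the sketch uses simple recursion rather than a full visitor.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node standing in for the compiled expressions.
struct Expr {
  enum Kind { Literal, Operator, FunctionCall } kind;
  std::string name;  // literal text, operator symbol or function name
  std::vector<std::shared_ptr<Expr>> operands;
};

// Convenience builder for the sketch.
std::shared_ptr<Expr> node(Expr::Kind kind, const std::string& name,
                           std::vector<std::shared_ptr<Expr>> operands = {}) {
  auto expr = std::make_shared<Expr>();
  expr->kind = kind;
  expr->name = name;
  expr->operands = std::move(operands);
  return expr;
}

// Depth-first walk: true if last() or position() is called anywhere
// in the expression tree.
bool uses_position_function(const Expr& expr) {
  if (expr.kind == Expr::FunctionCall &&
      (expr.name == "last" || expr.name == "position"))
    return true;
  for (const std::shared_ptr<Expr>& operand : expr.operands) {
    assert(operand);  // the sketch never stores null operands
    if (uses_position_function(*operand))
      return true;
  }
  return false;
}
```

<p>A tree for <code>position() = last() - 1</code> is reported as positional, while one for <code>name() = 'a'</code> is not.  A real walker would also need to stop at nested predicates, where <code>position()</code> and <code>last()</code> refer to a different context.</p>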
 ]]></description></item>
<item><title>XSLT: Position matches</title><link>http://www.jezuk.co.uk/arabica/log?id=3546</link><description><![CDATA[ <p>Almost <a href='/jez/2005December#2717' title='16 December 2005'>exactly two years ago</a> (glurk!), I wrote a little item about XSLT match patterns and gave a clue to how I implement them.  A pattern like
<blockquote><code>chapter/para</code></blockquote>
is equivalent to an XPath expression like
<blockquote><code>boolean(self::para/parent::chapter)</code></blockquote>
That's how Arabica actually implements it, rewriting the match pattern as an XPath expression during compilation.</p>

<p>This works in every case, except those involving positional matches,
<blockquote><code>chapter/para[1]</code></blockquote>
for instance.  In the simple case above, the rewriting is actually quite simple.  You can push each step, here <code>chapter</code> then <code>para</code>, onto a <a href='http://en.wikipedia.org/wiki/LIFO' title='Last In First Out'>LIFO stack</a> and then pull them off again adding the extra bit of wrapping as you go.  For positional matches, the rewriting is a little more involved.  A match like 
<blockquote><code>chapter/para[last()]</code></blockquote> needs to be rewritten as something along the lines of 
<blockquote><code>self::para[. = parent::*/para[last()]]/parent::chapter</code></blockquote> although you could probably simplify it to 
<blockquote><code>self::para[. = parent::chapter/para[last()]]</code></blockquote></p>
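<p>The stack-based rewriting of the simple case can be sketched in a few lines of C++.  This is an illustration only, assuming a pattern made of plain child steps separated by single <code>/</code> characters; it does not handle <code>//</code>, predicates or attribute steps, and <code>rewrite_match_pattern</code> is a hypothetical name rather than an Arabica function.</p>

```cpp
#include <cassert>
#include <stack>
#include <string>

// Rewrites a pattern such as "chapter/para" as the XPath expression
// "boolean(self::para/parent::chapter)".
std::string rewrite_match_pattern(const std::string& pattern) {
  assert(!pattern.empty());

  // Push each step in pattern order: chapter, then para.
  std::stack<std::string> steps;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type slash = pattern.find('/', start);
    if (slash == std::string::npos) {
      steps.push(pattern.substr(start));
      break;
    }
    steps.push(pattern.substr(start, slash - start));
    start = slash + 1;
  }

  // Pull them off again: the last step tests the node itself, and
  // every earlier step tests the next parent up.
  std::string xpath = "self::" + steps.top();
  steps.pop();
  while (!steps.empty()) {
    xpath += "/parent::" + steps.top();
    steps.pop();
  }
  return "boolean(" + xpath + ")";
}
```

<p>Called with <code>chapter/para</code>, the sketch returns <code>boolean(self::para/parent::chapter)</code>, the expression given earlier in this entry.</p>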
<p>Arabica::XSLT doesn't do this rewriting at the moment, which is, I feel, the largest single hole in it.  I'm aiming to have a go at it in the next few days.  If I can crack it relatively quickly, maybe by this time next year I'll be looking for a new project :)</p>
<p>By the way, positional matches must, at least as far as I can see, be evaluated in this way (even if the underlying implementation is different).  This is why matches like this can be quite expensive in terms of runtime, and are generally discouraged.</p>

 ]]></description></item>
<item><title>XSLT: Test case results spreadsheet</title><link>http://www.jezuk.co.uk/arabica/log?id=3468</link><description><![CDATA[ <p>Uploaded the XSLT test results spreadsheet to Google Docs and, should you be so motivated, <a href='http://spreadsheets.google.com/pub?key=pQSUogJPG5pARCFUTTYNhWg&hl=en'>the latest results will always be available here</a>.</p> ]]></description></item>

</channel></rss>