| JezUK Ltd - The Coffee Grounds - January 2008 |
| << December 2007 | February 2008 >> |
Moments before I'm about to go to bed, I discover Taggle fails (in a coredumpy way) for documents which have a DOCTYPE declaration. Will have a look at a fix in the morning.
As in those well versed in the use of cosmetics? [added 31st Jan 2008]
If you've grabbed the code from subversion:
svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
you might be wondering how to build it.
For Visual Studio 2005 users, open up the vs8\taggle.sln project and build away. It should just work. If it doesn't, then check the project build notes for information on setting up search paths and things.
For Unixy types, you will need a mighty three steps:
autoreconf - to create the configure script
./configure - to dig out where the various bits and pieces Arabica needs are, and to create the Makefiles
make - to, erm, make everything
Problems, questions, issues? Get in touch.
Taggle, Arabica's port of the TagSoup HTML parser, now builds and runs. It dodges pretty much every encoding issue on the planet, but as a first go it's really quite pleasing. Give it this -
This is <B>bold, <I>bold italic, </b>italic, </i> normal text
and get this
<html>
<body>This is
<b>bold,
<i>bold italic, </i>
</b>
<i>italic, </i>
normal text
</body>
</html>
(Ok, you have to squint a bit at the indenting, but that's a separate issue.)
If you want to have a play, check out the tagsoup-port branch from subversion:
svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
In examples/Taggle, there's a little command line application that reads HTML documents and prints the corrected markup to the console.
I'll merge this back into the trunk in the next few days.
Time and inclination. Porting TagSoup to C++ took me a few hours. It was fun, and quite an easy win. Having done it, I'm surprised that nobody's done it before.
Writing an HTML5 parser needs rather more time than I have - not only in writing the code, developing the test suite, but then tracking the standard as it emerges. Even if I had the time, I don't actually have the inclination, because it's not something that really interests me enough right now. Sorry :)
[added 2nd Feb 2008]
After a rather intense return to work after Christmas, I'm taking a bit of a break from Arabica's XSLT development for something a bit lighter - porting John Cowan's excellent TagSoup package to C++ and Arabica. TagSoup, if you're not familiar with it, is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
Obviously, if you have a SAX parser you can apply all your standard XML techniques - not only SAX filters, but building a DOM, applying XPaths, or XSLT transformations as well.
Cowan describes what TagSoup does as
TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.
The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
Looks simple, doesn't it? Well, that's a simple example and it's still a tricky and awkward result to get in practice. Cowan's patience in pursuing this, and what looks like a rather elegant solution, is to be applauded. Porting his code to C++ has been pretty quick and painless so far, and I expect the new piece, which I've called Taggle, to be finished pretty soon. Arabica will be stronger for it - thanks John!
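For the curious, here's a minimal sketch of the tag-restart idea from that example. To be clear, this is not TagSoup's or Taggle's actual code. It's a toy written purely for illustration, the function name restart_overlapping_tags is made up, and it assumes the input only contains bare <name> and </name> tags: no attributes, comments, entities, or implied elements.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Toy illustration only (hypothetical helper, not part of TagSoup or Arabica).
// Keeps a stack of open elements; when a close tag arrives for an element
// that is not on top of the stack, everything above it is closed and then
// reopened ("restarted") once the target element has been closed.
std::string restart_overlapping_tags(const std::string& html)
{
  std::vector<std::string> open;  // stack of currently open element names
  std::string out;

  std::string::size_type pos = 0;
  while(pos < html.size())
  {
    if(html[pos] != '<')
    {
      out += html[pos++];  // ordinary text passes straight through
      continue;
    }

    const std::string::size_type end = html.find('>', pos);
    if(end == std::string::npos)
    {
      out.append(html, pos, std::string::npos);  // unterminated tag: treat as text
      break;
    }

    std::string name = html.substr(pos + 1, end - pos - 1);
    pos = end + 1;
    const bool closing = !name.empty() && name[0] == '/';
    if(closing)
      name.erase(0, 1);
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if(!closing)
    {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }

    // A close tag for something that isn't open is silently dropped.
    if(std::find(open.begin(), open.end(), name) == open.end())
      continue;

    // Close everything above the target, remembering it for restart.
    std::vector<std::string> restart;
    while(open.back() != name)
    {
      out += "</" + open.back() + ">";
      restart.push_back(open.back());
      open.pop_back();
    }
    out += "</" + name + ">";
    open.pop_back();

    // Reopen the interrupted elements, outermost first.
    for(auto it = restart.rbegin(); it != restart.rend(); ++it)
    {
      open.push_back(*it);
      out += "<" + *it + ">";
    }
  }

  // Close anything still open at the end of the input.
  while(!open.empty())
  {
    out += "</" + open.back() + ">";
    open.pop_back();
  }
  return out;
} // restart_overlapping_tags
```

Feed it the overlapping example above and it returns This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text. The real thing has to cope with everything this sketch ignores, which is where the tricky and awkward part comes in.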
In the unlikely event you sent me email between about 17:00 on Friday and 9ish this morning, I won't have received it and you may want to send it again.
In the basket under the bed, I've got 18 unread books, and there are others lurking on the shelves downstairs. On my desk I have 6 unread technical books. The comics to read pile is about 12 inches and 22 separate volumes deep.
How did I get here?
You may recall me mentioning I've got a stack of around ten Patrick O'Brian 'Master and Commander' novels. Now, while historical adventure-soaps may be my bag, I think I'm either going to mothball these books out of sight for my retirement, or just bung 'em in the charity shop.
There are too many other good things to read, perhaps a touch of ruthlessness is in order? [added 25th Jan 2008]
I'm glad I'm not as badly off as I might be with this.
On a completely unrelated note, this: http://russl.wordpress.com/2008/01/25/splenetic-memetics-birmingham-blog-tig-tag-tick [added 25th Jan 2008]
Oh, that and being clipped by the mirror on the Focus. Twot. [added 28th Jan 2008]
License Revoked was a good title. Quantum of Solace just isn't (not for a Bond movie), regardless of the reader's vocab. Perhaps they'll change it to Tea and Sympathy ;) [added 25th Jan 2008]
For my Fantasy GDFAF I rated the chances of seeing Ministry as nil, because they didn't tour any more. But I was wrong. Trip to Wolverhampton, anyone?
[added 24th Jan 2008]
Updating my low-grade train spotter list, I discovered the 390-009 Virgin Queen that I collected in October has been renamed Treaty of Union. Realised immediately I would have to ride the train again.
BTW I have a feeling that most thrills so far (progs from Jan to April) have been exceptionally strong - a feeling I didn't get at all with 2006's output. I pray it continues... [added 17th Jan 2008]
What then? A virtual 'mind' format? Close your eyes and *feel* what the author felt? Whoa there, I've gone too far. [added 16th Jan 2008]
SMTP error from remote mail server after MAIL FROM:jez@jezuk.co.uk SIZE=5446:
host mx1.uk.tiscali.com [212.74.100.147]: 550 mail not accepted from blacklisted IP address [195.188.213.7]
Oh dear. 195.188.213.7 isn't me. I wonder who it belongs to?
jez@riven ~
$ nslookup 195.188.213.7
Server: ns1-gat.blueyonder.net
Address: 62.31.144.39:53
Name: smtp-out4.blueyonder.co.uk
Address: 195.188.213.7
Hmmm.
Spamhaus takes a rather hardline view to blacklisting, and why not? However, it does seem a little silly for one of the major UK ISPs to be dropping emails sent from one of the other major UK ISPs.
I have spent a lot of time online searching on my ISP address, and this is the first time I have seen a negative thing about it.
The latest check on the link below says I'm no risk. I know that I am no risk, so what the heck is happening?
http://www.trustedsource.org/query/195.188.213.7
[added 27th Jun 2008]
ISP means internet service provider, and in this case it's Blueyonder (now Virgin Media). The address 195.188.213.7 isn't assigned to you, Dave, but to one of Blueyonder's outbound email relays. When you send your email, it disappears up the wire, bounces around inside Blueyonder for a bit, before emerging for its onward journey through one of the outbound mail relays (there are several).
The reason Tiscali are kicking back your mail is because some other Blueyonder customer (in fact, probably many of them) is sending spam (almost certainly unknowingly). One of the spam monitoring services has noticed and, as a result, blacklisted Blueyonder. Tiscali (because they are stupid) appear to be blindly following the blacklist, and are kicking back your email.
If you wait a while (hour, couple of hours, day), somebody at Blueyonder will have spoken to the people who run the blacklist (again), and got themselves unlisted. Hopefully they'll also talk to Tiscali (again), but I wouldn't bet on it not happening again.
[added 27th Jun 2008]
Tiscali now appear to have sorted them out, now it's another one rejecting our emails.
I don't send spam but seem to get everyone else's, even though I paid £90 for Norton 360 to stop it all.
I think that everyone who sells viagra or thinks that I want a loan has my email address!
The message we get is:
SMTP error from remote mail server after initial connection:
host manchester.sin1.netline.net.uk [213.40.66.235]:
550-Host 195.188.213.7 is listed at www.uceprotect.net as a spam source.
I even get messages returned when I reply to my friends.
Who knows, I'll be rejecting myself soon! (or is that Injecting myself)? [added 5th Jul 2008]

The schedule for accu2008 has been published and registration is now open. As ever, it's a pretty fantastic line-up including process-giant Tom Gilb, Erlang inventor Joe Armstrong, and Haskell big-brain Simon Peyton Jones.
If you have even the remotest care about your software development, you should consider going. It's not just C++ or Java or what some supplier thinks you should be interested in this week, it's a solid, deep programme on software development, process, project management, and so on. Ask your boss if he'll pay. He really should. If you book before the end of January it's £450 for ACCU members, or £550 for non-members. (Hint: It's ok to join then book.) For a four day conference, that's an utter bargain. (Compare with, say, DevWeek which charges over £1000 for three days and has a timetable dominated by Microsoft staff talking about as-yet-unreleased Microsoft tools.)
But you would say that, Jez, you're the ACCU Chair ...
Well, yes, I am the Chair. But I joined in running ACCU because things like the conference were so good; I don't say they're good because I'm Chair.
I can't imagine you haven't been to http://www.throwingmusic.com/freemusic/ - interesting that she was using the "tip jar" model quite some time before those Oxford upstarts... [added 5th Jan 2008]
template<typename InIter, typename OutIter, typename Pred>
OutIter copy_while(InIter first, InIter last, OutIter dest, Pred pred)
{
  for(; first != last && pred(*first); ++first, ++dest)
    *dest = *first;
  return dest;
} // copy_while

template<typename InIter, typename OutIter, typename Pred>
OutIter copy_until(InIter first, InIter last, OutIter dest, Pred pred)
{
  return copy_while(first, last, dest, std::not1(pred));
} // copy_until
For a warm smug feeling, briefly outline possible problems with the above code.
< , , > ( , , , ) { etc.
So I see quite a few problems... [added 4th Jan 2008]
I don't think it liked my markup. I've changed it, but it hasn't noticed yet. Let's see what happens :) [added 4th Jan 2008]
The slightly tedious answer, by the way, is that the use of std::not1 in copy_until requires pred be an adaptable predicate (by, for example, having Pred derive from std::unary_function). This means using a simple free function as your predicate, which people expect to be able to do, isn't possible. I guess you could dink around creating traits classes and partial specialisations and so on, but it's easier just to rewrite as:
template<typename InIter, typename OutIter, typename Pred>
OutIter copy_until(InIter first, InIter last, OutIter dest, Pred pred)
{
  for(; first != last && !pred(*first); ++first, ++dest)
    *dest = *first;
  return dest;
} // copy_until
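As a quick check that the rewritten version does take a plain free function, here's a small sketch. The predicate is_space and the helper first_word are hypothetical names invented for this example, and copy_until is repeated only so the snippet stands alone.

```cpp
#include <iterator>
#include <string>

// copy_until as rewritten above, repeated so the sketch is self-contained.
template<typename InIter, typename OutIter, typename Pred>
OutIter copy_until(InIter first, InIter last, OutIter dest, Pred pred)
{
  for(; first != last && !pred(*first); ++first, ++dest)
    *dest = *first;
  return dest;
} // copy_until

// Hypothetical predicate: an ordinary free function with no
// argument_type/result_type typedefs, so it is not "adaptable".
bool is_space(char c) { return c == ' '; }

// Hypothetical helper: returns the text up to, but not including,
// the first space (or the whole string if there is no space).
std::string first_word(const std::string& text)
{
  std::string word;
  copy_until(text.begin(), text.end(), std::back_inserter(word), is_space);
  return word;
} // first_word
```

Passing is_space straight to the std::not1 version would not compile, because a function pointer has no nested argument_type. On newer compilers, C++17's std::not_fn negates any callable without needing those typedefs, so the copy_while-based version could probably be kept that way instead.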
And no, Allan, this isn't an argument to program in some other language.
[added 4th Jan 2008]
Back at my desk for the first time in ages. The attic's going to be replastered in the next few days. This is both good, because it should mean I won't finish the day covered in a fine layer of dust, and bad, because I had to move everything into the bedroom next door. The JezUK computing infrastructure isn't especially large these days - couple of boxes, couple of flatscreens, wifi-router-hub, and a linkstation. The total distance moved was about 10 feet. Somehow I managed to kill my keyboard, and for several long panicky minutes it looked like my main box wasn't going to start. First off it fired up fine, but when I realised the keyboard was dead and I plugged in another, it abruptly powered down. And didn't come on. And didn't come on. And didn't come on. And didn't come on. And didn't come on. And didn't come on. And didn't come on. Suddenly, as bitter tears of impotent rage and panic sprang to my eyes, it powered up. Hopefully it won't be quite so horrible when I get to move back.
What happened as I powered up the dead drive (to see if I could reformat and eke out some more use)? It fired up with no problems, and has run well since. WTF?
Computers. You can keep 'em.
BTW Happy New Year! [added 2nd Jan 2008]
I have nearly all my work sitting in version control, so I wasn't especially worried on that score. It was more the time - fixing the machine, rebuilding the install and all that stuff - that I was worrying about. I have a job, a deadline, and not a lot of wiggle room. Let's not dwell on the might-have-beens :)
[added 2nd Jan 2008]