
Freelance software grandad
software created
extended or repaired
Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations
A longitudinal study
First confirmed ice cream van sighting of 2022 occurred at 11:55 on March 26th. The van, Raja’s Ices, was heading south along Kingswood Road, playing Teddy Bears' Picnic.
This year’s Saturday observation further establishes the weekend as the most likely time of the week see that first ice cream van. Saturday and Sunday together account for more observations than Monday to Friday combined. The time, 11:55, is the earliest in the day yet recorded, with most sightings occurring between 16:00 and 17:15. The date is within expectation.
Field observations
Analysis
The full ice cream van data is available as a spreadsheet.
Forward projection
Projecting from data gathered thus far, we project that next year’s first ice cream van should appear on the afternoon of 9th March 2023, with an envelope of +5/-2 days.
Previous Years
STinC++ rewrites the programs in Software Tools in Pascal using C++
The bulk of chapter five was introducing and describing regular expressions. Regexes are ubiquitous these days, but you’d still struggle to find an introduction that’s as clear as that in Software Tools in Pascal. Kernighan and Plauger are both good writers individually, but together they’ve got some kind of superpower. Introducing regexes is all very well, but the point of chapter five is implemented a regular expression engine, and that was really good fun.
I don’t have a strong computer science background. I trained in electronic engineering and came into software sort of by accident. I have been doing this for getting on for 30 years, so I’ve picked up a few things along the way but I have a generally utilitarian approach. I probably could invert a tree, but I’d look for a library function first.[1] Anyway, I learned a whole lot building my little regex library, and like last chapter’s sort algorithms, unsophisticated as my implementation is, writing it is something I would never have otherwise written. It’s given me an appreciation of regexes I didn’t have before. In my code, for instance, I can see a couple of pretty obvious optimisations, mainly handling the no-match case. The Kleene star implementation is greedy, but I know how a non-greedy version would drop right in. I can see how the code could be extended to support capture groups. It’s a whole new little arena that’s opened up for me. Thanks Brian, thanks Bill.[2]
It helped cement some C++ for me too. There are always things you know, but perhaps not consciously or with full understanding. I knew, in an abstract kind of way, that C++ lambda functions have value semantics. However, since I’d pretty much always used lambdas functions as arguments to functions that were declared right at the call site, I hadn’t appreciated that they enabled, for example, runtime polymorphic behaviour in value objects. As the code compiles a regular expression into a pattern_matcher
object, it’s building a std::vector
of matcher
objects. Each of those matcher
objects can have different behaviour - matching the start of a line, a single character, one of a set of characters, or whatever. "Classically", if I’d written this code 10 years ago say, at the very least I’d have a base class, possibly abstract, and then a series of leaf classes to describe the different behaviours. Inside the pattern_matcher
we’d have a list of pointers, there’d be a whole lot of memory management, and we’d all be a bit nervous. Today, we can, thanks to lambda functions and std::function
, do that all with value objects. There’s no explicit memory handling, no worries about lifetime management, the code is compact and to the point. It’s fantastic.
In my write-ups of find
and compare
I quoted only a little bit of Kernighan and Plauger’s Pascal. As the problems we’re working on get that bit bigger, the C++ I’m writing is further and further away from the Pascal they wrote and comparisons between the two are less and less useful. We’ve established, for instance, that Kernighan and Plauger are hampered by Pascal’s lack of a proper string type, and there’s not much point in saying it again. The bigger problems they now present naturally have a bigger solution space, and C++ (or your current language of choice) lets us explore that solution space much more freely.
Two programs for chapter five, but only one coming up in chapter six, Editing. It’s going to be a bit of job, and I think I’m going to have to find a new way to write about it too. I hope you’ll follow along with me.
STinC++ rewrites the programs in Software Tools in Pascal using C++
Back at the end of the summer, I described find
, a tool for searching text using simple regular expressions. That was quite a chunky little project, involving describing the regular expression syntax, implementing the various matchers, coping with the complication that is the Kleene star, parsing out the pattern to build the matcher, and finally putting all together to search the input text.
With find
complete, Kernighan and Plauger move breezily on
Now that we know how to identify patterns in text, let’s consider a useful tool for making selective changes…
PROGRAM |
|
USAGE |
|
FUNCTION |
The replacement string
|
EXAMPLE |
To parenthesize all sums and differences of identifiers:
|
Well this seems straightforward enough. We’ve just built a little regular expression matcher which can find a pattern in an input line. If we extend that slightly to tell us where it matched, then everything should more or less drop out.
The body of our existing find
function is a simple loop. It tries to match the pattern to the start of the line, and to each subsequent position in turn until we either get a match or run off the end.
bool pattern_matcher::match(std::string const& line) const {
bool once = true; // need to try at least once, because even zero length input might match
for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
if (match_all(m_, seq))
return true;
}
return false;
}
The business of looping over the characters in line
is wrapped up in character_sequence
. When we match the pattern, the character_sequence
already knows where we are in the line, so extending the match
and match_all
functions to return that information to us is straightforward.
match_location pattern_matcher::find(std::string const& line) const {
bool once = true; // need to try at least once, because even zero length input might match
for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
auto match = match_all(m_, seq);
if (match.match)
return match;
}
return not_found;
}
bool pattern_matcher::match(std::string const& line) const {
return find(line).match;
}
Our change
program wants to replace all occurrences of a pattern, so we need one further small refinement to the find
function - an offset from the start of the line.
match_location pattern_matcher::find(std::string const& line, size_type offset) const {
bool once = true; // need to try at least once, because even zero length input might match
for (auto seq = character_sequence(line, offset); !seq.is_eol() || once; seq.advance(), once = false) {
auto match = match_all(m_, seq);
if (match.match)
return match;
}
return not_found;
}
With the match location available, it should be the work of moments to drop the replacement in. Right? Wrong!
In the code above, there’s already a little bit of special handling for zero-width matches - patterns which match not on characters, but on the spaces in between characters. The beginning of line pattern, %
, matches before the first character, the end of line pattern, $
, matches after the last character. Both will also match an empty line.
But what about A*
, zero or more letter As? Where does that match against, say, 1234
? Bleeding everywhere, that’s where. It matches before 1
, and also after 1
. It matches before 2
as well. After 1
and before 2
are, of course, the same location, and similarly after 2
and before 3
are the same location, and so on.
In the change routine, accommodating zero-width matches, particularly taking into account that a call to find
may return the same match location as the previous call, it turns out, is what we call a right pain in the arse. It took a reasonable amount of back and forth to get right. It kept feeling like we needed just another flag to account for this special case, or to show we needed to do something again just this once. The gory details are all in the Git repository, but this extract gives you a flavour.
Date: Thu Dec 9 20:43:09 2021 +0000 special handling for terminal-only patterns (%|$) I’m not super keen to do this, but it turned out we were not matching a naked $ with the existing set up. It’s always going to be a bit weird, and just pulling out terminal-only patterns seems neater than trying to futz with the core loop. As a minor side-effect, not that youd notice, quicker too. |
Date: Thu Dec 9 20:49:09 2021 +0000 replace $ only patterns |
Date: Thu Dec 9 21:30:52 2021 +0000 extend find and match functions to take an offset allows us to test the pattern against any part of the line, rather than just the whole line at a time. This is obviously particularly relevant for % anchored matches |
Date: Thu Dec 9 21:33:06 2021 +0000 rework change to deal with (%|$) zero width matches |
Date: Thu Dec 9 22:03:43 2021 +0000 correct terminal-only matching with non-zero offset |
Date: Thu Dec 9 22:04:42 2021 +0000 handle zero-width non-terminal matches - need to give them one more go applying change a* - to 12345 gives -1-2-3-4-5- The a* pattern matches before each character and at the end of the line. This is not exactly what you immediately expect, but try s/a*/-/g in sed and prepare to be slightly discombobulated. |
Date: Thu Dec 9 22:47:37 2021 +0000 if we immediately match in the same location, don’t replace again eg applying change a* - to abc gives -b-c- |
Hell of an evening, I’ll tell you.
The whole business made me quite sad for a little while, but good tests and repeated gentle refactoring got the code, and me, to a comfortable place. Here’s the main loop which finds the matches and inserts the replacement:
void apply_change(
stiX::pattern_matcher const& matcher,
stiX::replacement const& replacement,
std::string_view line,
std::ostream& out
) {
size_type offset = 0;
size_type last_match = -1;
for (auto state = replacement_state(); !state.completed(); state.next(at_end(offset, line))) {
auto loc = matcher.find(line, offset);
if (!loc.match)
break;
if (last_match != loc.from || !loc.zero_width) {
auto up_to_match = line.substr(offset, loc.from - offset);
out << up_to_match;
replacement.apply(line.substr(loc.from, loc.to - loc.from), out);
}
offset = loc.to;
last_match = loc.to;
if (loc.zero_width && not_at_end(offset, line)) {
out << line[loc.from];
offset = loc.from+1;
}
}
out << line.substr(offset) << '\n';
}
There’s still a bit of state management going on, but it’s fairly well under control. It took some thinking, but the special case for zero width matches is contained in one place, the loop condition is wrapped up in a little object and there are no nested if
statements.
The changes to the pattern matcher were, in the end, not large and didn’t change the overall shape of that part of the code. With the details of where and when to insert the new text, the replacement did, as I’d expected, pretty much just drop out.
So much of the work code we produce is compromised by time, scope, by any or all of a hundred other factors, and while we can be pleased with it, it also sometimes be a little unfulfilling, disappointing even. While I would never expect someone to write code in their own time, on their own account, there is a pleasure to be had from nurdling around with hermetic little problems like this.
Of course, looking at apply_change
again two weeks after I declared it 'finished', there are further changes I’d like to make. I was doing the best I could at the time, so I’m not beating myself up about it. I’m just seeing it differently today. That function’s about twice as long as I’m usually comfortable with for a start, so at the very least I’d cut that down.
Maybe I’ll do that now.
Happy New Year, friends - I hope you find comfort in your code too.
string_view - in the middle of this I realised I didn’t need to construct and copy strings around, I could just use a string_view
. Must make a conscious effort to keep on using it in the future.
sed, the grown-up program change
could be one day if we just kept going.
The full source code, with tests and what-not, for change, indeed the whole project, is available in the stiX GitHub repository.
This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.
For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.
The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.
Freelance software grandad
software created
extended or repaired
Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations