Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Saturday 26 March 2022 First ice cream van of the year …​

A longitudinal study

First confirmed ice cream van sighting of 2022 occurred at 11:55 on March 26th. The van, Raja’s Ices, was heading south along Kingswood Road, playing Teddy Bears' Picnic.

This year’s Saturday observation further establishes the weekend as the most likely time of the week see that first ice cream van. Saturday and Sunday together account for more observations than Monday to Friday combined. The time, 11:55, is the earliest in the day yet recorded, with most sightings occurring between 16:00 and 17:15. The date is within expectation.

Field observations

Analysis

First Ice Cream Van Of The Year 2004-2022
First Ice Cream Van Of The Year 2004-2022
First Ice Cream Van Of The Year 2004-2022

The full ice cream van data is available as a spreadsheet.

Forward projection

Projecting from data gathered thus far, we project that next year’s first ice cream van should appear on the afternoon of 9th March 2023, with an envelope of +5/-2 days.

Previous Years


Tagged icecream

Tuesday 11 January 2022 The Forest Road Reader, No 2.67 : STinC++ - Chapter Five Wrap Up

STinC++ rewrites the programs in Software Tools in Pascal using C++

  • find - searching text using regular expressions

  • change - and then replacing the matches with something else

The bulk of chapter five was introducing and describing regular expressions. Regexes are ubiquitous these days, but you’d still struggle to find an introduction that’s as clear as that in Software Tools in Pascal. Kernighan and Plauger are both good writers individually, but together they’ve got some kind of superpower. Introducing regexes is all very well, but the point of chapter five is implemented a regular expression engine, and that was really good fun.

I don’t have a strong computer science background. I trained in electronic engineering and came into software sort of by accident. I have been doing this for getting on for 30 years, so I’ve picked up a few things along the way but I have a generally utilitarian approach. I probably could invert a tree, but I’d look for a library function first.[1] Anyway, I learned a whole lot building my little regex library, and like last chapter’s sort algorithms, unsophisticated as my implementation is, writing it is something I would never have otherwise written. It’s given me an appreciation of regexes I didn’t have before. In my code, for instance, I can see a couple of pretty obvious optimisations, mainly handling the no-match case. The Kleene star implementation is greedy, but I know how a non-greedy version would drop right in. I can see how the code could be extended to support capture groups. It’s a whole new little arena that’s opened up for me. Thanks Brian, thanks Bill.[2]

It helped cement some C++ for me too. There are always things you know, but perhaps not consciously or with full understanding. I knew, in an abstract kind of way, that C++ lambda functions have value semantics. However, since I’d pretty much always used lambdas functions as arguments to functions that were declared right at the call site, I hadn’t appreciated that they enabled, for example, runtime polymorphic behaviour in value objects. As the code compiles a regular expression into a pattern_matcher object, it’s building a std::vector of matcher objects. Each of those matcher objects can have different behaviour - matching the start of a line, a single character, one of a set of characters, or whatever. "Classically", if I’d written this code 10 years ago say, at the very least I’d have a base class, possibly abstract, and then a series of leaf classes to describe the different behaviours. Inside the pattern_matcher we’d have a list of pointers, there’d be a whole lot of memory management, and we’d all be a bit nervous. Today, we can, thanks to lambda functions and std::function, do that all with value objects. There’s no explicit memory handling, no worries about lifetime management, the code is compact and to the point. It’s fantastic.

In my write-ups of find and compare I quoted only a little bit of Kernighan and Plauger’s Pascal. As the problems we’re working on get that bit bigger, the C++ I’m writing is further and further away from the Pascal they wrote and comparisons between the two are less and less useful. We’ve established, for instance, that Kernighan and Plauger are hampered by Pascal’s lack of a proper string type, and there’s not much point in saying it again. The bigger problems they now present naturally have a bigger solution space, and C++ (or your current language of choice) lets us explore that solution space much more freely.

Two programs for chapter five, but only one coming up in chapter six, Editing. It’s going to be a bit of job, and I think I’m going to have to find a new way to write about it too. I hope you’ll follow along with me.


1. Assuming there are any because I can’t think of reason why you’d want to invert a tree.
2. PJ Plauger is widely known as Bill.

Tagged code, and software-tools-in-c++

Sunday 02 January 2022 The Forest Road Reader, No 2.66 : STinC++ - change

STinC++ rewrites the programs in Software Tools in Pascal using C++

Back at the end of the summer, I described find, a tool for searching text using simple regular expressions. That was quite a chunky little project, involving describing the regular expression syntax, implementing the various matchers, coping with the complication that is the Kleene star, parsing out the pattern to build the matcher, and finally putting all together to search the input text.

With find complete, Kernighan and Plauger move breezily on

Now that we know how to identify patterns in text, let’s consider a useful tool for making selective changes…​

PROGRAM

change change patterns in text

USAGE

change pattern [newstuff]

FUNCTION

change copies its input to its input to its output except that each non-overlapping string that matches pattern is replaced by newstuff. A non-existent newstuff inplies a deletion of the matched string. The patterns accepted by change are the same as those used by find.

The replacement string newstuff consists of zero of more of the following elements:

c literal character c

& ditto, i.e., whatever was matched

@c escaped character c (e.g., @&)

EXAMPLE

To parenthesize all sums and differences of identifiers:

change "[a-zA-Z][a-zA-Z0-9]*[ ]*[+-][ ]*[a-zA-Z][a-zA-Z0-9]*" "(&)"

From finding to changing

Well this seems straightforward enough. We’ve just built a little regular expression matcher which can find a pattern in an input line. If we extend that slightly to tell us where it matched, then everything should more or less drop out.

The body of our existing find function is a simple loop. It tries to match the pattern to the start of the line, and to each subsequent position in turn until we either get a match or run off the end.

bool pattern_matcher::match(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    if (match_all(m_, seq))
      return true;
  }
  return false;
}

The business of looping over the characters in line is wrapped up in character_sequence. When we match the pattern, the character_sequence already knows where we are in the line, so extending the match and match_all functions to return that information to us is straightforward.

match_location pattern_matcher::find(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    auto match = match_all(m_, seq);
    if (match.match)
      return match;
  }
  return not_found;
}

bool pattern_matcher::match(std::string const& line) const {
  return find(line).match;
}

Our change program wants to replace all occurrences of a pattern, so we need one further small refinement to the find function - an offset from the start of the line.

match_location pattern_matcher::find(std::string const& line, size_type offset) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line, offset); !seq.is_eol() || once; seq.advance(), once = false) {
    auto match = match_all(m_, seq);
    if (match.match)
      return match;
  }
  return not_found;
}

With the match location available, it should be the work of moments to drop the replacement in. Right? Wrong!

"I think you’ll find it’s a little bit more complicated than that"

In the code above, there’s already a little bit of special handling for zero-width matches - patterns which match not on characters, but on the spaces in between characters. The beginning of line pattern, %, matches before the first character, the end of line pattern, $, matches after the last character. Both will also match an empty line.

But what about A*, zero or more letter As? Where does that match against, say, 1234? Bleeding everywhere, that’s where. It matches before 1, and also after 1. It matches before 2 as well. After 1 and before 2 are, of course, the same location, and similarly after 2 and before 3 are the same location, and so on.

In the change routine, accommodating zero-width matches, particularly taking into account that a call to find may return the same match location as the previous call, it turns out, is what we call a right pain in the arse. It took a reasonable amount of back and forth to get right. It kept feeling like we needed just another flag to account for this special case, or to show we needed to do something again just this once. The gory details are all in the Git repository, but this extract gives you a flavour.

Date: Thu Dec 9 20:43:09 2021 +0000

special handling for terminal-only patterns (%|$)

I’m not super keen to do this, but it turned out we were not matching a naked $ with the existing set up. It’s always going to be a bit weird, and just pulling out terminal-only patterns seems neater than trying to futz with the core loop. As a minor side-effect, not that youd notice, quicker too.

Date: Thu Dec 9 20:49:09 2021 +0000

replace $ only patterns

Date: Thu Dec 9 21:30:52 2021 +0000

extend find and match functions to take an offset

allows us to test the pattern against any part of the line, rather than just the whole line at a time. This is obviously particularly relevant for % anchored matches

Date: Thu Dec 9 21:33:06 2021 +0000

rework change to deal with (%|$) zero width matches

Date: Thu Dec 9 22:03:43 2021 +0000

correct terminal-only matching with non-zero offset

Date: Thu Dec 9 22:04:42 2021 +0000

handle zero-width non-terminal matches - need to give them one more go

applying change a* - to 12345 gives -1-2-3-4-5-

The a* pattern matches before each character and at the end of the line. This is not exactly what you immediately expect, but try s/a*/-/g in sed and prepare to be slightly discombobulated.

Date: Thu Dec 9 22:47:37 2021 +0000

if we immediately match in the same location, don’t replace again

eg applying change a* - to abc gives -b-c-

Hell of an evening, I’ll tell you.

Soft Landing

The whole business made me quite sad for a little while, but good tests and repeated gentle refactoring got the code, and me, to a comfortable place. Here’s the main loop which finds the matches and inserts the replacement:

void apply_change(
  stiX::pattern_matcher const& matcher,
  stiX::replacement const& replacement,
  std::string_view line,
  std::ostream& out
) {
  size_type offset = 0;
  size_type last_match = -1;

  for (auto state = replacement_state(); !state.completed(); state.next(at_end(offset, line))) {
    auto loc = matcher.find(line, offset);
    if (!loc.match)
      break;

    if (last_match != loc.from || !loc.zero_width) {
      auto up_to_match = line.substr(offset, loc.from - offset);
      out << up_to_match;
      replacement.apply(line.substr(loc.from, loc.to - loc.from), out);
    }

    offset = loc.to;
    last_match = loc.to;

    if (loc.zero_width && not_at_end(offset, line)) {
      out << line[loc.from];
      offset = loc.from+1;
    }
  }

  out << line.substr(offset) << '\n';
}

There’s still a bit of state management going on, but it’s fairly well under control. It took some thinking, but the special case for zero width matches is contained in one place, the loop condition is wrapped up in a little object and there are no nested if statements.

The changes to the pattern matcher were, in the end, not large and didn’t change the overall shape of that part of the code. With the details of where and when to insert the new text, the replacement did, as I’d expected, pretty much just drop out.

So much of the work code we produce is compromised by time, scope, by any or all of a hundred other factors, and while we can be pleased with it, it also sometimes be a little unfulfilling, disappointing even. While I would never expect someone to write code in their own time, on their own account, there is a pleasure to be had from nurdling around with hermetic little problems like this.

Thanks Chums

Of course, looking at apply_change again two weeks after I declared it 'finished', there are further changes I’d like to make. I was doing the best I could at the time, so I’m not beating myself up about it. I’m just seeing it differently today. That function’s about twice as long as I’m usually comfortable with for a start, so at the very least I’d cut that down.

Maybe I’ll do that now.

Happy New Year, friends - I hope you find comfort in your code too.

References

  • string_view - in the middle of this I realised I didn’t need to construct and copy strings around, I could just use a string_view. Must make a conscious effort to keep on using it in the future.

  • sed, the grown-up program change could be one day if we just kept going.

Source Code

The full source code, with tests and what-not, for change, indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed