Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Sunday 02 January 2022 The Forest Road Reader, No 2.66 : STinC++ - change

STinC++ rewrites the programs in Software Tools in Pascal using C++

Back at the end of the summer, I described find, a tool for searching text using simple regular expressions. That was quite a chunky little project, involving describing the regular expression syntax, implementing the various matchers, coping with the complication that is the Kleene star, parsing out the pattern to build the matcher, and finally putting all together to search the input text.

With find complete, Kernighan and Plauger move breezily on

Now that we know how to identify patterns in text, let’s consider a useful tool for making selective changes…​

PROGRAM

change change patterns in text

USAGE

change pattern [newstuff]

FUNCTION

change copies its input to its input to its output except that each non-overlapping string that matches pattern is replaced by newstuff. A non-existent newstuff inplies a deletion of the matched string. The patterns accepted by change are the same as those used by find.

The replacement string newstuff consists of zero of more of the following elements:

c literal character c

& ditto, i.e., whatever was matched

@c escaped character c (e.g., @&)

EXAMPLE

To parenthesize all sums and differences of identifiers:

change "[a-zA-Z][a-zA-Z0-9]*[ ]*[+-][ ]*[a-zA-Z][a-zA-Z0-9]*" "(&)"

From finding to changing

Well this seems straightforward enough. We’ve just built a little regular expression matcher which can find a pattern in an input line. If we extend that slightly to tell us where it matched, then everything should more or less drop out.

The body of our existing find function is a simple loop. It tries to match the pattern to the start of the line, and to each subsequent position in turn until we either get a match or run off the end.

bool pattern_matcher::match(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    if (match_all(m_, seq))
      return true;
  }
  return false;
}

The business of looping over the characters in line is wrapped up in character_sequence. When we match the pattern, the character_sequence already knows where we are in the line, so extending the match and match_all functions to return that information to us is straightforward.

match_location pattern_matcher::find(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    auto match = match_all(m_, seq);
    if (match.match)
      return match;
  }
  return not_found;
}

bool pattern_matcher::match(std::string const& line) const {
  return find(line).match;
}

Our change program wants to replace all occurrences of a pattern, so we need one further small refinement to the find function - an offset from the start of the line.

match_location pattern_matcher::find(std::string const& line, size_type offset) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line, offset); !seq.is_eol() || once; seq.advance(), once = false) {
    auto match = match_all(m_, seq);
    if (match.match)
      return match;
  }
  return not_found;
}

With the match location available, it should be the work of moments to drop the replacement in. Right? Wrong!

"I think you’ll find it’s a little bit more complicated than that"

In the code above, there’s already a little bit of special handling for zero-width matches - patterns which match not on characters, but on the spaces in between characters. The beginning of line pattern, %, matches before the first character, the end of line pattern, $, matches after the last character. Both will also match an empty line.

But what about A*, zero or more letter As? Where does that match against, say, 1234? Bleeding everywhere, that’s where. It matches before 1, and also after 1. It matches before 2 as well. After 1 and before 2 are, of course, the same location, and similarly after 2 and before 3 are the same location, and so on.

In the change routine, accommodating zero-width matches, particularly taking into account that a call to find may return the same match location as the previous call, it turns out, is what we call a right pain in the arse. It took a reasonable amount of back and forth to get right. It kept feeling like we needed just another flag to account for this special case, or to show we needed to do something again just this once. The gory details are all in the Git repository, but this extract gives you a flavour.

Date: Thu Dec 9 20:43:09 2021 +0000

special handling for terminal-only patterns (%|$)

I’m not super keen to do this, but it turned out we were not matching a naked $ with the existing set up. It’s always going to be a bit weird, and just pulling out terminal-only patterns seems neater than trying to futz with the core loop. As a minor side-effect, not that youd notice, quicker too.

Date: Thu Dec 9 20:49:09 2021 +0000

replace $ only patterns

Date: Thu Dec 9 21:30:52 2021 +0000

extend find and match functions to take an offset

allows us to test the pattern against any part of the line, rather than just the whole line at a time. This is obviously particularly relevant for % anchored matches

Date: Thu Dec 9 21:33:06 2021 +0000

rework change to deal with (%|$) zero width matches

Date: Thu Dec 9 22:03:43 2021 +0000

correct terminal-only matching with non-zero offset

Date: Thu Dec 9 22:04:42 2021 +0000

handle zero-width non-terminal matches - need to give them one more go

applying change a* - to 12345 gives -1-2-3-4-5-

The a* pattern matches before each character and at the end of the line. This is not exactly what you immediately expect, but try s/a*/-/g in sed and prepare to be slightly discombobulated.

Date: Thu Dec 9 22:47:37 2021 +0000

if we immediately match in the same location, don’t replace again

eg applying change a* - to abc gives -b-c-

Hell of an evening, I’ll tell you.

Soft Landing

The whole business made me quite sad for a little while, but good tests and repeated gentle refactoring got the code, and me, to a comfortable place. Here’s the main loop which finds the matches and inserts the replacement:

void apply_change(
  stiX::pattern_matcher const& matcher,
  stiX::replacement const& replacement,
  std::string_view line,
  std::ostream& out
) {
  size_type offset = 0;
  size_type last_match = -1;

  for (auto state = replacement_state(); !state.completed(); state.next(at_end(offset, line))) {
    auto loc = matcher.find(line, offset);
    if (!loc.match)
      break;

    if (last_match != loc.from || !loc.zero_width) {
      auto up_to_match = line.substr(offset, loc.from - offset);
      out << up_to_match;
      replacement.apply(line.substr(loc.from, loc.to - loc.from), out);
    }

    offset = loc.to;
    last_match = loc.to;

    if (loc.zero_width && not_at_end(offset, line)) {
      out << line[loc.from];
      offset = loc.from+1;
    }
  }

  out << line.substr(offset) << '\n';
}

There’s still a bit of state management going on, but it’s fairly well under control. It took some thinking, but the special case for zero width matches is contained in one place, the loop condition is wrapped up in a little object and there are no nested if statements.

The changes to the pattern matcher were, in the end, not large and didn’t change the overall shape of that part of the code. With the details of where and when to insert the new text, the replacement did, as I’d expected, pretty much just drop out.

So much of the work code we produce is compromised by time, scope, by any or all of a hundred other factors, and while we can be pleased with it, it also sometimes be a little unfulfilling, disappointing even. While I would never expect someone to write code in their own time, on their own account, there is a pleasure to be had from nurdling around with hermetic little problems like this.

Thanks Chums

Of course, looking at apply_change again two weeks after I declared it 'finished', there are further changes I’d like to make. I was doing the best I could at the time, so I’m not beating myself up about it. I’m just seeing it differently today. That function’s about twice as long as I’m usually comfortable with for a start, so at the very least I’d cut that down.

Maybe I’ll do that now.

Happy New Year, friends - I hope you find comfort in your code too.

References

  • string_view - in the middle of this I realised I didn’t need to construct and copy strings around, I could just use a string_view. Must make a conscious effort to keep on using it in the future.

  • sed, the grown-up program change could be one day if we just kept going.

Source Code

The full source code, with tests and what-not, for change, indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed