Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Thursday 19 August 2021 The Forest Road Reader, No 2.65 : STinC++ - find

STinC++ rewrites the programs in Software Tools in Pascal using C++

Chapter 5 at last, and honestly this is the one I’ve been waiting for since I started this ludicrous little project. We’re at the half-way point of the book and things are getting properly serious. The problems examined are bigger, the programs presented are more substantial. The gloves are off.

Text Patterns

This chapter introduces the concept of regular expressions, before going on to build an expression matcher and a couple of programs around it. Most programmers these days are, I imagine, at least passingly familiar with regular expression,[1] and they’re provided as part of the standard language or library of every mainstream language I can think of.

Again, we need to remind ourselves that Software Tools in Pascal was published in 1981 when this was absolutely not the case. Kernighan and Plauger take care to thoroughly introduce what they call text patterns, a restricted subset of regular expressions. They describe regular expressions as extensively studied but omit to mention, other than in a footnote, the key role Bell Labs and Unix, in the person of Ken Thompson, played in moving regular expressions from the realm of study and into the realm of every day programming.

Even though they’re presenting a subset, Kernighan and Plauger’s text patterns are not simple. A text pattern is made up a string of a combination of these elements -

c

literal character

?

any character except newline

%

beginning of line

$

end of line

[…​]

character class

[^…​]

negated character class

*

closure (zero or more occurrences of previous pattern)

@c

escaped characters

If we substitute ^ for % and \ for @ then this syntax is entirely familiar to the present day reader.

In the book, Kernighan and Plauger call the * a closure, where I’d call it a Kleene star. Turns out Stephen Kleene invented regular expressions, back in 1951. We’re both right! I think! Kleene star is the name of the * operator, but a closure it what it does.

A pattern matches a line if the line contains the pattern, regardless of where in the line the pattern is. The pattern aa matches aabc, also cbaa and baac, but not xyz. Matching is done on a line-by-line basis, with no pattern matches across the line boundary (although this constraint need not necessarily be true for modern pattern matchers).

Patterns can be concatenated - one pattern followed by another forms a new pattern that matches anything matched by the first pattern which is immediately followed by anything matched by the second. Any number of patterns can be concatenated. The pattern [a-zA-Z][a-zA-Z0-9]*$, a letter followed by any number of letters or digits at a line end, would, for instance, match every line ending in a Pascal keyword or identifier.

In our code, we have our pattern (actually our concatenation of patterns), and we have our input line. Starting at the first character of the line, we try to match our first pattern. If it does, we move onto the next pattern and try to match that to the second character, and so on. If we match all of our patterns then hurrah!

If, however, a pattern does not match then we wind back to the first pattern, and start all over again from the second character in the line. If that fails, we go again from the third character, and so on and so on. If we get all the way to the end of the line without a matching each pattern in turn, then that’s a failure - there’s no match.

This feels like straightforward stuff, with only the closure presenting any kind of wrinkle. A closure matches zero or more instances of a pattern, and is resolved by finding the first character that matches and then the longest possible sequence of matches. To implement a closure, when our pattern matches we advance as far as we can along the input, then try to match our remaining patterns. If they match too then we’ve succeeded. If not, then we need to step back a little then try to match the remaining patterns again, repeating until we either have a match or we’ve stepped right back to where we started.

There are some optimisations that might immediately leap out at you. Attempting to match a pattern anchored to the start of a line can be abandoned if the second element of the pattern doesn’t match, for example. Kernighan and Plauger don’t discuss this at all. They are going deep, but they don’t go that deep. And nor do I!

Having discussed text patterns, Kernighan and Plauger get straight into the first of two programs that use them, find.

find

PROGRAM

find find patterns in text

USAGE

find pattern

FUNCTION

find reads its input a line at a time and writes to its output those lines that match the specified text pattern. A text pattern is a concatenation of the following elements

c literal character c

? any character except newline

% beginning of line

$ end of line

[…​] character class

[^…​] negated character class

* closure (zero or more occurrences of previous pattern)

@c escaped characters

Special meaning of characters in a text pattern is lost when escaped, inside […​] (except @]), or for :

% not at beginning

$ not at end

* at beginning

A character class consists of zero of more of the following elements, surrounded by [ and ]:

c literal character c, including [

c1-c2 range of characters, (digits, lower or upper case letters)

^ negated character class if at beginning

@c escaped characters (e.g., @^ @- @@ @])

Special meanings of characters in a character class is lost when escaped or for:

^ not at beginning

- at beginning or end

An escaped sequence consists of the character @ followed by a single character:

@n newline

@t tab

@c c (including @@)

EXAMPLE

To print line ending in a Pascal keyword or identifier

find [a-zA-Z][a-zA-Z0-9]*$

There’s a lot going on just to parse out the pattern, let alone implement the matching. Kernighan and Plauger start outside in, sketching out a program that takes the command line arguments and processes the input using a simple stub matcher.

All this may seem pretty elementary, but it’s surprising how many bugs are caught early this way. In large software projects the majority of bugs arise because the pieces of the system do not go together as they were expected to …​ And many other bugs survive elaborate checks on individual routines, surfacing only when the routine first interacts with the rest of the code.

It seems only natural, then, to test at the highest, most integrated level first - since that’s where most bugs are detected anyway - and to start testing as soon as possible, even before most of the actual working code is written.

I’m happy to have conversation, even with Brian Kernighan and Bill Plauger, about where most bugs might arise, but I am absolutely signed up to start testing as soon as possible.

With this initial skeleton in place, Kernighan and Plauger work their way into their matcher, initially omitting the closure, continuing into parsing out the command line arguments, before finally looping back round to the closure. Their matcher implementation is rather neat. They encode the pattern as a series of op codes. For example

%[^x]?[0-9]x$

they encode as

BOL
NCCL 1 'x'
ANY
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL

A little routine then walks the op codes, evaluating each as it goes.

omatch
function omatch (var lin: string; var i : integer;
        var pat : string; j : integer) : boolean;  (1)
var
  advance : -1..1;
#include "locate.p"
begin
  advance := -1;
  if (lin[i] = ENDSTR) then
    omatch := false
  else if (not (pat[j] in
    [LITCHAR, BOL, EOL, ANY, CCL, NCCL, CLOSURE])) then
      error('in omatch: can''t happen')
  else
    case pat[j] of (2)
      LITCHAR:
        if (lin[i] = pat[j+1]) then
          advance := 1
      BOL:
        if (i = 1) then
          advance := 0
      ANY:
        if (lin[i] <> NEWLINE) then
          advance := 1;
      EOL:
        if (line[i] == NEWLINE) then
          advance := 0;
      CCL:
        if (locate(lin[i], pat, j+1)) then
          advance := 1;
      NCCL:
        if (lin[i] <> NEWLINE)
           and (not locate(lin[i], pat, j+1)) then
             advance := 1
    end;
  if (advance >= 0) then begin (3)
    i := i + advance
    omatch := true
  end
  else
    omatch := false
end;
  1. lin is the input line, i the current position in it, pat is the encoded pattern and j is the index of pattern opcode to evaluate

  2. The pattern tests - most are inline - and use either the current input character or the index position. If the test passes, advance is set to the number of input characters the test 'consumes'.

  3. If advance has been set, the test must have passed and processing can continue.

Since a closure applies to the preceding pattern, a* means zero-of-more a, their parsing routine inserts the closure opcode into their compiled sequence before the opcode to which it applies. Extending our earlier example

%[^x]?[0-9]*x$

encodes as

BOL
NCCL 1 'x'
ANY
CLOSURE
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL

Evaluating the closure happens at the level above the omatch function -

amatch
function amatch (var lin : string; offset : integer;
        var pat : string; j : integer) : integer;
var
  i, k : integer;
  done : boolean;
#include "omatch.p"
#include "patsize.p"
begin
  done := false;
  while (not done) and (pat[j] <> ENDSTR) do
    if (pat[j] = CLOSURE) then begin
      j := j + patsize(pat, j);  { step over CLOSURE }
      i := offset;
      while (not done) and (lin[i] <> ENDSTR) do (1)
        if (not omatch(lin, i, pat, j)) then
          done := true;
      done := false;
      while (not done) and (i >= offset) do begin (2)
        k := amatch(lin, i, pat, j+patsize(pat,j));
        if (k > 0) then  (3)
          done := true
        else
          i := i - 1
      end;
      offset := k;  { if k = 0 failure else success }
      done := true
    end
    else if (not omatch(lin, offset, pat, j)) then begin
      offset := 0;    { non-closure }
      done := true
    end
    else  { omatch succeeded on this pattern element }
      j := j + patsize(pat, j);
  amatch := offset
end;
  1. Match as many as possible.

  2. Index i gives us the character that made us fail, so now try to match the rest of the pattern.

  3. If k > 0 we’ve succeeded and can finish. Otherwise, step back and try again.

Kernighan and Plauger’s implementation is, in essence, a tiny compiler and virtual machine embedded with the find program. It’s really quite neat.

find in C++

The Players

matcher
using match_fn = std::function<bool(character_sequence const&)>;
using match_fn_with_consume = std::pair<match_fn, bool>;

class matcher {
public:
  bool match(character_sequence const& candidate) const;
  bool consumes() const;

private:
  explicit matcher(match_fn_with_consume fn);

  friend matcher make_matcher(character_sequence&);
};

A matcher is one pattern within a larger pattern - a single literal character, a character class, etc. Rather than build base_matcher class with a cloud of derived classes - literal_matcher, etc - around it, which the younger me might well have leapt into, we have a single matcher class with instances passed their specific behaviour at construction time.[2]

For some of our matchers, the matcher instance holds a simple function pointer

bool is_bol(character_sequence const& c) {
  return c.is_bol();
}

// ---

if (c == match_beginning_of_line && characters.is_bol())
  return { is_bol, false };

while for others, it holds a lambda function which, in turn, captures the value of interest

auto is_char_matcher(char target) {
  return [target](character_sequence const& c) {
    return *c == target;
  };
}

// ---

return { is_char_matcher(c), true };

All the functions implementing the tests are pretty trivial

bool is_any_char(character_sequence const& c) {
  return !c.is_eol() && (*c != '\n');
}

bool is_bol(character_sequence const& c) {
  return c.is_bol();
}

bool is_eol(character_sequence const& c) {
  return c.is_eol();
}

auto is_char_matcher(char target) {
  return [target](character_sequence const& c) {
    return *c == target;
  };
}

auto is_one_of_matcher(std::string const&targets) {
  return [targets](character_sequence const& c) {
    return targets.find(*c) != std::string::npos;
  };
}

auto is_not_one_of_matcher(std::string const& targets) {
  return [targets](character_sequence const& c) {
    return !c.is_eol() && (targets.find(*c) == std::string::npos);
  };
}

The second parameter to the matcher constructor, bool consume, is a flag which indicates, should the match function be true, whether we should consume over the current input character. Generally we do, but the special cases of % and $ , the match start of and end of line patterns, are zero width matches. If we match the start of the line, we don’t want to advance along the input as the next pattern needs to be tried against the first character

matcher instances are created by make_matcher, which is a simple parser function.

matcher make_matcher(character_sequence& characters) {
  return matcher(make_matcher_fn(characters));
}

match_fn_with_consume make_matcher_fn(character_sequence& characters) {
  char c = *characters;

  if (c == match_any_char)
    return { is_any_char, true };

  if (c == match_beginning_of_line && characters.is_bol())
    return { is_bol, false };

  if (c == match_end_of_line && !characters.available())
    return { is_eol, false };

  if (c == start_of_class)
    return make_character_class_matcher(characters);

  if (c == stiX::Escape)
    c = escape_char(characters);

  return { is_char_matcher(c), true };
}

The particularly lovely thing about std::function, the frankly magical general-purpose polymorphic function wrapper, is that it’s copyable. Since the matcher class only has one other data member, our simple consumes flag, that also means matcher is also copyable. We can, therefore, have polymorphic behaviour while retaining value semantics. We don’t have to worry about dynamic allocation, slicing, cloning, or any of the rest of that difficult stuff. We can just create a matcher instance and pass it around in the same way we might do an int.[3]

pattern_matcher
enum class match_count {
  one,
  zero_or_more
};
struct pattern {
  matcher test;
  match_count count;
};

using patterns = std::vector<pattern>;

class pattern_matcher {
public:
  bool match(std::string const& line) const;

private:
  explicit pattern_matcher(patterns m);

  friend pattern_matcher compile_pattern(std::string const&);
};

A pattern_matcher instance describes the complete pattern we want test, captured as a list of matcher objects. For each matcher, the pattern_matcher also captures its match_count, which tells us whether there’s a Kleene star in play or not.

Instances of pattern_matcher are created by compile_pattern.

pattern_matcher compile_pattern(std::string const& pattern) {
  auto matches = patterns { };

  for(auto seq = character_sequence(pattern); !seq.is_eol(); seq.advance()) {
    if (*seq == kleene_star && !matches.empty()) { (1)
      matches.back().count = match_count::zero_or_more;
      continue;
    }

    matches.emplace_back(make_matcher(seq));
  }

  return pattern_matcher(matches);
}
  1. A * at the start of the pattern is treated as a literal *.

We’ve seen how patterns are compiled, but how do we test them against the input line? Well, exactly as described earlier. Try the pattern against the input, return true if it hits, if not move along and try again. If we run out of input, return false because we’ve failed.

bool pattern_matcher::match(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    if (match_all(m_.begin(), m_.end(), seq))
      return true;
  }
  return false;
}

bool match_all(
  match_stages_iter mbegin,
  match_stages_iter const& mend,
  character_sequence seq
) {
  for(auto m = mbegin; m != mend; ++m) {
    switch (m->count) {
      case stiX::match_count::one:
        if (!match_one(m->test, seq))
          return false;
        break;
      case stiX::match_count::zero_or_more:
        return match_with_closure(m, mend, seq);
    }
  }
  return true;
}

bool match_one(
  matcher const& matcher,
  character_sequence& seq
) {
  if (!matcher.match(seq))
    return false;
  if (matcher.consumes())
    seq.advance();
  return true;
}

bool match_with_closure(
  match_stages_iter mbegin,
  match_stages_iter const& mend,
  character_sequence seq
) {
  seq.checkpoint();
  while(match_one(mbegin->test, seq));
  ++mbegin;
  do {
    if (match_all(mbegin, mend, seq))
      return true;
  } while(seq.rewind());

  return false;
}

Note how match_with_closure recurses back into match_all, and where the character_sequence is passed by value and where it’s passed by reference.

As we might expect, match_with_closure turns out to have a similar shape to the corresponding piece of Kernighan and Plauger’s code, albeit rather terser.

i := offset;
while (not done) and (lin[i] <> ENDSTR) do
  if (not omatch(lin, i, pat, j)) then
    done := true;
done := false;
while (not done) and (i >= offset) do begin
  k := amatch(lin, i, pat, j+patsize(pat,j));
  if (k > 0) then
    done := true
  else
    i := i - 1
end;
offset := k;
done := true
seq.checkpoint();
while(match_one(mbegin->test, seq));


++mbegin;
do {
  if (match_all(mbegin, mend, seq))
    return true;

} while(seq.rewind());



return false;
character_sequence
  class character_sequence {
  public:
    explicit character_sequence(std::string input);

    bool is_bol() const;
    bool is_eol() const;
    bool available() const;

    char operator *() const;

    void advance();

    void checkpoint();
    bool rewind();
  };

When we want to scan through a string or other standard container, we often pass around an iterator pair to delimit the sequence of interest. I could have done that here, but it would just have cluttered things up with additional function parameters and iterator comparisons and what-not. Instead, character_sequence wraps the iterator pair up together, along with some little utility functions, into one handy package.

The only slightly unusual functions are checkpoint and rewind, which are used in match_with_closure. The checkpoint function bookmarks a position in the sequence. Following one or more calls to advance, we can use rewind to walk back to the bookmark one character at a time.

Putting it all together

void stiX::find(std::istream& in, std::ostream& out, std::string const& pattern) {
  auto matcher = stiX::compile_pattern(pattern);

  while(in) {
    auto line = stiX::getline(in);

    if (matcher.match(line))
      out << line << '\n';
  }
}

int main(int argc, char const* argv[]) {
  auto arguments = stiX::make_arguments(argc, argv);
  if (arguments.size() != 1) {
    std::cout << argv[0] << " pattern" << std::endl;
    return 0;
  }

  stiX::find(std::cin, std::cout, arguments.front());
}

I’m pleased with this, especially the matcher. The whole program is approximately 230 lines long. About half of that is given over to parsing and building the pattern, with a bit less than half to the matcher evaluation. The evaluation itself is straightforward, with the only real complexity being in the closure. Yes, pleased.

Hold on, doesn’t C++ have a regular expression library as part of the standard?

Yes, and has done for over a decade.

Why not use that?

Well, ok then.

find_cheating
int main(int argc, char const* argv[]) {
  auto arguments = stiX::make_arguments(argc, argv);
  if (arguments.size() != 1) {
    std::cout << argv[0] << " pattern" << std::endl;
    return 0;
  }

  auto modern_pattern = translate_pattern(arguments.front());
  auto matcher = std::regex { modern_pattern,  std::regex_constants::ECMAScript }; (1)

  while(std::cin) {
    auto line = stiX::getline(std::cin);

    if (std::regex_search(line, matcher))
      std::cout << line << '\n';
  }
}

std::string translate_pattern(std::string const& pattern) {
  auto fixedUpPattern = std::string { };
  auto escaped = false;
  auto classStart = false;

  for (size_t i = 0; i != pattern.size(); ++i) {
    char c = pattern[i];

    if (c == '%') {
      fixedUpPattern += (i != 0) ? c : '^';
    } else if (c == '@' && !escaped) {
      escaped = true;
      fixedUpPattern += '\\';
    } else if (c == '?' && !escaped) {
      fixedUpPattern += '.';
    } else if (c == '^' && !classStart) {
      fixedUpPattern += "\\^";
    } else if (c == '\\' || c == '.' || c == '+') { (2)
      fixedUpPattern += ("\\"s + c);
    } else {
      classStart = (c == '[');
      escaped = false;
      fixedUpPattern += c;
    }
  }

  return fixedUpPattern;
}
  1. std::regex supports a number of different grammars. I’m using the ECMAScript grammar, because I originally wrote this is Javascript.

  2. These are special characters in the ECMAScript grammar, so we need to escape them to prevent surprising behaviour.

It’s not perfect, but it’ll do.

I’ve almost never worked on anything that was performance critical, but std::regex is widely considered to be slow as all hell, so take care with it. My very rough timing suggest it’s significantly slower than my naive efforts, even on small inputs of under 1000 lines.

References

  • std::function

  • std::regex

  • CTRE - Compile Time Regular Expressions. Yes, you read that right. If you know your regexes you need ahead of time, this remarkable library is for you.

  • operator""s - std::string literals

  • grep, of which find is a pale imitation

  • Mastering Regular Expression is exhaustive without being exhausting. Properly mind-expanding stuff.

Source Code

The full source code, with tests and what-not, for find and find_cheating, indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


1. Even if only "I know, I’ll use regular expressions." Now they have two problems., which, as so often, is a quote long since shorn of its context
2. Yes, patterns fans! It’s Strategy.
3. Dehnert and Stepanov describe this as a regular type, which matches the built-in type semantics, thereby making our user-defined types behave like built-in types. Scott Meyers' puts it more succinctly as When in doubt, do as the ints do

Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed