%[^x]?[0-9]x$
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations
STinC++ rewrites the programs in Software Tools in Pascal using C++
Chapter 5 at last, and honestly this is the one I’ve been waiting for since I started this ludicrous little project. We’re at the half-way point of the book and things are getting properly serious. The problems examined are bigger, the programs presented are more substantial. The gloves are off.
This chapter introduces the concept of regular expressions, before going on to build an expression matcher and a couple of programs around it. Most programmers these days are, I imagine, at least passingly familiar with regular expression,[1] and they’re provided as part of the standard language or library of every mainstream language I can think of.
Again, we need to remind ourselves that Software Tools in Pascal was published in 1981 when this was absolutely not the case. Kernighan and Plauger take care to thoroughly introduce what they call text patterns, a restricted subset of regular expressions. They describe regular expressions as extensively studied but omit to mention, other than in a footnote, the key role Bell Labs and Unix, in the person of Ken Thompson, played in moving regular expressions from the realm of study and into the realm of every day programming.
Even though they’re presenting a subset, Kernighan and Plauger’s text patterns are not simple. A text pattern is made up a string of a combination of these elements -
c |
literal character |
? |
any character except newline |
% |
beginning of line |
$ |
end of line |
[…] |
character class |
[^…] |
negated character class |
* |
closure (zero or more occurrences of previous pattern) |
@c |
escaped characters |
If we substitute ^
for %
and \
for @
then this syntax is entirely familiar to the present day reader.
In the book, Kernighan and Plauger call the *
a closure, where I’d call it a Kleene star. Turns out Stephen Kleene invented regular expressions, back in 1951. We’re both right! I think! Kleene star is the name of the *
operator, but a closure it what it does.
A pattern matches a line if the line contains the pattern, regardless of where in the line the pattern is. The pattern aa
matches aabc
, also cbaa
and baac
, but not xyz
. Matching is done on a line-by-line basis, with no pattern matches across the line boundary (although this constraint need not necessarily be true for modern pattern matchers).
Patterns can be concatenated - one pattern followed by another forms a new pattern that matches anything matched by the first pattern which is immediately followed by anything matched by the second. Any number of patterns can be concatenated. The pattern [a-zA-Z][a-zA-Z0-9]*$
, a letter followed by any number of letters or digits at a line end, would, for instance, match every line ending in a Pascal keyword or identifier.
In our code, we have our pattern (actually our concatenation of patterns), and we have our input line. Starting at the first character of the line, we try to match our first pattern. If it does, we move onto the next pattern and try to match that to the second character, and so on. If we match all of our patterns then hurrah!
If, however, a pattern does not match then we wind back to the first pattern, and start all over again from the second character in the line. If that fails, we go again from the third character, and so on and so on. If we get all the way to the end of the line without a matching each pattern in turn, then that’s a failure - there’s no match.
This feels like straightforward stuff, with only the closure presenting any kind of wrinkle. A closure matches zero or more instances of a pattern, and is resolved by finding the first character that matches and then the longest possible sequence of matches. To implement a closure, when our pattern matches we advance as far as we can along the input, then try to match our remaining patterns. If they match too then we’ve succeeded. If not, then we need to step back a little then try to match the remaining patterns again, repeating until we either have a match or we’ve stepped right back to where we started.
There are some optimisations that might immediately leap out at you. Attempting to match a pattern anchored to the start of a line can be abandoned if the second element of the pattern doesn’t match, for example. Kernighan and Plauger don’t discuss this at all. They are going deep, but they don’t go that deep. And nor do I!
Having discussed text patterns, Kernighan and Plauger get straight into the first of two programs that use them, find
.
find
PROGRAM |
|
USAGE |
|
FUNCTION |
Special meaning of characters in a text pattern is lost when escaped, inside
A character class consists of zero of more of the following elements, surrounded by
Special meanings of characters in a character class is lost when escaped or for:
An escaped sequence consists of the character
|
EXAMPLE |
To print line ending in a Pascal keyword or identifier
|
There’s a lot going on just to parse out the pattern, let alone implement the matching. Kernighan and Plauger start outside in, sketching out a program that takes the command line arguments and processes the input using a simple stub matcher.
All this may seem pretty elementary, but it’s surprising how many bugs are caught early this way. In large software projects the majority of bugs arise because the pieces of the system do not go together as they were expected to … And many other bugs survive elaborate checks on individual routines, surfacing only when the routine first interacts with the rest of the code.
It seems only natural, then, to test at the highest, most integrated level first - since that’s where most bugs are detected anyway - and to start testing as soon as possible, even before most of the actual working code is written.
I’m happy to have conversation, even with Brian Kernighan and Bill Plauger, about where most bugs might arise, but I am absolutely signed up to start testing as soon as possible.
With this initial skeleton in place, Kernighan and Plauger work their way into their matcher, initially omitting the closure, continuing into parsing out the command line arguments, before finally looping back round to the closure. Their matcher implementation is rather neat. They encode the pattern as a series of op codes. For example
%[^x]?[0-9]x$
they encode as
BOL
NCCL 1 'x'
ANY
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL
A little routine then walks the op codes, evaluating each as it goes.
function omatch (var lin: string; var i : integer;
var pat : string; j : integer) : boolean; (1)
var
advance : -1..1;
#include "locate.p"
begin
advance := -1;
if (lin[i] = ENDSTR) then
omatch := false
else if (not (pat[j] in
[LITCHAR, BOL, EOL, ANY, CCL, NCCL, CLOSURE])) then
error('in omatch: can''t happen')
else
case pat[j] of (2)
LITCHAR:
if (lin[i] = pat[j+1]) then
advance := 1
BOL:
if (i = 1) then
advance := 0
ANY:
if (lin[i] <> NEWLINE) then
advance := 1;
EOL:
if (line[i] == NEWLINE) then
advance := 0;
CCL:
if (locate(lin[i], pat, j+1)) then
advance := 1;
NCCL:
if (lin[i] <> NEWLINE)
and (not locate(lin[i], pat, j+1)) then
advance := 1
end;
if (advance >= 0) then begin (3)
i := i + advance
omatch := true
end
else
omatch := false
end;
lin
is the input line, i
the current position in it, pat
is the encoded pattern and j
is the index of pattern opcode to evaluate
The pattern tests - most are inline - and use either the current input character or the index position. If the test passes, advance
is set to the number of input characters the test 'consumes'.
If advance
has been set, the test must have passed and processing can continue.
Since a closure applies to the preceding pattern, a*
means zero-of-more a
, their parsing routine inserts the closure opcode into their compiled sequence before the opcode to which it applies. Extending our earlier example
%[^x]?[0-9]*x$
encodes as
BOL
NCCL 1 'x'
ANY
CLOSURE
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL
Evaluating the closure happens at the level above the omatch
function -
function amatch (var lin : string; offset : integer;
var pat : string; j : integer) : integer;
var
i, k : integer;
done : boolean;
#include "omatch.p"
#include "patsize.p"
begin
done := false;
while (not done) and (pat[j] <> ENDSTR) do
if (pat[j] = CLOSURE) then begin
j := j + patsize(pat, j); { step over CLOSURE }
i := offset;
while (not done) and (lin[i] <> ENDSTR) do (1)
if (not omatch(lin, i, pat, j)) then
done := true;
done := false;
while (not done) and (i >= offset) do begin (2)
k := amatch(lin, i, pat, j+patsize(pat,j));
if (k > 0) then (3)
done := true
else
i := i - 1
end;
offset := k; { if k = 0 failure else success }
done := true
end
else if (not omatch(lin, offset, pat, j)) then begin
offset := 0; { non-closure }
done := true
end
else { omatch succeeded on this pattern element }
j := j + patsize(pat, j);
amatch := offset
end;
Match as many as possible.
Index i
gives us the character that made us fail, so now try to match the rest of the pattern.
If k > 0
we’ve succeeded and can finish. Otherwise, step back and try again.
Kernighan and Plauger’s implementation is, in essence, a tiny compiler and virtual machine embedded with the find
program. It’s really quite neat.
find
in C++using match_fn = std::function<bool(character_sequence const&)>;
using match_fn_with_consume = std::pair<match_fn, bool>;
class matcher {
public:
bool match(character_sequence const& candidate) const;
bool consumes() const;
private:
explicit matcher(match_fn_with_consume fn);
friend matcher make_matcher(character_sequence&);
};
A matcher is one pattern within a larger pattern - a single literal character, a character class, etc. Rather than build base_matcher
class with a cloud of derived classes - literal_matcher
, etc - around it, which the younger me might well have leapt into, we have a single matcher class with instances passed their specific behaviour at construction time.[2]
For some of our matchers, the matcher
instance holds a simple function pointer
bool is_bol(character_sequence const& c) {
return c.is_bol();
}
// ---
if (c == match_beginning_of_line && characters.is_bol())
return { is_bol, false };
while for others, it holds a lambda function which, in turn, captures the value of interest
auto is_char_matcher(char target) {
return [target](character_sequence const& c) {
return *c == target;
};
}
// ---
return { is_char_matcher(c), true };
All the functions implementing the tests are pretty trivial
bool is_any_char(character_sequence const& c) {
return !c.is_eol() && (*c != '\n');
}
bool is_bol(character_sequence const& c) {
return c.is_bol();
}
bool is_eol(character_sequence const& c) {
return c.is_eol();
}
auto is_char_matcher(char target) {
return [target](character_sequence const& c) {
return *c == target;
};
}
auto is_one_of_matcher(std::string const&targets) {
return [targets](character_sequence const& c) {
return targets.find(*c) != std::string::npos;
};
}
auto is_not_one_of_matcher(std::string const& targets) {
return [targets](character_sequence const& c) {
return !c.is_eol() && (targets.find(*c) == std::string::npos);
};
}
The second parameter to the matcher
constructor, bool consume
, is a flag which indicates, should the match function be true, whether we should consume over the current input character. Generally we do, but the special cases of %
and $
, the match start of and end of line patterns, are zero width matches. If we match the start of the line, we don’t want to advance along the input as the next pattern needs to be tried against the first character
matcher
instances are created by make_matcher
, which is a simple parser function.
matcher make_matcher(character_sequence& characters) {
return matcher(make_matcher_fn(characters));
}
match_fn_with_consume make_matcher_fn(character_sequence& characters) {
char c = *characters;
if (c == match_any_char)
return { is_any_char, true };
if (c == match_beginning_of_line && characters.is_bol())
return { is_bol, false };
if (c == match_end_of_line && !characters.available())
return { is_eol, false };
if (c == start_of_class)
return make_character_class_matcher(characters);
if (c == stiX::Escape)
c = escape_char(characters);
return { is_char_matcher(c), true };
}
The particularly lovely thing about std::function
, the frankly magical general-purpose polymorphic function wrapper, is that it’s copyable. Since the matcher
class only has one other data member, our simple consumes
flag, that also means matcher
is also copyable. We can, therefore, have polymorphic behaviour while retaining value semantics. We don’t have to worry about dynamic allocation, slicing, cloning, or any of the rest of that difficult stuff. We can just create a matcher
instance and pass it around in the same way we might do an int
.[3]
enum class match_count {
one,
zero_or_more
};
struct pattern {
matcher test;
match_count count;
};
using patterns = std::vector<pattern>;
class pattern_matcher {
public:
bool match(std::string const& line) const;
private:
explicit pattern_matcher(patterns m);
friend pattern_matcher compile_pattern(std::string const&);
};
A pattern_matcher
instance describes the complete pattern we want test, captured as a list of matcher
objects. For each matcher
, the pattern_matcher
also captures its match_count
, which tells us whether there’s a Kleene star in play or not.
Instances of pattern_matcher
are created by compile_pattern
.
pattern_matcher compile_pattern(std::string const& pattern) {
auto matches = patterns { };
for(auto seq = character_sequence(pattern); !seq.is_eol(); seq.advance()) {
if (*seq == kleene_star && !matches.empty()) { (1)
matches.back().count = match_count::zero_or_more;
continue;
}
matches.emplace_back(make_matcher(seq));
}
return pattern_matcher(matches);
}
A *
at the start of the pattern is treated as a literal *
.
We’ve seen how patterns are compiled, but how do we test them against the input line? Well, exactly as described earlier. Try the pattern against the input, return true
if it hits, if not move along and try again. If we run out of input, return false
because we’ve failed.
bool pattern_matcher::match(std::string const& line) const {
bool once = true; // need to try at least once, because even zero length input might match
for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
if (match_all(m_.begin(), m_.end(), seq))
return true;
}
return false;
}
bool match_all(
match_stages_iter mbegin,
match_stages_iter const& mend,
character_sequence seq
) {
for(auto m = mbegin; m != mend; ++m) {
switch (m->count) {
case stiX::match_count::one:
if (!match_one(m->test, seq))
return false;
break;
case stiX::match_count::zero_or_more:
return match_with_closure(m, mend, seq);
}
}
return true;
}
bool match_one(
matcher const& matcher,
character_sequence& seq
) {
if (!matcher.match(seq))
return false;
if (matcher.consumes())
seq.advance();
return true;
}
bool match_with_closure(
match_stages_iter mbegin,
match_stages_iter const& mend,
character_sequence seq
) {
seq.checkpoint();
while(match_one(mbegin->test, seq));
++mbegin;
do {
if (match_all(mbegin, mend, seq))
return true;
} while(seq.rewind());
return false;
}
Note how match_with_closure
recurses back into match_all
, and where the character_sequence
is passed by value and where it’s passed by reference.
As we might expect, match_with_closure
turns out to have a similar shape to the corresponding piece of Kernighan and Plauger’s code, albeit rather terser.
|
|
class character_sequence {
public:
explicit character_sequence(std::string input);
bool is_bol() const;
bool is_eol() const;
bool available() const;
char operator *() const;
void advance();
void checkpoint();
bool rewind();
};
When we want to scan through a string
or other standard container, we often pass around an iterator pair to delimit the sequence of interest. I could have done that here, but it would just have cluttered things up with additional function parameters and iterator comparisons and what-not. Instead, character_sequence
wraps the iterator pair up together, along with some little utility functions, into one handy package.
The only slightly unusual functions are checkpoint
and rewind
, which are used in match_with_closure
. The checkpoint
function bookmarks a position in the sequence. Following one or more calls to advance
, we can use rewind
to walk back to the bookmark one character at a time.
void stiX::find(std::istream& in, std::ostream& out, std::string const& pattern) {
auto matcher = stiX::compile_pattern(pattern);
while(in) {
auto line = stiX::getline(in);
if (matcher.match(line))
out << line << '\n';
}
}
int main(int argc, char const* argv[]) {
auto arguments = stiX::make_arguments(argc, argv);
if (arguments.size() != 1) {
std::cout << argv[0] << " pattern" << std::endl;
return 0;
}
stiX::find(std::cin, std::cout, arguments.front());
}
I’m pleased with this, especially the matcher
. The whole program is approximately 230 lines long. About half of that is given over to parsing and building the pattern, with a bit less than half to the matcher evaluation. The evaluation itself is straightforward, with the only real complexity being in the closure. Yes, pleased.
Yes, and has done for over a decade.
Why not use that?
Well, ok then.
int main(int argc, char const* argv[]) {
auto arguments = stiX::make_arguments(argc, argv);
if (arguments.size() != 1) {
std::cout << argv[0] << " pattern" << std::endl;
return 0;
}
auto modern_pattern = translate_pattern(arguments.front());
auto matcher = std::regex { modern_pattern, std::regex_constants::ECMAScript }; (1)
while(std::cin) {
auto line = stiX::getline(std::cin);
if (std::regex_search(line, matcher))
std::cout << line << '\n';
}
}
std::string translate_pattern(std::string const& pattern) {
auto fixedUpPattern = std::string { };
auto escaped = false;
auto classStart = false;
for (size_t i = 0; i != pattern.size(); ++i) {
char c = pattern[i];
if (c == '%') {
fixedUpPattern += (i != 0) ? c : '^';
} else if (c == '@' && !escaped) {
escaped = true;
fixedUpPattern += '\\';
} else if (c == '?' && !escaped) {
fixedUpPattern += '.';
} else if (c == '^' && !classStart) {
fixedUpPattern += "\\^";
} else if (c == '\\' || c == '.' || c == '+') { (2)
fixedUpPattern += ("\\"s + c);
} else {
classStart = (c == '[');
escaped = false;
fixedUpPattern += c;
}
}
return fixedUpPattern;
}
std::regex
supports a number of different grammars. I’m using the ECMAScript grammar, because I originally wrote this is Javascript.
These are special characters in the ECMAScript grammar, so we need to escape them to prevent surprising behaviour.
It’s not perfect, but it’ll do.
I’ve almost never worked on anything that was performance critical, but std::regex is widely considered to be slow as all hell, so take care with it. My very rough timing suggest it’s significantly slower than my naive efforts, even on small inputs of under 1000 lines.
CTRE - Compile Time Regular Expressions. Yes, you read that right. If you know your regexes you need ahead of time, this remarkable library is for you.
operator""s -
std::string
literals
grep, of which find
is a pale imitation
Mastering Regular Expression is exhaustive without being exhausting. Properly mind-expanding stuff.
The full source code, with tests and what-not, for find and find_cheating, indeed the whole project, is available in the stiX GitHub repository.
This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.
For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.
The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations