Jez Higgins

Freelance software grandad
software created
extended or repaired



Thursday 19 August 2021 The Forest Road Reader, No 2.65 : STinC++ - find

STinC++ rewrites the programs in Software Tools in Pascal using C++

Chapter 5 at last, and honestly this is the one I’ve been waiting for since I started this ludicrous little project. We’re at the half-way point of the book and things are getting properly serious. The problems examined are bigger, the programs presented are more substantial. The gloves are off.

Text Patterns

This chapter introduces the concept of regular expressions, before going on to build an expression matcher and a couple of programs around it. Most programmers these days are, I imagine, at least passingly familiar with regular expressions,[1] and they’re provided as part of the standard language or library of every mainstream language I can think of.

Again, we need to remind ourselves that Software Tools in Pascal was published in 1981, when this was absolutely not the case. Kernighan and Plauger take care to thoroughly introduce what they call text patterns, a restricted subset of regular expressions. They describe regular expressions as extensively studied, but omit to mention, other than in a footnote, the key role Bell Labs and Unix, in the person of Ken Thompson, played in moving regular expressions out of the realm of study and into the realm of everyday programming.

Even though they’re presenting a subset, Kernighan and Plauger’s text patterns are not simple. A text pattern is a string made up of a combination of these elements:

c literal character

? any character except newline

% beginning of line

$ end of line

[…] character class

[^…] negated character class

* closure (zero or more occurrences of previous pattern)

@c escaped characters

If we substitute ^ for % and \ for @ then this syntax is entirely familiar to the present day reader.

In the book, Kernighan and Plauger call the * a closure, where I’d call it a Kleene star. Turns out Stephen Kleene invented regular expressions, back in 1951. We’re both right! I think! Kleene star is the name of the * operator, but a closure is what it does.

A pattern matches a line if the line contains the pattern, regardless of where in the line the pattern is. The pattern aa matches aabc, also cbaa and baac, but not xyz. Matching is done on a line-by-line basis, with no pattern matches across the line boundary (although this constraint need not necessarily hold for modern pattern matchers).

Patterns can be concatenated - one pattern followed by another forms a new pattern that matches anything matched by the first pattern which is immediately followed by anything matched by the second. Any number of patterns can be concatenated. The pattern [a-zA-Z][a-zA-Z0-9]*$, a letter followed by any number of letters or digits at a line end, would, for instance, match every line ending in a Pascal keyword or identifier.

In our code, we have our pattern (actually our concatenation of patterns), and we have our input line. Starting at the first character of the line, we try to match our first pattern. If it does, we move onto the next pattern and try to match that to the second character, and so on. If we match all of our patterns then hurrah!

If, however, a pattern does not match, then we wind back to the first pattern and start all over again from the second character in the line. If that fails, we go again from the third character, and so on and so on. If we get all the way to the end of the line without matching each pattern in turn, then that’s a failure - there’s no match.

This feels like straightforward stuff, with only the closure presenting any kind of wrinkle. A closure matches zero or more instances of a pattern, and is resolved by finding the first character that matches and then the longest possible sequence of matches. To implement a closure, when our pattern matches we advance as far as we can along the input, then try to match our remaining patterns. If they match too then we’ve succeeded. If not, then we need to step back a little then try to match the remaining patterns again, repeating until we either have a match or we’ve stepped right back to where we started.
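The whole greedy-match-then-backtrack dance can be sketched in miniature. This toy matcher is neither the book’s code nor mine - it handles only literal characters and *, not the full pattern syntax - but the closure logic is the same shape:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Does pat[pi..] match text[ti..]? Literals and '*' only.
bool match_here(std::string const& text, std::size_t ti,
                std::string const& pat, std::size_t pi) {
  if (pi == pat.size())
    return true; // ran out of pattern, so everything matched
  if (pi + 1 < pat.size() && pat[pi + 1] == '*') {
    // closure: first consume as many pat[pi] as possible ...
    std::size_t i = ti;
    while (i < text.size() && text[i] == pat[pi])
      ++i;
    // ... then step back until the rest of the pattern matches,
    // or we're right back where we started
    for (;;) {
      if (match_here(text, i, pat, pi + 2))
        return true;
      if (i == ti)
        return false;
      --i;
    }
  }
  return ti < text.size() && text[ti] == pat[pi]
      && match_here(text, ti + 1, pat, pi + 1);
}

// A pattern matches a line if it matches anywhere in the line
bool match(std::string const& text, std::string const& pat) {
  for (std::size_t ti = 0; ti <= text.size(); ++ti)
    if (match_here(text, ti, pat, 0))
      return true;
  return false;
}
```

Matching aaab against a*ab shows the stepping back in action - the closure greedily eats all three as, the rest of the pattern fails, the closure gives one a back, and the match succeeds.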

There are some optimisations that might immediately leap out at you. Attempting to match a pattern anchored to the start of a line can be abandoned if the second element of the pattern doesn’t match, for example. Kernighan and Plauger don’t discuss this at all. They are going deep, but they don’t go that deep. And nor do I!

Having discussed text patterns, Kernighan and Plauger get straight into the first of two programs that use them, find.

find

PROGRAM

find find patterns in text

USAGE

find pattern

FUNCTION

find reads its input a line at a time and writes to its output those lines that match the specified text pattern. A text pattern is a concatenation of the following elements

c literal character c

? any character except newline

% beginning of line

$ end of line

[…​] character class

[^…​] negated character class

* closure (zero or more occurrences of previous pattern)

@c escaped characters

Special meaning of characters in a text pattern is lost when escaped, inside […] (except @]), or for:

% not at beginning

$ not at end

* at beginning

A character class consists of zero or more of the following elements, surrounded by [ and ]:

c literal character c, including [

c1-c2 range of characters (digits, lower or upper case letters)

^ negated character class if at beginning

@c escaped characters (e.g., @^ @- @@ @])

Special meanings of characters in a character class are lost when escaped or for:

^ not at beginning

- at beginning or end

An escaped sequence consists of the character @ followed by a single character:

@n newline

@t tab

@c c (including @@)

EXAMPLE

To print lines ending in a Pascal keyword or identifier

find [a-zA-Z][a-zA-Z0-9]*$

There’s a lot going on just to parse out the pattern, let alone implement the matching. Kernighan and Plauger start outside in, sketching out a program that takes the command line arguments and processes the input using a simple stub matcher.

All this may seem pretty elementary, but it’s surprising how many bugs are caught early this way. In large software projects the majority of bugs arise because the pieces of the system do not go together as they were expected to …​ And many other bugs survive elaborate checks on individual routines, surfacing only when the routine first interacts with the rest of the code.

It seems only natural, then, to test at the highest, most integrated level first - since that’s where most bugs are detected anyway - and to start testing as soon as possible, even before most of the actual working code is written.

I’m happy to have a conversation, even with Brian Kernighan and Bill Plauger, about where most bugs might arise, but I am absolutely signed up to start testing as soon as possible.

With this initial skeleton in place, Kernighan and Plauger work their way into their matcher, initially omitting the closure, continuing into parsing out the command line arguments, before finally looping back round to the closure. Their matcher implementation is rather neat. They encode the pattern as a series of op codes. For example

%[^x]?[0-9]x$

they encode as

BOL
NCCL 1 'x'
ANY
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL

A little routine then walks the op codes, evaluating each as it goes.

omatch
function omatch (var lin: string; var i : integer;
        var pat : string; j : integer) : boolean;  (1)
var
  advance : -1..1;
#include "locate.p"
begin
  advance := -1;
  if (lin[i] = ENDSTR) then
    omatch := false
  else if (not (pat[j] in
    [LITCHAR, BOL, EOL, ANY, CCL, NCCL, CLOSURE])) then
      error('in omatch: can''t happen')
  else
    case pat[j] of (2)
      LITCHAR:
        if (lin[i] = pat[j+1]) then
          advance := 1;
      BOL:
        if (i = 1) then
          advance := 0;
      ANY:
        if (lin[i] <> NEWLINE) then
          advance := 1;
      EOL:
        if (lin[i] = NEWLINE) then
          advance := 0;
      CCL:
        if (locate(lin[i], pat, j+1)) then
          advance := 1;
      NCCL:
        if (lin[i] <> NEWLINE)
           and (not locate(lin[i], pat, j+1)) then
             advance := 1
    end;
  if (advance >= 0) then begin (3)
    i := i + advance;
    omatch := true
  end
  else
    omatch := false
end;
  1. lin is the input line, i the current position in it, pat is the encoded pattern, and j is the index of the pattern opcode to evaluate

  2. The pattern tests, most of which are inline, use either the current input character or the index position. If the test passes, advance is set to the number of input characters the test 'consumes'.

  3. If advance has been set, the test must have passed and processing can continue.

Since a closure applies to the preceding pattern - a* means zero or more a’s - their parsing routine inserts the closure opcode into the compiled sequence before the opcode to which it applies. Extending our earlier example

%[^x]?[0-9]*x$

encodes as

BOL
NCCL 1 'x'
ANY
CLOSURE
CCL 10 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
LITCHAR 'x'
EOL

Evaluating the closure happens at the level above the omatch function -

amatch
function amatch (var lin : string; offset : integer;
        var pat : string; j : integer) : integer;
var
  i, k : integer;
  done : boolean;
#include "omatch.p"
#include "patsize.p"
begin
  done := false;
  while (not done) and (pat[j] <> ENDSTR) do
    if (pat[j] = CLOSURE) then begin
      j := j + patsize(pat, j);  { step over CLOSURE }
      i := offset;
      while (not done) and (lin[i] <> ENDSTR) do (1)
        if (not omatch(lin, i, pat, j)) then
          done := true;
      done := false;
      while (not done) and (i >= offset) do begin (2)
        k := amatch(lin, i, pat, j+patsize(pat,j));
        if (k > 0) then  (3)
          done := true
        else
          i := i - 1
      end;
      offset := k;  { if k = 0 failure else success }
      done := true
    end
    else if (not omatch(lin, offset, pat, j)) then begin
      offset := 0;    { non-closure }
      done := true
    end
    else  { omatch succeeded on this pattern element }
      j := j + patsize(pat, j);
  amatch := offset
end;
  1. Match as many as possible.

  2. Index i gives us the character that made us fail, so now try to match the rest of the pattern.

  3. If k > 0 we’ve succeeded and can finish. Otherwise, step back and try again.

Kernighan and Plauger’s implementation is, in essence, a tiny compiler and virtual machine embedded within the find program. It’s really quite neat.

find in C++

The Players

matcher
using match_fn = std::function<bool(character_sequence const&)>;
using match_fn_with_consume = std::pair<match_fn, bool>;

class matcher {
public:
  bool match(character_sequence const& candidate) const;
  bool consumes() const;

private:
  explicit matcher(match_fn_with_consume fn);

  friend matcher make_matcher(character_sequence&);
};

A matcher is one pattern within a larger pattern - a single literal character, a character class, etc. Rather than build a base_matcher class with a cloud of derived classes - literal_matcher, etc - around it, which the younger me might well have leapt into, we have a single matcher class with instances passed their specific behaviour at construction time.[2]
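Stripped of the pattern-matching detail, the construction-time behaviour trick looks like this. The predicate class here is purely illustrative, not part of the stiX code:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// One class whose behaviour is injected at construction,
// rather than a hierarchy of derived classes
class predicate {
public:
  explicit predicate(std::function<bool(char)> fn)
    : fn_(std::move(fn)) { }

  bool operator()(char c) const { return fn_(c); }

private:
  std::function<bool(char)> fn_;
};

// Behaviour can come from a plain function ...
bool is_digit(char c) { return c >= '0' && c <= '9'; }

// ... or from a factory returning a capturing lambda
predicate is_char(char target) {
  return predicate([target](char c) { return c == target; });
}
```

Because the behaviour lives in a copyable std::function, a predicate copies like any other value - no pointers, no cloning.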

For some of our matchers, the matcher instance holds a simple function pointer

bool is_bol(character_sequence const& c) {
  return c.is_bol();
}

// ---

if (c == match_beginning_of_line && characters.is_bol())
  return { is_bol, false };

while for others, it holds a lambda function which, in turn, captures the value of interest

auto is_char_matcher(char target) {
  return [target](character_sequence const& c) {
    return *c == target;
  };
}

// ---

return { is_char_matcher(c), true };

All the functions implementing the tests are pretty trivial

bool is_any_char(character_sequence const& c) {
  return !c.is_eol() && (*c != '\n');
}

bool is_bol(character_sequence const& c) {
  return c.is_bol();
}

bool is_eol(character_sequence const& c) {
  return c.is_eol();
}

auto is_char_matcher(char target) {
  return [target](character_sequence const& c) {
    return *c == target;
  };
}

auto is_one_of_matcher(std::string const& targets) {
  return [targets](character_sequence const& c) {
    return targets.find(*c) != std::string::npos;
  };
}

auto is_not_one_of_matcher(std::string const& targets) {
  return [targets](character_sequence const& c) {
    return !c.is_eol() && (targets.find(*c) == std::string::npos);
  };
}

The second parameter to the matcher constructor, bool consume, is a flag which indicates, should the match function return true, whether we should consume the current input character. Generally we do, but the special cases of % and $, the start of line and end of line patterns, are zero-width matches. If we match the start of the line, we don’t want to advance along the input, as the next pattern needs to be tried against the first character.

matcher instances are created by make_matcher, which is a simple parser function.

matcher make_matcher(character_sequence& characters) {
  return matcher(make_matcher_fn(characters));
}

match_fn_with_consume make_matcher_fn(character_sequence& characters) {
  char c = *characters;

  if (c == match_any_char)
    return { is_any_char, true };

  if (c == match_beginning_of_line && characters.is_bol())
    return { is_bol, false };

  if (c == match_end_of_line && !characters.available())
    return { is_eol, false };

  if (c == start_of_class)
    return make_character_class_matcher(characters);

  if (c == stiX::Escape)
    c = escape_char(characters);

  return { is_char_matcher(c), true };
}

The particularly lovely thing about std::function, the frankly magical general-purpose polymorphic function wrapper, is that it’s copyable. Since the matcher class only has one other data member, our simple consumes flag, that means matcher is also copyable. We can, therefore, have polymorphic behaviour while retaining value semantics. We don’t have to worry about dynamic allocation, slicing, cloning, or any of the rest of that difficult stuff. We can just create a matcher instance and pass it around in the same way we might an int.[3]

pattern_matcher
enum class match_count {
  one,
  zero_or_more
};
struct pattern {
  matcher test;
  match_count count;
};

using patterns = std::vector<pattern>;

class pattern_matcher {
public:
  bool match(std::string const& line) const;

private:
  explicit pattern_matcher(patterns m);

  friend pattern_matcher compile_pattern(std::string const&);
};

A pattern_matcher instance describes the complete pattern we want to test, captured as a list of matcher objects. For each matcher, the pattern_matcher also captures its match_count, which tells us whether there’s a Kleene star in play or not.

Instances of pattern_matcher are created by compile_pattern.

pattern_matcher compile_pattern(std::string const& pattern) {
  auto matches = patterns { };

  for(auto seq = character_sequence(pattern); !seq.is_eol(); seq.advance()) {
    if (*seq == kleene_star && !matches.empty()) { (1)
      matches.back().count = match_count::zero_or_more;
      continue;
    }

    matches.emplace_back(make_matcher(seq));
  }

  return pattern_matcher(matches);
}
  1. A * at the start of the pattern is treated as a literal *.

We’ve seen how patterns are compiled, but how do we test them against the input line? Well, exactly as described earlier. Try the pattern against the input, return true if it hits, if not move along and try again. If we run out of input, return false because we’ve failed.

bool pattern_matcher::match(std::string const& line) const {
  bool once = true; // need to try at least once, because even zero length input might match
  for (auto seq = character_sequence(line); !seq.is_eol() || once; seq.advance(), once = false) {
    if (match_all(m_.begin(), m_.end(), seq))
      return true;
  }
  return false;
}

bool match_all(
  match_stages_iter mbegin,
  match_stages_iter const& mend,
  character_sequence seq
) {
  for(auto m = mbegin; m != mend; ++m) {
    switch (m->count) {
      case stiX::match_count::one:
        if (!match_one(m->test, seq))
          return false;
        break;
      case stiX::match_count::zero_or_more:
        return match_with_closure(m, mend, seq);
    }
  }
  return true;
}

bool match_one(
  matcher const& matcher,
  character_sequence& seq
) {
  if (!matcher.match(seq))
    return false;
  if (matcher.consumes())
    seq.advance();
  return true;
}

bool match_with_closure(
  match_stages_iter mbegin,
  match_stages_iter const& mend,
  character_sequence seq
) {
  seq.checkpoint();
  while(match_one(mbegin->test, seq));
  ++mbegin;
  do {
    if (match_all(mbegin, mend, seq))
      return true;
  } while(seq.rewind());

  return false;
}

Note how match_with_closure recurses back into match_all, and where the character_sequence is passed by value and where it’s passed by reference.

As we might expect, match_with_closure turns out to have a similar shape to the corresponding piece of Kernighan and Plauger’s code, albeit rather terser.

Kernighan and Plauger’s Pascal:

i := offset;
while (not done) and (lin[i] <> ENDSTR) do
  if (not omatch(lin, i, pat, j)) then
    done := true;
done := false;
while (not done) and (i >= offset) do begin
  k := amatch(lin, i, pat, j+patsize(pat,j));
  if (k > 0) then
    done := true
  else
    i := i - 1
end;
offset := k;
done := true

My C++:

seq.checkpoint();
while(match_one(mbegin->test, seq));
++mbegin;
do {
  if (match_all(mbegin, mend, seq))
    return true;
} while(seq.rewind());

return false;
character_sequence
  class character_sequence {
  public:
    explicit character_sequence(std::string input);

    bool is_bol() const;
    bool is_eol() const;
    bool available() const;

    char operator *() const;

    void advance();

    void checkpoint();
    bool rewind();
  };

When we want to scan through a string or other standard container, we often pass around an iterator pair to delimit the sequence of interest. I could have done that here, but it would just have cluttered things up with additional function parameters and iterator comparisons and what-not. Instead, character_sequence effectively wraps the iterator pair up together, along with some little utility functions, into one handy package.

The only slightly unusual functions are checkpoint and rewind, which are used in match_with_closure. The checkpoint function bookmarks a position in the sequence. Following one or more calls to advance, we can use rewind to walk back to the bookmark one character at a time.
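I haven’t reproduced the implementation here, but a minimal sketch of how checkpoint and rewind might work - a simplified, hypothetical version of the class in the stiX repository, holding a plain index into the string - looks something like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// A simplified sketch of character_sequence - the real class
// in the stiX repository may well differ in detail
class character_sequence {
public:
  explicit character_sequence(std::string input)
    : input_(std::move(input)) { }

  bool is_bol() const { return pos_ == 0; }
  bool is_eol() const { return pos_ == input_.size(); }

  char operator*() const { return input_[pos_]; }
  void advance() { if (!is_eol()) ++pos_; }

  // bookmark the current position ...
  void checkpoint() { mark_ = pos_; }
  // ... then walk back towards it one character at a time,
  // returning false once we can retreat no further
  bool rewind() {
    if (pos_ == mark_)
      return false;
    --pos_;
    return true;
  }

private:
  std::string input_;
  std::size_t pos_ = 0;
  std::size_t mark_ = 0;
};
```

In match_with_closure, checkpoint marks where the closure started eating input, and each rewind surrenders one character before the remaining patterns are retried.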

Putting it all together

void stiX::find(std::istream& in, std::ostream& out, std::string const& pattern) {
  auto matcher = stiX::compile_pattern(pattern);

  while(in) {
    auto line = stiX::getline(in);

    if (matcher.match(line))
      out << line << '\n';
  }
}

int main(int argc, char const* argv[]) {
  auto arguments = stiX::make_arguments(argc, argv);
  if (arguments.size() != 1) {
    std::cout << argv[0] << " pattern" << std::endl;
    return 0;
  }

  stiX::find(std::cin, std::cout, arguments.front());
}

I’m pleased with this, especially the matcher. The whole program is approximately 230 lines long. About half of that is given over to parsing and building the pattern, with a bit less than half to the matcher evaluation. The evaluation itself is straightforward, with the only real complexity being in the closure. Yes, pleased.

Hold on, doesn’t C++ have a regular expression library as part of the standard?

Yes, and has done for over a decade.

Why not use that?

Well, ok then.

find_cheating
int main(int argc, char const* argv[]) {
  auto arguments = stiX::make_arguments(argc, argv);
  if (arguments.size() != 1) {
    std::cout << argv[0] << " pattern" << std::endl;
    return 0;
  }

  auto modern_pattern = translate_pattern(arguments.front());
  auto matcher = std::regex { modern_pattern,  std::regex_constants::ECMAScript }; (1)

  while(std::cin) {
    auto line = stiX::getline(std::cin);

    if (std::regex_search(line, matcher))
      std::cout << line << '\n';
  }
}

std::string translate_pattern(std::string const& pattern) {
  auto fixedUpPattern = std::string { };
  auto escaped = false;
  auto classStart = false;

  for (size_t i = 0; i != pattern.size(); ++i) {
    char c = pattern[i];

    if (c == '%') {
      fixedUpPattern += (i != 0) ? c : '^';
    } else if (c == '@' && !escaped) {
      escaped = true;
      fixedUpPattern += '\\';
    } else if (c == '?' && !escaped) {
      fixedUpPattern += '.';
    } else if (c == '^' && !classStart) {
      fixedUpPattern += "\\^";
    } else if (c == '\\' || c == '.' || c == '+') { (2)
      fixedUpPattern += ("\\"s + c);
    } else {
      classStart = (c == '[');
      escaped = false;
      fixedUpPattern += c;
    }
  }

  return fixedUpPattern;
}
  1. std::regex supports a number of different grammars. I’m using the ECMAScript grammar, because I originally wrote this in JavaScript.

  2. These are special characters in the ECMAScript grammar, so we need to escape them to prevent surprising behaviour.

It’s not perfect, but it’ll do.

I’ve almost never worked on anything that was performance critical, but std::regex is widely considered to be slow as all hell, so take care with it. My very rough timings suggest it’s significantly slower than my naive efforts, even on small inputs of under 1000 lines.

References

  • std::function

  • std::regex

  • CTRE - Compile Time Regular Expressions. Yes, you read that right. If you know the regexes you need ahead of time, this remarkable library is for you.

  • operator""s - std::string literals

  • grep, of which find is a pale imitation

  • Mastering Regular Expression is exhaustive without being exhausting. Properly mind-expanding stuff.

Source Code

The full source code, with tests and what-not, for find and find_cheating, indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrains' CLion, which I liked enough to buy and twice renew a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


1. Even if only "I know, I’ll use regular expressions." Now they have two problems, which, as so often, is a quote long since shorn of its context
2. Yes, patterns fans! It’s Strategy.
3. Dehnert and Stepanov describe this as a regular type, which matches the built-in type semantics, thereby making our user-defined types behave like built-in types. Scott Meyers puts it more succinctly: When in doubt, do as the ints do

Tagged code, and software-tools-in-c++

Saturday 29 May 2021 The Forest Road Reader, No 2.64 : STinC++ - Chapter Four Wrap Up

STinC++ rewrites the programs in Software Tools in Pascal using C++

This chapter is called Sorting and, by Jove, did we do some sorting. Three different sort algorithms and two separate text sort programs, one entirely in-memory, the other using temporary backing files so it can cope with arbitrarily enormous files. Finally we rounded out with a couple of examples of things having a sort utility enables, removing duplicates and the surprisingly fiddly keyword-in-context.

I enjoyed this chapter actually, especially implementing the various sort algorithms. Unless you’re doing something absurdly specialist (or an utterly ludicrous job interview) there’s no need to write your own sort routine. Because I’m neither absurdly specialist (quite the reverse, generalism is my thing) nor do I really do interviews (we’ll have a conversation, but I’m not doing tricks), I never have written a sort before. Working through quicksort was particularly illuminating, in that before I had a vague idea of how it worked, and now I actually understand it.

I also enjoyed the coincidence of starting to work on keyword-in-context at exactly the time there was a minor Twitter excitement about it, even if it did subsequently take me six weeks to write the blessed thing up.

Hitting the end of Chapter Four marks the half-way point of Software Tools in Pascal, just shy of two years since starting this ridiculous little project. Thanks for coming along.

Chapter Five is entitled Text Patterns, I’m 52 today, and I’m excited. I’ve been looking forward to this chapter from the start.


Tagged code, and software-tools-in-c++

Wednesday 26 May 2021 The Forest Road Reader, No 2.63 : STinC++ - kwic and unrotate

STinC++ rewrites the programs in Software Tools in Pascal using C++

About a month ago, Alistair Cockburn tweeted a little extract from a paper by David Parnas:

The following description of a KWIC index will suffice for this paper. The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeatedly removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order.

This is a small system. Except under extreme circumstances (huge data base, no supporting software), such a system could be produced by a good programmer within a week or two.

Cockburn didn’t give the reference, but I think this is from On the Criteria To Be Used in Decomposing Systems into Modules, published in CACM in December 1972.

What Cockburn picked up on was "a week or two" and asked, fifty years on,

How long does it take you?

It’s an entertaining little thread with answers in the 15 minutes to a few hours range, and you can judge for yourself what you make of the proffered code. Friend of FRR[1] Ron Jeffries wrote up his work, concluding I just have one question: who should I bill for 40 to 80 hours for this thing?

Permuted Index

Cockburn sent his tweet almost exactly as I was getting to work on the final section of Chapter 4, which begins

Once a flexible sorting program is available, other programs can use it as a component. In this section we will describe one such application, a program for creating a permuted index (often called a keyword-in-context or "KWIC" index). A permuted index lists each word (or some similar useful token) in its original context, but sorted by word and rearranged so the keywords line up.

What a turn up! Kernighan and Plauger ask for only slightly more work than Parnas, and while I didn’t keep a detailed tally I too was very comfortably short of Parnas' week or two.

On the basis that we already have a sort program, Kernighan and Plauger propose a pipelined solution.

  create rotations | sort | unrotate and print

To me this feels a little forced, but it reinforces their message of tool composition (and, of course, we could rehash the arguments about available memory, Pascal’s lack of a string type and poor support for libraries, etc, etc), so I’m happy to go with it.

kwic

The first program is kwic which generates the rotated lines.

PROGRAM

kwic produce lines for KWIC index

USAGE

kwic

FUNCTION

kwic writes one or more "folded" versions of each input line to its output. A line is "folded" at the beginning of each alphanumeric string within the line by writing from that string through the end of the line, followed by the fold character $, followed by the beginning of the line.

kwic is used with sort and unrotate to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

kwic
This is a test.
<ENDFILE>
This is a test.$
is a test.$This
a test.$This is
test.$This is a

Normal usage is

 kwic < document | sort | unrotate

Let’s get straight into it …​

int main() {
  stiX::kwic(std::cin, std::cout, '$');
}

void stiX::kwic(std::istream& in, std::ostream& out, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);

    auto rotations = generate_rotations(line, fold_marker);

    for (auto& rotation: rotations)
      out << rotation << "\n";
  };
}

std::vector<std::string> generate_rotations(
    const std::string &line,
    char fold_marker) {
  if (line.empty())
    return { };

  auto words = split_into_words(line);

  auto rotations = std::vector<std::string> { };

  size_t fold_word = words.size() - 1; (1)
  for (
    size_t rotation = 0;
    rotation != words.size();
    ++rotation, --fold_word
  ) {
    rotations.emplace_back(make_rotated_line(words, fold_word, fold_marker));
    std::rotate(words.begin(), words.begin()+1, words.end());
  }

  return rotations;
}

std::vector<std::string> split_into_words(std::string const& input) {
  auto iss = std::istringstream { input };
  auto words = std::vector<std::string> { (2)
    std::istream_iterator<std::string>{iss},
    std::istream_iterator<std::string>()
  };
  return words;
}

std::string make_rotated_line(
    std::vector<std::string> const& words,
    size_t fold_word,
    char fold_marker) {
  auto out = std::ostringstream { };

  for (size_t i = 0; i != words.size(); ++i) {
    out << words[i];
    out << ((i == fold_word) ? fold_marker : ' '); (3)
  }

  return out.str();
}

kwic is straightforward enough - read a line, create all the rotations, write those back out.

There’s very little to surprise here, I think.

  1. The only bit that’s even slightly involved is placing the fold marker character $ in the appropriate place, and even that’s not especially difficult. We know a line with N words in it must have N rotations. We also know the line we have is the first rotation. Therefore for the first line, the fold marker goes at the end of the line after the last word. As we then step through the remaining rotations, removing the word at the front and popping it on the end of the line, the fold marker moves one step towards the front each time.

  2. I chose to create the rotations by splitting the line into words (or at least whitespace delimited alphanumeric chunks), and manipulating that list of words. Splitting a string into its constituent words always feels fiddlier than it should be in C++, and here I’ve used the classic technique of letting our old friend istream_iterator<std::string> handle it for me.

  3. Putting the rotated words back together again is simple enough - smash them together, popping in the fold character $ instead of space at the appropriate point.

This was a fun little program, made all the easier by std::rotate sitting there in the standard library.

unrotate

PROGRAM

unrotate format lines for KWIC index

USAGE

unrotate

FUNCTION

unrotate reads its input a line at a time and writes an "unfolded" version to its output. A line is "folded" if it contains within it an instance of the fold character $; "unfolding" involves writing from the end of the line down to but not including the fold character, starting in column 39 of the output line, wrapping characters that would thus appear before column 1 around to the end of the line, then writing the remainder of the line starting at column 41 and wrapping around at column 80 if necessary.

unrotate is used with kwic and sort to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

unrotate feels like it should be a doddle, but Kernighan and Plauger’s long-line wrapping requirement sent me off into a couple of mazey passages. Initially, I misread the requirement, thinking the text could only wrap one way - that is, if the text in the first half of the output is too long, it should wrap around onto the end of the output. I then wrote some rather overwrought code to do that, before realising that my code, while correct for what it did, was just doing the wrong thing. The output can wrap both ways - if the second half is too long, it wraps around to the start of the output as well, as the little man page for the program clearly shows.

I tried further tricky schemes, involving all kinds of clever string slicing, offset calculations, and so on, wondering how to resolve the problem of text wrapping both ways, when I had my second realisation. Kernighan and Plauger model strings as fixed-length buffers. They’re never going to have a string longer than their usual 80 characters, so their unfolded lines can only ever wrap at one end - never both.

My solution to overlength lines was to trim any excess from one or both ends, before unfolding. With that in place, everything else fell out nicely.

Unrotating
int main() {
  stiX::unrotate(std::cin, std::cout, 80, '$');
}

void stiX::unrotate(std::istream &in, std::ostream &out, size_t line_length, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);
    out << unrotate_line(line, line_length, fold_marker) << '\n';
  }
}

struct output_sizes { (1)
  size_t const full_width;
  size_t const available;
  size_t const half;
  // ...

  output_sizes(size_t width):
    full_width(width), (2)
    available(width - gap), (3)
    half(available / 2),
    ... { (4)
  }
};

std::string stiX::unrotate_line(
  std::string const& input_line,
  size_t output_line_length,
  char fold_marker
) {
  auto fold_position = input_line.find(fold_marker);
  if (fold_position == std::string::npos) (5)
    return input_line;

  auto output_length = output_sizes { output_line_length };
  if (input_line.length() > output_length.available) (6)
    return trim_and_unrotate(
      input_line,
      fold_position,
      output_length,
      fold_marker);

  auto output_line = std::string(output_line_length, ' '); (7)

  // copy after the fold into first half of the output
  auto after_fold = input_line.substr(fold_position+1);
  copy_and_wrap_left(output_line, after_fold, output_length);

  // copy before the fold into the second half of the output
  auto before_fold = input_line.substr(0, fold_position);
  copy_and_wrap_right(output_line, before_fold, output_length);

  return trim_trailing_blanks(output_line);
}
  1. This little struct wraps up several values all related to the output line length, actual usable space, and where that space is

  2. the full width of the output - 80 characters according to Kernighan and Plauger

  3. some of that width is taken up with a little gap of spaces, so this is the space available to use

  4. we compose the output in two pieces; this is the space available to each of those pieces

  5. Do we actually have to do any work?

  6. Oh, is this line a bit on the large side?

  7. If we’ve made it this far - which hopefully will be most of the time - the processing is straightforward. Recall

unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

The portion of the input string after the fold marker actually appears on the left-hand side of the output, while the portion of the input string before the marker appears on the right-hand side of the output. So, having created a buffer of the appropriate size, we divide the input into those two pieces and copy them into place.

Wrapping

We handle the left hand side of the output first. We actually start at the end of the after fold text (a position known in C++ circles as rbegin()), which we copy into the right-most position in the first half of the output. We then work left, back towards the beginning of the after fold text. If we hit the start of the output buffer as we go, we wrap around to the far end. This, simply and safely, wraps lengthy text without having to do tricky string slicing, or working out leading or trailing padding.

void copy_and_wrap_left(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.first_half_begin;
  for (
    auto ai = input.rbegin();
    ai != input.rend() && output_index != output_length.first_half_end;
    ++ai, --output_index
  ) {
    output_line[output_index] = *ai;
    if (output_index == output_length.first_half_wrap_at)
      output_index = output_length.first_half_wrap_to;
  }
}

We handle the before fold text in similar fashion, except this time we start at the beginning of that text, placing it at the leftmost position in the second half of the output and working to the right. If we hit the end of the buffer, we just wrap around to the beginning.

void copy_and_wrap_right(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.second_half_begin;
  for (
    auto bi = input.begin();
    bi != input.end() && output_index != output_length.second_half_end;
    ++bi
  ) {
    output_line[output_index] = *bi;

    ++output_index;
    if (output_index == output_length.second_half_wrap_at)
      output_index = output_length.second_half_wrap_to;
  }
}
Long Lines

That only leaves the vexed question of long lines.

Let’s say we have a folded line that looks like this

long.$This folds at the end and the rest is quite

and we want to unfold it into an output line only 20 characters long. The output should look like

 is quite  long. Thi

By chopping out the excess characters from the middle of the input, the copy and wrap processes we’ve just looked at will be fine.

long.$This folds at the end and the rest is quite
---------                               ---------

Keep the underlined parts, drop the middle to give

long.$Thi is quite

The abrupt cut-off at the end of This falls where the text wraps. This chopping out works when the fold marker is within half the output width of either end. When the fold marker falls somewhere in the middle, we have to be a bit cleverer.

is quite long.$This folds in the middle and the rest

should give an output of

 the rest  is quite

We still want either end of the input, but we will need to insert a new fold marker to ensure the output unfolds properly.

is quite long.$This folds in the middle and the rest
---------                                   --------

The function trim_and_unrotate handles the long line case, applying the appropriate trimming, then calling back into unrotate_line.

std::string trim_and_unrotate(
  std::string const& input_line,
  size_t fold_position,
  output_sizes const& output_length,
  char fold_marker
) {
  auto excess = input_line.length() - output_length.available;

  auto needs_double_trim =
    (fold_position >= output_length.half) &&
    (fold_position <= (input_line.length() - output_length.half));

  auto trimmed = needs_double_trim
    ? input_line.substr(0, output_length.half) +
      fold_marker +
      input_line.substr(input_line.length() - (output_length.half-1))
    : input_line.substr(0, output_length.half) +
      input_line.substr(output_length.half + excess);

  return stiX::unrotate_line(trimmed, output_length.full_width, fold_marker);
}
Thrashing

I feel like I’ve ended up with quite an elegant solution. The functions are short and straightforward, and their conditionals are contained. It did take a little while to get there, especially after my initial incorrect solution, but sometimes that’s the nature of the beast. It’s hard to tell, but I’d guess I’ve spent about 5 hours in total, mainly on the output formatting. Still comfortably within Parnas' week or two. Oh, yea. I still got it.

If you feel so motivated, the git history will give some indication of how I thrashed about on this with partial and incorrect solutions before I finally caught on.

Modularisation

The Parnas paper Cockburn cites above, and which it turns out Kernighan and Plauger also cite in their bibliography for this chapter, isn’t really about KWIC at all. That’s merely a proxy for a discussion about modularisation, and the benefits of two alternative ways of modularising a KWIC program. At first glance, Kernighan and Plauger’s pipeline might appear to be what Parnas describes as approximately the decomposition which would be proposed by most programmers for the task specified. He was, however, talking about modules within a single program, rather than cooperating programs. His alternative modularisation is based on information hiding rather than tasks: if we want to change this particular aspect of the program, he asks, which of our modules would be touched? Kernighan and Plauger’s solution also practises very strong information hiding, as the processing within each step is opaque to its predecessor and successor steps. I assume Parnas would have approved. I wonder if Parnas, Kernighan, and Plauger ever discussed it.

KWIC for the opening section of this article

kwic < article-opening.txt | sort | unrotate

KWIC indexes are long!

Source Code

The full source code, with tests and what-not, for kwic and unrotate, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrains' CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


1. Well, in my head at least

Tagged code, kwic, and software-tools-in-c++