Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Wednesday 26 May 2021 The Forest Road Reader, No 2.63 : STinC++ - kwic and unrotate

STinC++ rewrites the programs in Software Tools in Pascal using C++

About a month ago, Alistair Cockburn tweeted a little extract from a paper by David Parnas:

The following description of a KWIC index will suffice for this paper. The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeated removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order.

This is a small system. Except under extreme circumstances (huge data base, no supporting software), such a system could be produced by a good programmer within a week or two.

Cockburn didn’t give the reference, but I think this is from On the Criteria To Be Used in Decomposing Systems into Modules, published in CACM in December 1972.

What Cockburn picked up on was "a week or two" and asked, fifty years on,

How long does it take you?

It’s an entertaining little thread with answers in the 15 minutes to a few hours range, and you can judge for yourself what you make of the proffered code. Friend of FRR[1] Ron Jeffries wrote up his work, concluding I just have one question: who should I bill for 40 to 80 hours for this thing?

Permuted Index

Cockburn sent his tweet almost exactly as I was getting to work on the final section of Chapter 4, which begins

Once a flexible sorting program is available, other programs can use it as a component. In this section we will describe one such application, a program for creating a permuted index (often called a keyword-in-context or "KWIC" index). A permuted index lists each word (or some similar useful token) in its original context, but sorted by word and rearranged so the keywords line up.

What a turn up! Kernighan and Plauger ask for only slightly more work than Parnas, and while I didn’t keep a detailed tally I too was very comfortably short of Parnas' week or two.

On the basis that we already have a sort program, Kernighan and Plauger propose a pipelined solution.

  create rotations | sort | unrotate and print

To me this feels a little forced, but it reinforces their message of tool composition (and, of course, we could rehash the arguments about available memory, Pascal’s lack of a string type and poor support for libraries, etc, etc), so I’m happy to go with it.

kwic

The first program is kwic which generates the rotated lines.

PROGRAM

kwic produce lines for KWIC index

USAGE

kwic

FUNCTION

kwic writes one or more "folded" versions of each input line to its output. A line is "folded" at the beginning of each alphanumeric string within the line by writing from that string through the end of the line, followed by the fold character $, followed by the beginning of the line.

kwic is used with sort and unrotate to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

kwic
This is a test.
<ENDFILE>
This is a test.$
is a test.$This
a test.$This is
test.$This is a

Normal usage is

 kwic < document | sort | unrotate

Let’s get straight into it …​

int main() {
  stiX::kwic(std::cin, std::cout, '$');
}

void stiX::kwic(std::istream& in, std::ostream& out, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);

    auto rotations = generate_rotations(line, fold_marker);

    for (auto& rotation: rotations)
      out << rotation << "\n";
  };
}

std::vector<std::string> generate_rotations(
    const std::string &line,
    char fold_marker) {
  if (line.empty())
    return { };

  auto words = split_into_words(line);

  auto rotations = std::vector<std::string> { };

  size_t fold_word = words.size() - 1; (1)
  for (
    size_t rotation = 0;
    rotation != words.size();
    ++rotation, --fold_word
  ) {
    rotations.emplace_back(make_rotated_line(words, fold_word, fold_marker));
    std::rotate(words.begin(), words.begin()+1, words.end());
  }

  return rotations;
}

std::vector<std::string> split_into_words(std::string const& input) {
  auto iss = std::istringstream { input };
  auto words = std::vector<std::string> { (2)
    std::istream_iterator<std::string>{iss},
    std::istream_iterator<std::string>()
  };
  return words;
}

std::string make_rotated_line(
    std::vector<std::string> const& words,
    size_t fold_word,
    char fold_marker) {
  auto out = std::ostringstream { };

  for (size_t i = 0; i != words.size(); ++i) {
    out << words[i];
    out << ((i == fold_word) ? fold_marker : ' '); (3)
  }

  return out.str();
}

kwic is straightforward enough - read a line, create all the rotations, write those back out.

There’s very little to surprise here, I think.

  1. The only bit that’s even slightly involved is placing the fold marker character $ in the appropriate place, and even that’s not especially difficult. We know a line with N words in must have N rotations. We also know the line we have is the first rotation. Therefore for the first line, the fold marker goes at the end of the line after the last word. As we then step through the remaining rotations, removing the word at the front and popping it on the end of the line, the fold marker moves one step towards the front each time.

  2. I chose to create the rotations by splitting the line into words (or at least whitespace delimited alphanumeric chunks), and manipulating that list of words. Splitting a string into its constituent words always feels fiddlier than it should be in C++, and here I’ve used the classic technique of letting our old friend istream_iterator<std::string> handle it for me.

  3. Putting the rotated words back together again is simple enough - smash them together, popping in the fold character $ instead of space at the appropriate point.

This was a fun little program, made all the easier by std::rotate sitting there in the standard library.

unrotate

PROGRAM

unrotate format lines for KWIC index

USAGE

unrotate

FUNCTION

unrotate reads its input a line at a time and writes an "unfolded" version to its output. A line is "folded" if it contains within it an instance of the fold character $; "unfolding" involves writing from the end of the line down to but not including the fold character, starting in column 39 of the output line, wrapping characters that would thus appear before column 1 around to the end of the line, then writing the reminder of the line starting at column 41 and wrapping around at column 80 if necessary.

unrotate is used with kwic and sort to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

unrotate feels like it should be a doddle, but Kernighan and Plauger’s long-line wrapping requirement sent me off into a couple of mazey passages. Initially, I misunderstood the requirement to only wrap one way - that is if the text in the first half of the output is too long it should wrap around onto the end of the output. I then wrote some rather overwrought code to do that before realising that my code, while correct for what it did, was just doing the wrong thing. The output can wrap both ways - if the second half is too long, it wraps around to start of the output as well. As clearly shown in the little man page for the program.

I tried further tricky schemes, involving all kinds of clever string slicing, offset calculations and so on, wondering how to resolve the problem of text wrapping both ways, when I had my second realisation. Kernighan and Plauger model strings as fixed length buffers. They’re never going to have a string longer than their usual 80 characters, consequently their unfolded lines can only ever wrap at one end. It can’t be both.

My solution to overlength lines was to trim any excess from one or both ends, before unfolding. With that in place, everything else fell out nicely.

Unrotating
int main() {
  stiX::unrotate(std::cin, std::cout, 80, '$');
}

void stiX::unrotate(std::istream &in, std::ostream &out, size_t line_length, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);
    out << unrotate_line(line, line_length, fold_marker) << '\n';
  };
}

struct output_sizes { (1)
  size_t const full_width;
  size_t const available;
  size_t const half;
  // ...

  output_sizes(size_t width):
    full_width(width), (2)
    available(width - gap), (3)
    half(available / 2),
    ... { (4)
  }
};

std::string stiX::unrotate_line(
  std::string const& input_line,
  size_t output_line_length,
  char fold_marker
) {
  auto fold_position = input_line.find(fold_marker);
  if (fold_position == std::string::npos) (5)
    return input_line;

  auto output_length = output_sizes { output_line_length };
  if (input_line.length() > output_length.available) (6)
    return trim_and_unrotate(
      input_line,
      fold_position,
      output_length,
      fold_marker);

  auto output_line = std::string(output_line_length, ' '); (7)

  // copy after the fold into first half of the output
  auto after_fold = input_line.substr(fold_position+1);
  copy_and_wrap_left(output_line, after_fold, output_length);

  // copy before the fold into the second half of the output
  auto before_fold = input_line.substr(0, fold_position);
  copy_and_wrap_right(output_line, before_fold, output_length);

  return trim_trailing_blanks(output_line);
}
  1. This little struct wraps up several values all related to the output line length, actual usable space, and where that space is

  2. the full width of the output - 80 characters according to Kernighan and Plauger

  3. some of that width is taken up with a little gap of spaces, so this is the space available to use

  4. we compose the output in two pieces, this is the space available to each of those pieces

  5. Do we actually have to do any work?

  6. Oh, is this line a bit on the large size?

  7. If we’ve made it this far - which hopefully will be most of the time - the processing is straightforward. Recall

l|unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

The portion of the input string after the fold marker actually appears on the left-hand side of the output, while the portion of the input string before the marker appears on the right-hand side of the output. So, having created a buffer of the appropriate size, we divide the input into those two pieces and copy them into place.

Wrapping

We handle the left hand side of the output first. We actually start at the end of the after fold text (a position known in C++ circles at rbegin()), which we copy into the right-most position in the first half of the output. We then work left, back towards the beginning of the after fold text. If we hit the start of the output buffer as we go, we wrap around to the far end. This, simply and safely, wraps lengthy text without having to do tricky string slicing, or working out leading or trailing padding.

void copy_and_wrap_left(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.first_half_begin;
  for (
    auto ai = input.rbegin();
    ai != input.rend() && output_index != output_length.first_half_end;
    ++ai, --output_index
  ) {
    output_line[output_index] = *ai;
    if (output_index == output_length.first_half_wrap_at)
      output_index = output_length.first_half_wrap_to;
  }
}

We handle the before fold text in similar fashion, except this time we start at the beginning of that text, placing it at the leftmost position in the second half of the output and working to the right. If we hit the end of the buffer, we just wrap around to the beginning.

void copy_and_wrap_right(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.second_half_begin;
  for (
    auto bi = input.begin();
    bi != input.end() && output_index != output_length.second_half_end;
    ++bi
  ) {
    output_line[output_index] = *bi;

    ++output_index;
    if (output_index == output_length.second_half_wrap_at)
      output_index = output_length.second_half_wrap_to;
  }
}
Long Lines

That only leaves the vexed question of long lines.

Let’s say we have a folded line that looks like this

long.$This folds at the end and the rest is quite

and we want to unfold it into a output line only 20 characters long. The output should look like

 is quite  long. Thi

By chopping out the excess characters from the middle of the input, the copy and wrap processes we’ve just looked at will be fine.

long.$This folds at the end and the rest is quite
---------                               ---------

Keep the underlined parts, drop the middle to give

long.$Thi is quite

The abrupt cut-off at the end of This falls where the text wraps. This chopping out works when the fold marker is within half the output width of either end. When the fold marker falls somewhere in the middle, we have to be a bit cleverer.

is quite long.$This folds in the middle and the rest

should give an output of

 the rest  is quite

We still want either end of the input, but we will need to insert a new fold marker to ensure the output unfolds properly.

is quite long.$This folds in the middle and the rest
---------                                   --------

The function trim_and_unrotate handles the long line case, applying the appropriate trimming, then calling back into unrotate_line.

std::string trim_and_unrotate(
  std::string const& input_line,
  size_t fold_position,
  output_sizes const& output_length,
  char fold_marker
) {
  auto excess = input_line.length() - output_length.available;

  auto needs_double_trim =
    (fold_position >= output_length.half) &&
    (fold_position <= (input_line.length() - output_length.half));

  auto trimmed = needs_double_trim
    ? input_line.substr(0, output_length.half) +
      fold_marker +
      input_line.substr(input_line.length() - (output_length.half-1))
    : input_line.substr(0, output_length.half) +
      input_line.substr(output_length.half + excess);

  return stiX::unrotate_line(trimmed, output_length.full_width, fold_marker);
}
Thrashing

I feel like I’ve ended up with quite an elegant solution. The functions are short and straightforward, and their conditionals are contained. It did take a little while to get there, especially after my initial incorrect solution, but sometimes that’s the nature of the beast. It’s hard to tell, but I’d guess I’ve spent about 5 hours in total, mainly on the output formatting. Still comfortably within Parnas' week or two. Oh, yea. I still got it.

If you feel so motivated, the git history will give some indication of how I thrashed about on this with partial and incorrect solutions before I finally caught on.

Modularisation

The Parnas paper Cockburn cites above, and which it turns out Kernighan and Plauger also cite in their bibliography for this chapter, isn’t really about KWIC at all. That’s merely a proxy for a discussion about modularisation, and the benefits of two alternative ways of modularising a KWIC program. At first glance, Kernighan and Plauger’s pipeline might appear to be what Parnas' describes as approximately the decomposition which would be proposed by most programmers for the task specified. He was, however, talking about modules within a single program, rather than cooperating programs. His alternative modularisation is based on information hiding rather than tasks. If he asks, we want to change this particular aspect of the program, which of our modules would be touched? Kernighan and Plauger’s solution also practices very strong information hiding, as the processing within each step is opaque to its predecessor and successor steps. I assume Parnas would approved. I wonder if Parnas, Kernighan, and Plauger ever discussed it.

KWIC for the opening section of this article

kwic < article-opening.txt | sort | unrotate

KWIC indexes are long!

Source Code

The full source code, with tests and what-not, for kwic and unrotate, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


1. Well, in my head at least

Tagged code, kwic, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed