Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Saturday 29 May 2021 The Forest Road Reader, No 2.64 : STinC++ - Chapter Four Wrap Up

STinC++ rewrites the programs in Software Tools in Pascal using C++

This chapter is called Sorting and, by Jove, did we do some sorting. Three different sort algorithms and two separate text sort programs, one entirely in-memory, the other using temporary backing files so it can cope with arbitrarily enormous files. Finally we rounded out with a couple of examples of things having a sort utility enables, removing duplicates and the surprisingly fiddly keyword-in-context.

I enjoyed this chapter actually, especially implementing the various sort algorithms. Unless you’re doing something absurdly specialist (or an utterly ludicrous job interview) there’s no need to write your own sort routine. Because I’m neither absurdly specialist (quite the reverse, generalism is my thing) nor do I really do interviews (we’ll have a conversation, but I’m not doing tricks), I never have written a sort before. Working through quicksort was particularly illuminating, in that before I had a vague idea of how it worked, and now I actually understand it.

I also enjoyed the coincidence of starting to work on keyword-in-context at exactly the time there was a minor Twitter excitement about it, even it did subsequently take me six weeks to write the blessed thing up.

Hitting the end of Chapter Four marks the half-way point of Software Tools in Pascal, just shy of two years since starting this ridiculous little project. Thanks for coming along.

Chapter Five is entitled Text Patterns, I’m 52 today, and I’m excited. I’ve been looking forward to this chapter from the start.


Tagged code, and software-tools-in-c++

Wednesday 26 May 2021 The Forest Road Reader, No 2.63 : STinC++ - kwic and unrotate

STinC++ rewrites the programs in Software Tools in Pascal using C++

About a month ago, Alistair Cockburn tweeted a little extract from a paper by David Parnas:

The following description of a KWIC index will suffice for this paper. The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeated removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order.

This is a small system. Except under extreme circumstances (huge data base, no supporting software), such a system could be produced by a good programmer within a week or two.

Cockburn didn’t give the reference, but I think this is from On the Criteria To Be Used in Decomposing Systems into Modules, published in CACM in December 1972.

What Cockburn picked up on was "a week or two" and asked, fifty years on,

How long does it take you?

It’s an entertaining little thread with answers in the 15 minutes to a few hours range, and you can judge for yourself what you make of the proffered code. Friend of FRR[1] Ron Jeffries wrote up his work, concluding I just have one question: who should I bill for 40 to 80 hours for this thing?

Permuted Index

Cockburn sent his tweet almost exactly as I was getting to work on the final section of Chapter 4, which begins

Once a flexible sorting program is available, other programs can use it as a component. In this section we will describe one such application, a program for creating a permuted index (often called a keyword-in-context or "KWIC" index). A permuted index lists each word (or some similar useful token) in its original context, but sorted by word and rearranged so the keywords line up.

What a turn up! Kernighan and Plauger ask for only slightly more work than Parnas, and while I didn’t keep a detailed tally I too was very comfortably short of Parnas' week or two.

On the basis that we already have a sort program, Kernighan and Plauger propose a pipelined solution.

  create rotations | sort | unrotate and print

To me this feels a little forced, but it reinforces their message of tool composition (and, of course, we could rehash the arguments about available memory, Pascal’s lack of a string type and poor support for libraries, etc, etc), so I’m happy to go with it.

kwic

The first program is kwic which generates the rotated lines.

PROGRAM

kwic produce lines for KWIC index

USAGE

kwic

FUNCTION

kwic writes one or more "folded" versions of each input line to its output. A line is "folded" at the beginning of each alphanumeric string within the line by writing from that string through the end of the line, followed by the fold character $, followed by the beginning of the line.

kwic is used with sort and unrotate to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

kwic
This is a test.
<ENDFILE>
This is a test.$
is a test.$This
a test.$This is
test.$This is a

Normal usage is

 kwic < document | sort | unrotate

Let’s get straight into it …​

int main() {
  stiX::kwic(std::cin, std::cout, '$');
}

void stiX::kwic(std::istream& in, std::ostream& out, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);

    auto rotations = generate_rotations(line, fold_marker);

    for (auto& rotation: rotations)
      out << rotation << "\n";
  };
}

std::vector<std::string> generate_rotations(
    const std::string &line,
    char fold_marker) {
  if (line.empty())
    return { };

  auto words = split_into_words(line);

  auto rotations = std::vector<std::string> { };

  size_t fold_word = words.size() - 1; (1)
  for (
    size_t rotation = 0;
    rotation != words.size();
    ++rotation, --fold_word
  ) {
    rotations.emplace_back(make_rotated_line(words, fold_word, fold_marker));
    std::rotate(words.begin(), words.begin()+1, words.end());
  }

  return rotations;
}

std::vector<std::string> split_into_words(std::string const& input) {
  auto iss = std::istringstream { input };
  auto words = std::vector<std::string> { (2)
    std::istream_iterator<std::string>{iss},
    std::istream_iterator<std::string>()
  };
  return words;
}

std::string make_rotated_line(
    std::vector<std::string> const& words,
    size_t fold_word,
    char fold_marker) {
  auto out = std::ostringstream { };

  for (size_t i = 0; i != words.size(); ++i) {
    out << words[i];
    out << ((i == fold_word) ? fold_marker : ' '); (3)
  }

  return out.str();
}

kwic is straightforward enough - read a line, create all the rotations, write those back out.

There’s very little to surprise here, I think.

  1. The only bit that’s even slightly involved is placing the fold marker character $ in the appropriate place, and even that’s not especially difficult. We know a line with N words in must have N rotations. We also know the line we have is the first rotation. Therefore for the first line, the fold marker goes at the end of the line after the last word. As we then step through the remaining rotations, removing the word at the front and popping it on the end of the line, the fold marker moves one step towards the front each time.

  2. I chose to create the rotations by splitting the line into words (or at least whitespace delimited alphanumeric chunks), and manipulating that list of words. Splitting a string into its constituent words always feels fiddlier than it should be in C++, and here I’ve used the classic technique of letting our old friend istream_iterator<std::string> handle it for me.

  3. Putting the rotated words back together again is simple enough - smash them together, popping in the fold character $ instead of space at the appropriate point.

This was a fun little program, made all the easier by std::rotate sitting there in the standard library.

unrotate

PROGRAM

unrotate format lines for KWIC index

USAGE

unrotate

FUNCTION

unrotate reads its input a line at a time and writes an "unfolded" version to its output. A line is "folded" if it contains within it an instance of the fold character $; "unfolding" involves writing from the end of the line down to but not including the fold character, starting in column 39 of the output line, wrapping characters that would thus appear before column 1 around to the end of the line, then writing the reminder of the line starting at column 41 and wrapping around at column 80 if necessary.

unrotate is used with kwic and sort to produce a KeyWord In Context, or KWIC, index.

EXAMPLE

unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

unrotate feels like it should be a doddle, but Kernighan and Plauger’s long-line wrapping requirement sent me off into a couple of mazey passages. Initially, I misunderstood the requirement to only wrap one way - that is if the text in the first half of the output is too long it should wrap around onto the end of the output. I then wrote some rather overwrought code to do that before realising that my code, while correct for what it did, was just doing the wrong thing. The output can wrap both ways - if the second half is too long, it wraps around to start of the output as well. As clearly shown in the little man page for the program.

I tried further tricky schemes, involving all kinds of clever string slicing, offset calculations and so on, wondering how to resolve the problem of text wrapping both ways, when I had my second realisation. Kernighan and Plauger model strings as fixed length buffers. They’re never going to have a string longer than their usual 80 characters, consequently their unfolded lines can only ever wrap at one end. It can’t be both.

My solution to overlength lines was to trim any excess from one or both ends, before unfolding. With that in place, everything else fell out nicely.

Unrotating
int main() {
  stiX::unrotate(std::cin, std::cout, 80, '$');
}

void stiX::unrotate(std::istream &in, std::ostream &out, size_t line_length, char fold_marker) {
  while(in.peek() != eof) {
    auto line = stiX::getline(in);
    out << unrotate_line(line, line_length, fold_marker) << '\n';
  };
}

struct output_sizes { (1)
  size_t const full_width;
  size_t const available;
  size_t const half;
  // ...

  output_sizes(size_t width):
    full_width(width), (2)
    available(width - gap), (3)
    half(available / 2),
    ... { (4)
  }
};

std::string stiX::unrotate_line(
  std::string const& input_line,
  size_t output_line_length,
  char fold_marker
) {
  auto fold_position = input_line.find(fold_marker);
  if (fold_position == std::string::npos)
    return input_line;

  auto output_length = output_sizes { output_line_length };
  if (input_line.length() > output_length.available)
    return trim_and_unrotate(
      input_line,
      fold_position,
      output_length,
      fold_marker);

  auto output_line = std::string(output_line_length, ' ');

  // copy after the fold into first half of the output
  auto after_fold = input_line.substr(fold_position+1);
  copy_and_wrap_left(output_line, after_fold, output_length);

  // copy before the fold into the second half of the output
  auto before_fold = input_line.substr(0, fold_position);
  copy_and_wrap_right(output_line, before_fold, output_length);

  return trim_trailing_blanks(output_line);
}
  1. This little struct wraps up several values all related to the output line length, actual usable space, and where that space is

  2. the full width of the output - 80 characters according to Kernighan and Plauger

  3. some of that width is taken up with a little gap of spaces, so this is the space available to use

  4. we compose the output in two pieces, this is the space available to each of those pieces+

  5. Do actually have to do any work?

  6. Oh, is this line a bit on the large size?

  7. If we’ve made it this far - which hopefully will be most of the time - the processing is straightforward. Recall

l|unrotate
a test.$This is
is a test.$This
test.$This is a
This is a test.$
<ENDFILE>
                             This is  a test.
                                This  is a test.
                           This is a  test.
                        test.         This is a

The portion of the input string after the fold marker actually appears on the left-hand side of the output, while the portion of the input string before the marker appears on the right-hand side of the output. So, having created a buffer of the appropriate size, we divide the input into those two pieces and copy them into place.

Wrapping

We handle the left hand side of the output first. We actually start at the end of the after fold text (a position known in C++ circles at rbegin()), which we copy into the right-most position in the first half of the output. We then work left, back towards the beginning of the after fold text. If we hit the start of the output buffer as we go, we wrap around to the far end. This, simply and safely, wraps lengthy text without having to do tricky string slicing, or working out leading or trailing padding.

void copy_and_wrap_left(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.first_half_begin;
  for (
    auto ai = input.rbegin();
    ai != input.rend() && output_index != output_length.first_half_end;
    ++ai, --output_index
  ) {
    output_line[output_index] = *ai;
    if (output_index == output_length.first_half_wrap_at)
      output_index = output_length.first_half_wrap_to;
  }
}

We handle the before fold text in similar fashion, except this time we start at the beginning of that text, placing it at the leftmost position in the second half of the output and working to the right. If we hit the end of the buffer, we just wrap around to the beginning.

void copy_and_wrap_right(
  std::string& output_line,
  std::string const& input,
  output_sizes const& output_length
) {
  auto output_index = output_length.second_half_begin;
  for (
    auto bi = input.begin();
    bi != input.end() && output_index != output_length.second_half_end;
    ++bi
  ) {
    output_line[output_index] = *bi;

    ++output_index;
    if (output_index == output_length.second_half_wrap_at)
      output_index = output_length.second_half_wrap_to;
  }
}
Long Lines

That only leaves the vexed question of long lines.

Let’s say we have a folded line that looks like this

long.$This folds at the end and the rest is quite

and we want to unfold it into a output line only 20 characters long. The output should look like

 is quite  long. Thi

By chopping out the excess characters from the middle of the input, the copy and wrap processes we’ve just looked at will be fine.

long.$This folds at the end and the rest is quite
---------                               ---------

Keep the underlined parts, drop the middle to give

long.$Thi is quite

The abrupt cut-off at the end of This falls where the text wraps. This chopping out works when the fold marker is within half the output width of either end. When the fold marker falls somewhere in the middle, we have to be a bit cleverer.

is quite long.$This folds in the middle and the rest

should give an output of

 the rest  is quite

We still want either end of the input, but we will need to insert a new fold marker to ensure the output unfolds properly.

is quite long.$This folds in the middle and the rest
---------                                   --------

The function trim_and_unrotate handles the long line case, applying the appropriate trimming, then calling back into unrotate_line.

std::string trim_and_unrotate(
  std::string const& input_line,
  size_t fold_position,
  output_sizes const& output_length,
  char fold_marker
) {
  auto excess = input_line.length() - output_length.available;

  auto needs_double_trim =
    (fold_position >= output_length.half) &&
    (fold_position <= (input_line.length() - output_length.half));

  auto trimmed = needs_double_trim
    ? input_line.substr(0, output_length.half) +
      fold_marker +
      input_line.substr(input_line.length() - (output_length.half-1))
    : input_line.substr(0, output_length.half) +
      input_line.substr(output_length.half + excess);

  return stiX::unrotate_line(trimmed, output_length.full_width, fold_marker);
}
Thrashing

I feel like I’ve ended up with quite an elegant solution. The functions are short and straightforward, and their conditionals are contained. It did take a little while to get there, especially after my initial incorrect solution, but sometimes that’s the nature of the beast. It’s hard to tell, but I’d guess I’ve spent about 5 hours in total, mainly on the output formatting. Still comfortably within Parnas' week or two. Oh, yea. I still got it.

If you feel so motivated, the git history will give some indication of how I thrashed about on this with partial and incorrect solutions before I finally caught on.

Modularisation

The Parnas paper Cockburn cites above, and which it turns out Kernighan and Plauger also cite in their bibliography for this chapter, isn’t really about KWIC at all. That’s merely a proxy for a discussion about modularisation, and the benefits of two alternative ways of modularising a KWIC program. At first glance, Kernighan and Plauger’s pipeline might appear to be what Parnas' describes as _ approximately the decomposition which would be proposed by most programmers for the task specified_. He was, however, talking about modules within a single program, rather than cooperating programs. His alternative modularisation is based on information hiding rather than tasks. If he asks, we want to change this particular aspect of the program, which of our modules would be touched? Kernighan and Plauger’s solution also practices very strong information hiding, as the processing within each step is opaque to its predecessor and successor steps. I assume Parnas would approved. I wonder if Parnas, Karnighan, and Plauger ever discussed it.

KWIC for the opening section of this article

kwic < article-opening.txt | sort | unrotate

KWIC indexes are long!

Source Code

The full source code, with tests and what-not, for kwic and unrotate, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


1. Well, in my head at least

Tagged code, kwic, and software-tools-in-c++

Friday 23 April 2021 The Forest Road Reader, No 2.62 : STinC++ - unique

STinC++ rewrites the programs in Software Tools in Pascal using C++

One common reason for sorting is to bring together all occurrences of a particular item so they can be treated as a group. Sometimes this is done just to discard all but one occurrence in a group …​ It’s certainly easy to add an option to sort which discards multiple occurrences of an input line.

This is true …​

But should this code be part of sort at all?

Kernighan and Plauger argue, quite strongly, that in this case the answer is no. They don’t quite go to premature optimisation is the route of all evil but they do say

Combining functions too early is a mistake.

They also lean heavily on the possibility of reuse.

By separating the function of stripping duplicates from that of sorting, we can do things which are not possible when they are combined.

I’m often wary of arguments about enabling reuse, but in this case it’s really relevant and what’s more actually likely to happen. We’re building tools, and tools can be combined in ways we can’t necessarily anticipate. By keeping our duplicate stripping functionality separate from sorting, we allow people to do things which might not be possible were they squished together in one package.

unique

PROGRAM

unique delete adjacent duplicate lines

USAGE

unique

FUNCTION

unique writes to its output only the first line from each group of adjacent identical input lines. It is most useful for text that has been sorted to bring identical lines together; in this case it passes through only unique instances of input lines.

EXAMPLE

To eliminate duplicate lines in the output of a program:

program | sort | unique
Stripping duplicates

We’re back in familiar filtering ground here. We’re reading a sequence of lines of text, doing some processing on each line, and then spitting the result back out. The processing here is removing duplicates, and that’s straightforward:

  • Read a line

  • If this line is different from the previous line, print it out

  • Loop back around until there’s nothing more to read

We do have small bootstrapping problem though - what do we do for the very first line? What do we compare the first line we read with? In a language like, say, Java we can lean on null and do something like this

String previousLine = null;

while ( /* there are more lines */ ) {
  String currentLine = readLine(...);

  if (!currentLine.equals(previousLine)) {
    /* print out */
  }

  previousLine = currentLine;
}

In Java, we’re never going to read a null string from the input, so we can safely use that as our initial value. For the first line we read currentLine.equals(previousLine) will never be true, so it gets printed out and the program proceeds properly.

In C++ strings are values rather than references. If I declare a string

std::string str;

then str has a value. The value happens to be the empty string, but it’s still a value. What’s more, it’s perfectly possible that we might read empty strings from our input.

So what to do? Do we need to apply some special processing to the very first line, perhaps using a flag to indicate this is first time through?

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string { };
  bool first_line = true;

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line || first_line)
      out << current_line << "\n";

    previous_line = current_line;
    first_line = false;
  };
}

This works as we want, but that first_line flag really grates on me. It’s a wart.

Filthy but I’m ashamed to say I like it

When we write something like

auto str = std::string { "I am a string" };

we’re reaching back into C++'s deep past. C has no real string type, instead kind of bodging through with arrays of characters. A sequence of characters in double quotes, like "I am a string", is stored by C as an array of characters containing the characters of the string and, crucially, terminated with an extra '\0' character as an end of string marker. This behaviour is baked into the C language and library.

C++ preserves this behaviour and these functions - we couldn’t have C interop without that - and a C++ std::string can happily be initialised with a C style character array. However, a std::string isn’t just a simple wrapper around char*, it’s a little more sophisticated than that. Significantly, once it’s initialised and knows its own length, it does not rely on the '\0' character as a terminator. Even more than that, std::string is happy to contain multiple '\0' within its value.

Generally, when we create a std::string we’re initialising from a char*, copying another std::string, or just bringing up a default empty string. But std::string has a several other constructors, including

std::string(size_type count, char ch);

This constructs a string containing count copies of the character ch. std::string(5, 'a') would be, therefore, "aaaaa". Excitingly std::string(1, '\0') constructs a string containing a single '\0' character. This is something you wouldn’t ever ordinarily encounter, especially when, for example, reading text from an istream.

That’s a rather round-the-houses way of saying we can eliminate the first_line flag using our weird string of '\0'.

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string { 1, '\0' };

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line)
      out << current_line << "\n";

    previous_line = current_line;
  };
}

I can’t quite decide if this is a neat use of an intended feature, or a crime against code. Possibly both. It’s definitely well-formed C++, but it does feel kind of funky. If it moves you, in a pleasant or uncomfortable direction, I’d be delighted to hear from you.

References

  • The many constructors of std::string

  • Back in 1982, AT&T produced a video to promote Unix. A few minutes in Brian Kernighan, his feet up on his desk, demonstrates the power of pipelines by combining various programs, including sort and unique, to build a spell checker. It’s really quite a remarkable demo, and he explains it brilliantly. The whole film, which runs just shy of 30 minutes, is a fascinating historical artefact.

Source Code

The full source code, with tests and what-not, for unique, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed