Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Tuesday 12 November 2019 The Forest Road Reader, No 2.36 : STinC++ - expand

Having presented a simple text compression utility, Kernighan and Plauger’s next program is, unsurprisingly, the companion program to reinflate the compressed text. Expansion is easier than compression - processing a sequence of characters with interleaved occasional encoded runs - and so the code should be straightforward. “Our first impulse is thus to write :”

while (get(c) <> ENDFILE)
    if (c = WARNING)
        n := count implied by next char
        c := getc(c)
        for i := 1 to n
            putc(c)
    else
        putc(c)

They counsel caution though.

For valid input this works fine, but what happens if an ENDFILE is encountered while reading into the count value? What happens if the count is out of range? The program is unprepared for this, and so produces some random number of repetitions of the next character. It could even make additional calls to getc and putc after ENDFILE is encountered.

They mention possible mitigations - adding a shim over getc or modifying getc itself to be safe in the face of ENDFILE, having the program stop immediately on hitting an error state - but discard them. Their preference is explicit error checking and reasonable recovery.

So our working version of expand checks for sensible input after every call on getc. It decompresses if possible, but if not, it prints the input as is.

This strikes me as an application of Postel’s Robustness Principle. The Robustness Principle, also called Postel’s Law, is given in RFC 760 as

In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior.

RFC 760 describes TCP and so is world changing in all kinds of ways, but it’s arguable that if not for Jon Postel’s call to generosity in implementation then TCP would have foundered, certainly that take-up would have been significantly slower than it was. This isn’t to say we should apply the Robustness Principle at all times and in every case - it certainly isn’t a Law in that sense - but in the smaller context of a little text expansion utility, it’s certainly worth considering.

I’m not sure when the Robustness Principle was first named as such (possibly not until RFC 1122 in 1989), but as a point of pedantry Kernighan and Plauger certainly didn’t know that name when they wrote this chapter. RFC 760 was written in 1980, and I don’t believe it was projected back through time to 1976, when Software Tools was published. Still, good advice is good advice no matter when it was said or who said it.

Bearing that in mind, the tests I used to write this program included a number of malformed inputs.

expand.cpp
enum ExpandState {
    passthrough,
    haveMarker,
    haveCount
};

class expander {
public:
    expander(std::ostream& out)
        : out_(out) {
    }
    ~expander() {
        if (state == passthrough)
            return;

        out_ << recoveryString;
        if (state == haveCount)
            out_ << lastChar;
    }

    std::string operator()(char c) {
        lastChar = c;

        switch (state) { (1)
            case passthrough: (2)
                if (ismarker(c)) {
                    state = haveMarker;
                    return empty;
                }
                return std::string(1, c);
            case haveMarker: (3)
                count = decodeCount(c);
                if (count == std::string::npos)
                    return recover(c);
                state = haveCount;
                return empty;
            case haveCount: (4)
                state = passthrough;
                return std::string(count, c);
        }
    }

private:
    bool ismarker(char c) { return c == marker; }

    size_t decodeCount(char c) {
        auto position = countEncoding.find(c);
        return position != std::string::npos
            ? position + 1
            : std::string::npos;
    }

    std::string recover(char c) {
        if (ismarker(c)) // repeated marker
          return recoveryString;

        // bad count
        state = passthrough;
        return recoveryString + c;
    }

    std::ostream& out_;
    ExpandState state = passthrough;
    size_t count;
    char lastChar;

    std::string const countEncoding =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; (5)
    char const marker = '~';
    std::string const recoveryString = std::string(1, marker);
    std::string const empty;
};

namespace stiX {
    void expand(std::istream& in, std::ostream& out) {
        filter(in, out, expander(out));
    }
}
  1. Stateful filters - stateful a lot of things - are often implicit state machines. There are a number of specific actions or operations, with clear triggers and transitions between them. For expand I’ve chosen to make that obvious by explicitly naming the states. Even though I only have three states, identifying and naming them helped me with the implementation and will, I hope, help the readability of the code. Remember, the operator()(char c) is called once for each character in the input, and that ~expander() will be called once the input is exhausted.

  2. The first state is passthrough. Unless we’re encountering the start of an encoded run, spit out what we just received.

  3. Having read the ~ sentinel, the next character encodes the character count. Once the count is secure, transition to haveCount.

  4. Saw the marker, got the count - all that needs to happen now is to output the appropriate number of characters and transition back to passthrough.

  5. As I was annotating this code, I realised these lines are duplicated from compress. They should really have gone into a shared library.

In terms of lines of code, my implementation is rather longer than Kernighan and Plauger’s Pascal version. Our respective versions have the same three states, but they’re not as obvious in the Pascal. While working on this code, I thought I had found two error states that K&P had missed. I was wrong, they had them both covered, but it took quite a close reading of the code for me to realise.

Consider the code above and below with the following inputs - the first two well-formed, the second two not so much.

  • ABCDE

  • ~EA

  • ~~EA

  • ~E

expand.pas
procedure expand;
const
    WARNING = TILDE;  { ~ }
var
    c : character;
    n : integer;
begin
    while (getc(c) <> ENDFILE) do
        if (c <> WARNING) then
            putc(c)
        else if (isupper(getc(c))) then begin
            n := c - ord('A') + 1;
            if (getc(c) <> ENDFILE) then
                for n := n downto 1 do
                    putc(c)
            else begin
                putc(WARNING);
                putc(n - 1 + ord('A'))
            end
        end
        else begin
            putc(WARNING);
            if (c <> ENDFILE) then
                putc(c)
        end
end;

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. expand is program 4 of Chapter 2.

Out and about

If you’re enjoying my little odyssey through code past, then you might well be interested in a talk Robin Williams has coming up ACCU Oxford in a couple of weeks.

In their classic 1984 book, Kernighan & Pike describe the development of 'hoc'. Over the course of a chapter, hoc grows from a four-function calculator into a fairly complete scripting language, with variables, control logic and procedures. This pattern can be applied to use cases ranging from flexible control file input to compiler front-ends.

In this talk, we will revisit the development of hoc from the perspective of 2019. While Kernighan & Pike’s emphasis on the value of incremental development remains valid, other aspects of the presentation have not aged as well. Their book pre-dates Test Driven Development, with large increments by modern standards (including a ground-up re-write) and free use of global variables. By now, the STL provides the required data structures and safe memory management.

We’ll start by applying TDD to the initial development of the four-function calculator, and review the code which results. I’ll then describe one onward route, circumventing the K&P’s rewrite and applying Parsing Expression Grammars to allow for fully dynamic flexibility in the language represented.

It sounds so much up my street I’m going to do everything I can to be there. You can sign up here and maybe I’ll see you there.

Endnotes

This whole endeavour relies Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print, but new copies are, frankly, just ridiculously expensive. Happily, here are plenty of second-hand copies floating round, or you can borrow them from the The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software generalist
software created
extended or repaired

Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About