Thursday 28 November 2019 The Forest Road Reader, No 2.40 : STinC++ - Chapter Two Wrap Up

STinC++ rewrites the programs in Software Tools in Pascal using C++

  • entab - replacing spaces with tabs

  • overstrike - crazy old school print preprocessing

  • compress - simple run length encoding

  • expand - and simple run length decoding

  • echo - command line argument handling

  • translit - a positive hydra of character substitution and deletion

Each program in Chapter One was progressively a little more complex than the one before. This is not the case in Chapter Two, where the various programs form more of a loose map of the territory. Here we have a program that replaces many characters with a single character, while over there we have a program that might produce many lines of output for a single line of input, and so on.

This chapter is, without directly saying so, really about the Unix philosophy. While Unix systems (or at least Unix-ish systems) are basically the substrate of most modern computing (yes, even if you’re a Windows, Android, or iOS user), in 1981, when Software Tools in Pascal was published, there were “800 systems [already] in use outside [Bell Labs]”. Kernighan and Plauger have to take time to describe what’s now commonplace - combining small programs with pipelines, text as common currency, indeed the whole idea of software tools and of the programmer as tool user.

When I first read Software Tools in Pascal, I’d just started my first substantial job on a Unix system (in fact four Unix systems, all slightly different) and I was only passingly familiar with a handful of command line tools. The idea that these small programs not only did simple things, but could be combined like some kind of giant tokusatsu robot to do very powerful things was really quite startling. It certainly changed, and continues to influence, how I think about programming, and I have no doubt I’m a better programmer for it. I’m not sure I’ll ever touch the heights achieved by my friend Thomas in his masterful He Sells Shell Scripts To Intersect Sets, in which he combines standard Unix tools to do full-on computer science, but I’m nearer than I otherwise would be.

Next up, files!

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print but are ruinously expensive. Happily, there are plenty of second-hand editions floating round.

For this project I’ve been using JetBrains’ CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++

Tuesday 26 November 2019 The Forest Road Reader, No 2.39 : STinC++ - translit

STinC++ rewrites the programs in Software Tools in Pascal using C++

After the relative calm of echo, it all kicks off (or, if you prefer, all comes together) in translit, the finale of Chapter Two. As its name suggests, translit transliterates certain characters on their way through. “We would like”, say Kernighan and Plauger, “to have a program translit so that we can write”

  translit x y

“and have all occurrences of x in the standard input be replaced by y on the standard output.”

Ah, so this is why we took that detour into command line arguments. Given what we’ve been through to get here, though, this task is pretty trivial. But hold on, they’re just getting started.

“Multiple translations are also handy:”

  translit xyx XYZ

“would change all lower case x, y and z to the corresponding upper case letter.”

Ok, that’s slightly more involved but not particularly tricky.

“It would be nice”, they continue breezily, “to have a shorthand for alphabets, so”

  translit a-z A-Z

“would translate all lower case letters to upper case.”

All right, this still isn’t too difficult. We can deal with the arguments easily enough, and then the transliteration is routine

    std::string replace ...
    std::string with ...

    std::transform(
        std::istreambuf_iterator<char>(in),
        std::istreambuf_iterator<char>(),
        std::ostreambuf_iterator<char>(out),
        [&replace, &with](char c) {
            auto replacement = replace.find(c);
            return replacement != std::string::npos
                ? with[replacement]
                : c;
        }
    );

We don’t even need to write a function - we do it all inline with a simple lambda. This seems a touch anticlimactic. But Kernighan and Plauger haven’t finished. Not by a long shot. “There are times when we would like to translate a whole class of characters into just one character, and even to squash runs of that translation into just one instance … We can specify this squashing operation by giving a second argument that is shorter than the first … we could also take [the case when the second argument is not given] as a request to delete all occurrences of characters in the first string …” Easy there, fellows. “… One final capability is worth adding: sometimes we would like to be able to translate and compress all but a set of characters. For instance”

  translit ^a-z -

“would replace strings of non-letters with a dash. And”

  translit ^a-z

“would delete all but lower case letters.”

Wait, wait, wait.

When I write my C++ versions of the programs in Software Tools, I don’t translate from the Pascal. Of course, I do read the code along with the commentary, and then write a series of unit tests to capture the behaviour. Leaving aside argument expansion, for translit I finished up with five distinct sets of tests (one of which is sketched just after this list) -

  • replacement - where the two inputs, the from-set and the to-set, are the same size

  • replacement and squashing - where the from-set is longer than the to-set

  • deletion - where everything in the from-set is removed

  • negated squashing - where everything not in the from-set is matched to the last item in the to-set

  • negated deletion - everything not in the from-set is removed
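
To make that concrete, here’s roughly what one of the replacement tests looks like. This is a sketch assuming Catch’s single header and the stiX::translit signature shown further down, rather than the exact test from the repository.

    TEST_CASE("same size from-set and to-set gives straight replacement") {
        std::istringstream in("xyzzy");
        std::ostringstream out;

        stiX::translit("xyz", "XYZ", stiX::Selection::Normal, in, out);

        REQUIRE(out.str() == "XYZZY");
    }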

Obviously the two squashing cases and the two deletion cases are pretty similar, but that still left me, as far as I was concerned, with three distinct operations. Kernighan and Plauger implement them in one loop

extract from translit.pas
    lastto := length(toset);
    squash := (length(fromset) > lastto) or (allbut);
    repeat
        i := xindex(fromset, getc(c), allbut, lastto);
        if (squash) and (i>=lastto) and (lastto>0) then begin
            putc(toset[lastto]);
            repeat
                i := xindex(fromset, getc(c), allbut, lastto);
            until (i < lastto)
        end;
        if (c <> ENDFILE) then begin
            if (i > 0) and (lastto > 0) then  { translate }
                putc(toset[i])
            else if (i = 0) then { copy }
                putc(c)
            { else delete }
        end
    until (c = ENDFILE)

As you read this section of the book, you see the program grow a little bit at a time from an initial half a dozen lines to the code above, along with all the preamble to handle the command line arguments, the inner works of xindex and all the rest of it. This piece of code isn’t difficult at all, because you know it inside out. Coming to it cold, it’s pretty daunting. There are two different flags, nested loops, an if-ladder nested within another if, and assignment through out parameters. It’s compact and dense and not entirely straightforward.

Partly because I was starting from a different place, partly because I wanted to keep the std::transform model I’d adopted for the preceding filters, and partly because I wanted to hoist as many of the conditions as possible up and away from the main body of the code, I finished up with something rather different

extract from translit.cpp
    void stiX::translit(
        const std::string& replace,
        const std::string& with,
        stiX::Selection selectionMode,
        std::istream& in,
        std::ostream& out
    ) {
        auto matcher = (selectionMode == stiX::Selection::Normal)
            ? in_range
            : out_of_range;

        if (with.empty())
            removeCharacters(replace, matcher, in, out);
        else if (replace.size() > with.size())
            transliterateWithSquash(replace, with, matcher, in, out);
        else
            transliterate(replace, with, in, out);
    }

The argument expansion populates the replace and with strings. It also sets selectionMode based on the presence of a leading ^ character. Not shown here are the implementations of the in_range and out_of_range functions, which determine whether a particular character is or is not within a given string. The variable matcher then is a function pointer, and I implement the normal or negated search by pointing it to the appropriate function.
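
They are simple enough to sketch, though. Assuming, as the code above implies, that they’re given the result of a std::string::find call, they look something like this - a sketch, not the repository’s exact code.

    using Matcher = bool (*)(std::string::size_type);

    // true when find() located the character in the string
    bool in_range(std::string::size_type position) {
        return position != std::string::npos;
    }

    // true when find() did not locate the character
    bool out_of_range(std::string::size_type position) {
        return position == std::string::npos;
    }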

The if-ladder then determines which mode translit has been invoked in - deletion, squashing, or simple replacement. In the Pascal implementation, this if-ladder is at the heart of the read loop, traversed for every character. I’ve hoisted it right up and out - we decide here, then never have to worry about it again.

The three modes are then implemented in three separate functions. This turned out pretty well. Two of the cases - replacing and removing - can be implemented using functions provided by the standard library. Replacement uses std::transform, as shown above, while removing uses the catchily named std::remove_copy_if.

extract from translit.cpp
    void removeCharacters(
        const std::string& remove,
        Matcher matcher,
        std::istream& in, std::ostream& out
    ) {
        std::remove_copy_if(
            std::istreambuf_iterator(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator(out),
            [&remove, &matcher](char c) {
                return matcher(remove.find(c));
            }
        );
    }

std::remove_copy_if copies the elements in the range, except those elements for which the predicate returns true. In this case, the predicate is a little lambda wrapped around the matcher function pointer which we set up earlier.
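
A hypothetical invocation, using the in_range matcher sketched above, shows the effect:

    // delete all the vowels from the input
    std::istringstream in("software tools");
    removeCharacters("aeiou", in_range, in, std::cout);
    // writes: sftwr tls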

That leaves squashing. The implementation is similar to that of entab. When entab encounters a space in the input, it switches into entabbing mode until it encounters a non-space or hits a tab stop. Similarly, when we encounter a character in our from range, we flip into squash mode until we see a character that isn’t in the range. There’s a bit of housekeeping to do then, but it’s straightforward enough.

extract from translit.cpp
    std::string operator()(char c) {
        auto replacement = replace_.find(c);

        if (!matcher(replacement)) {
            squashing_ = false;
            return std::string(1, c);
        } // nothing to do

        if (
            (replacement >= boundary_)
            && squashing_
        ) {
            return std::string();
        } // already squashing - do nothing

        squashing_ = (replacement >= boundary_);
        return std::string(
            1,
            squashing_ ? squash_ : with_[replacement]
        );
    } // operator
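
The extract doesn’t show the class declaration, so squasher here is a hypothetical name, but the functor is driven by the same transform-style loop used throughout this chapter - call it for every input character, and write out whatever string it hands back.

    std::transform(
        std::istreambuf_iterator<char>(in),
        std::istreambuf_iterator<char>(),
        std::ostream_iterator<std::string>(out),
        squasher(replace, with, matcher)
    );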

I Don’t Know About You, But I’m Exhausted

translit was, by a distance, the most difficult program so far. Even with a reference program to read, I found it hard work. For a while, I doubted my dogmatic approach to filters, to sticking to a core built around std::transform, but I think in the end it paid off.

Kernighan and Plauger acknowledge that translit is based on “a program written originally by M.D. McIlroy” (that program is tr and M.D. McIlroy is better known as Doug), and there’s an argument to be made that translit suffers a little from feature envy, that it pulls in too much, and that its various modes of operation are not quite as overlapping as they’re presented. On the other hand, translit does bring together the various different types of filters examined in the chapter and uses them in a single program.

By splitting out those different modes into distinct functions, I feel the code is clearer and simpler, and the cognitive load on the reader is reduced. The simple cases are simple, and the complicated case is made simpler because it’s not competing with the other two. And should Kernighan and Plauger produce a third revision of Software Tools adding another mode of operation to translit (after all, tr has four!), I’ll be in a good place to safely implement it.

Source Code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. translit is program 6 of Chapter 2.


Tagged code, and software-tools-in-c++

Thursday 21 November 2019 The Forest Road Reader, No 2.38 : Winter/Spring Tour 2020

Somewhat to my surprise, because after last year’s epic stint on the road I was going to take it easy, the JezUK Winter/Spring Tour is back in motion.

  • Wednesday November 27th: The Very Slow Time Machine
    Worcester Source meeting at The Paul Pry Pub
    Reprising an old favourite one last time.

  • February 26th - 28th: A Mouthful Of C++
    nor(DEV):con in Norwich
    A short, probably quite intense, C++ taster workshop for people who already program in another language. This is something of a new thing, and I’ve been fretting about it since I agreed to it back in September and won’t stop until it’s done and delivered.

  • Tuesday March 17th: A Software Tool in Many Languages
    BrumJS Meetup at BlackCat Technology Solutions’s office in Colmore Square
    Riffing off Software Tools in C++, I’m going to take one of Kernighan and Plauger’s tools and reimplement it in four different languages - C++, JavaScript, probably some kind of Lisp, and another one. At least, that’s my current plan.

If you run some kind of programmers' get-together and any of the above is of interest to you, do get in touch and let’s see if we can work something out.


Tagged software-tools-in-c++, on-tour, nordev, nordevcon, and brum.js

Tuesday 19 November 2019 The Forest Road Reader, No 2.37 : STinC++ - echo

Chapter 2 of Software Tools in Pascal is entitled Filters, and thus far it’s been absolutely focused on processing streams of text. We’ve substituted tabs for spaces, messed around with printer pre-processing, compressed and expanded text files. Where on earth could we be going next? I’m going to bet you wouldn’t expect something that’s very decidedly not a filter in any shape or form.

PROGRAM

echo - echo arguments to output

USAGE

echo [ argument …​ ]

FUNCTION

echo copies its command line arguments to its output as a line of text with one space between each argument. If there are no arguments, no output is produced.

EXAMPLE

echo hello world!
hello world!

Weird huh?

In many ways, this is a little chill-out program - a light palate cleanser ahead of the big meaty one coming up next. It also allows Kernighan and Plauger to go off for a page on Pascal’s, or at least standard Pascal’s, weaknesses as a proper programming language. They don’t put it in quite those terms, but they bemoan its lack of provision for command line argument handling and, more seriously, the weaknesses of its type system, particularly its lack of a dedicated string type. They work around the string type problem by using a fixed length array with an end-of-string sentinel, and “pay the price of wasted storage”. They are rather more coy on how they solved the command line argument problem, other than to say they’ve defined a primitive

function getarg(n : integer; var charstr : string; maxsize : integer) : boolean

which can grab the nth command line argument. echo then is a few lines to demonstrate getarg and their string type.

This is one of the few places where Software Tools in Pascal differs in any significant way from Software Tools. As far as I’m aware neither FORTRAN nor PL/1 had command line argument handling or a string type either, but the corresponding place in chapter 2 of Software Tools instead presents a trivial text encryption program. That program also uses a getarg primitive of exactly the same sort, but it goes entirely unremarked.

C++ provides ready access to command line arguments and (setting aside the char* legacy) offers a serviceable string type, and so echo offers little challenge.

echo.cpp
    void echo(int argc, char const* argv[], std::ostream &out) {
        auto arguments = make_arguments(argc, argv); (1)
        join(
            arguments,
            std::ostream_iterator<std::string>(out)
        ); (2)
    }
  1. While char const* argv[] has a certain retro charm, I like my data to be in a nice container that knows things like its own size. make_arguments simply shuffles the contents of argv into a std::vector<std::string>. C++ can infer the type of the variable arguments for me, hence I can declare it as auto. At compile time, the compiler will cross out auto and write in the std::vector<std::string> for me.

  2. And now we can pour those arguments, interspersed with spaces, down the out stream. Job done. Sorry, pardon? join you say?
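
make_arguments itself isn’t shown here. It’s only a one-liner though - something like this sketch, assuming the program name in argv[0] is dropped here rather than by the caller.

    std::vector<std::string> make_arguments(int argc, char const* argv[]) {
        // skip past the program name in argv[0]
        return std::vector<std::string>(argv + 1, argv + argc);
    }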

A post-it note reading “Do The Simplest Thing That Could Possibly Work”, “Refactor Mercilessly”, “Remember You Aren’t Going To Need It”, “Once And Only Once”

In spite of having “Do The Simplest Thing That Could Possibly Work” taped to the side of my monitor (and I really do try to hold true to it), I chose to set that aside and indulge in a little standard library style nonsense.

join
    template<typename InputIt, typename OutputIt, typename Separator = std::string>
    OutputIt join(InputIt first, InputIt last, OutputIt dest, Separator sep = " ") {
        if (first != last)
            *dest++ = *first++;
        while (first != last) {
            *dest++ = sep;
            *dest++ = *first++;
        }
        return dest;
    }

You’re probably familiar with the general shape of join - handle the initial value in the sequence, then loop through the rest, prefixing each with the separator. The declaration and return type follow the general form of the standard library. It is worth taking a moment to consider just how much is going on in

            *dest++ = *first++;

I’m not sufficiently down in the details to be sure if there are four or five (or some other number) of separate operations occurring here. It’s something along the lines of

  • dereferencing the input iterator first, *first, yields a value.

  • first++ advances the input iterator to the next value in the sequence.

  • assigning to *dest writes a value into the output iterator. In the case of echo that’s to standard output, but it could equally be another container, a file, or almost anything else.

  • dest++ advances the write position.
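
As a quick usage example (a hypothetical snippet, not code from the repository):

    std::vector<std::string> words{ "hello", "world!" };
    join(words.begin(), words.end(),
         std::ostream_iterator<std::string>(std::cout));
    // prints: hello world!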

After many years, many years of waiting, the standard library is finally catching up with the fact that most of the time the sequence we’re operating over is already neatly wrapped up in a container. Having to spell out mycontainer.begin() and mycontainer.end() is a bit redundant and it would be nice if we could just lob the whole shebang over in one go. In preparation for C++20, then, I put together a range version of join. (I’m being overly flippant. This is the absolute least of what’s coming in ranges, which looks to be an important and exciting addition to the library.)

join on a range
    template<typename InputRange, typename OutputIt, typename Separator = std::string>
    OutputIt join(InputRange&& range, OutputIt out, Separator sep = " ") {
      return join(std::begin(range), std::end(range), out, sep);
    }

This is, I believe, the more-or-less canonical form of a range adapter function. Jonathan Boccara discusses why passing the range with a forwarding reference, InputRange&&, is the preferred form in Algorithms on Ranges. It’s a subtle business, and I’m happy to pass the heavy thinking off to someone else.

Source Code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. echo is program 5 of Chapter 2.

An old-school shell hack on a line printer describes hooking up an old printer to a Linux box as a proper old timey TTY - an actual teletypewriter. The screen you’re reading this on? It’s a pseudo-TTY - it’s just pretending to be a printer while this chap’s got the real deal just like Grandad used to have 🙂.


Tagged code, and software-tools-in-c++

Tuesday 12 November 2019 The Forest Road Reader, No 2.36 : STinC++ - expand

Having presented a simple text compression utility, Kernighan and Plauger’s next program is, unsurprisingly, the companion program to reinflate the compressed text. Expansion is easier than compression - processing a sequence of characters with occasional encoded runs interleaved - and so the code should be straightforward. “Our first impulse is thus to write:”

while (getc(c) <> ENDFILE)
    if (c = WARNING)
        n := count implied by next char
        c := getc(c)
        for i := 1 to n
            putc(c)
    else
        putc(c)

They counsel caution though.

For valid input this works fine, but what happens if an ENDFILE is encountered while reading into the count value? What happens if the count is out of range? The program is unprepared for this, and so produces some random number of repetitions of the next character. It could even make additional calls to getc and putc after ENDFILE is encountered.

They mention possible mitigations - adding a shim over getc or modifying getc itself to be safe in the face of ENDFILE, having the program stop immediately on hitting an error state - but discard them. Their preference is explicit error checking and reasonable recovery.

So our working version of expand checks for sensible input after every call on getc. It decompresses if possible, but if not, it prints the input as is.

This strikes me as an application of Postel’s Robustness Principle. The Robustness Principle, also called Postel’s Law, is given in RFC 760 as

In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior.

RFC 760 describes the Internet Protocol and so is world changing in all kinds of ways, but it’s arguable that if not for Jon Postel’s call to generosity in implementation then TCP/IP would have foundered, and certainly that take-up would have been significantly slower than it was. This isn’t to say we should apply the Robustness Principle at all times and in every case - it certainly isn’t a Law in that sense - but in the smaller context of a little text expansion utility, it’s certainly worth considering.

I’m not sure when the Robustness Principle was first named as such (possibly not until RFC 1122 in 1989), but as a point of pedantry Kernighan and Plauger certainly didn’t know that name when they wrote this chapter. RFC 760 was written in 1980, and I don’t believe it was projected back through time to 1976, when Software Tools was published. Still, good advice is good advice no matter when it was said or who said it.

Bearing that in mind, the tests I used to write this program included a number of malformed inputs.
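
For instance, something along these lines - a sketch assuming Catch and the stiX::expand signature shown below, not the repository’s exact tests. The second case feeds in a marker and count with nothing following, and expects the input echoed back untouched.

    TEST_CASE("well-formed run is expanded") {
        std::istringstream in("~EA");
        std::ostringstream out;

        stiX::expand(in, out);

        REQUIRE(out.str() == "AAAAA");
    }

    TEST_CASE("marker and count at end of input are passed through") {
        std::istringstream in("~E");
        std::ostringstream out;

        stiX::expand(in, out);

        REQUIRE(out.str() == "~E");
    }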

expand.cpp
enum ExpandState {
    passthrough,
    haveMarker,
    haveCount
};

class expander {
public:
    expander(std::ostream& out)
        : out_(out) {
    }
    ~expander() {
        if (state == passthrough)
            return;

        out_ << recoveryString;
        if (state == haveCount)
            out_ << lastChar;
    }

    std::string operator()(char c) {
        lastChar = c;

        switch (state) { (1)
            case passthrough: (2)
                if (ismarker(c)) {
                    state = haveMarker;
                    return empty;
                }
                return std::string(1, c);
            case haveMarker: (3)
                count = decodeCount(c);
                if (count == std::string::npos)
                    return recover(c);
                state = haveCount;
                return empty;
            case haveCount: (4)
                state = passthrough;
                return std::string(count, c);
        }
    }

private:
    bool ismarker(char c) { return c == marker; }

    size_t decodeCount(char c) {
        auto position = countEncoding.find(c);
        return position != std::string::npos
            ? position + 1
            : std::string::npos;
    }

    std::string recover(char c) {
        if (ismarker(c)) // repeated marker
          return recoveryString;

        // bad count
        state = passthrough;
        return recoveryString + c;
    }

    std::ostream& out_;
    ExpandState state = passthrough;
    size_t count;
    char lastChar;

    std::string const countEncoding =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; (5)
    char const marker = '~';
    std::string const recoveryString = std::string(1, marker);
    std::string const empty;
};

namespace stiX {
    void expand(std::istream& in, std::ostream& out) {
        filter(in, out, expander(out));
    }
}
  1. Stateful filters - stateful anything, really - are often implicit state machines. There are a number of specific actions or operations, with clear triggers and transitions between them. For expand I’ve chosen to make that obvious by explicitly naming the states. Even though I only have three states, identifying and naming them helped me with the implementation and will, I hope, help the readability of the code. Remember that operator()(char c) is called once for each character in the input, and that ~expander() will be called once the input is exhausted.

  2. The first state is passthrough. Unless we’re encountering the start of an encoded run, spit out what we just received.

  3. Having read the ~ sentinel, we expect the next character to encode the character count. Once the count is secure, transition to haveCount.

  4. Saw the marker, got the count - all that needs to happen now is to output the appropriate number of characters and transition back to passthrough.

  5. As I was annotating this code, I realised these lines are duplicated from compress. They should really have gone into a shared library.

In terms of lines of code, my implementation is rather longer than Kernighan and Plauger’s Pascal version. Our respective versions have the same three states, but they’re not as obvious in the Pascal. While working on this code, I thought I had found two error states that K&P had missed. I was wrong - they had them both covered, but it took quite a close reading of the code for me to realise.

Consider the code above and below with the following inputs - the first two well-formed, the second two not so much.

  • ABCDE

  • ~EA

  • ~~EA

  • ~E

expand.pas
procedure expand;
const
    WARNING = TILDE;  { ~ }
var
    c : character;
    n : integer;
begin
    while (getc(c) <> ENDFILE) do
        if (c <> WARNING) then
            putc(c)
        else if (isupper(getc(c))) then begin
            n := c - ord('A') + 1;
            if (getc(c) <> ENDFILE) then
                for n := n downto 1 do
                    putc(c)
            else begin
                putc(WARNING);
                putc(n - 1 + ord('A'))
            end
        end
        else begin
            putc(WARNING);
            if (c <> ENDFILE) then
                putc(c)
        end
end;

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. expand is program 4 of Chapter 2.

Out and about

If you’re enjoying my little odyssey through code past, then you might well be interested in a talk Robin Williams has coming up at ACCU Oxford in a couple of weeks.

In their classic 1984 book, Kernighan & Pike describe the development of 'hoc'. Over the course of a chapter, hoc grows from a four-function calculator into a fairly complete scripting language, with variables, control logic and procedures. This pattern can be applied to use cases ranging from flexible control file input to compiler front-ends.

In this talk, we will revisit the development of hoc from the perspective of 2019. While Kernighan & Pike’s emphasis on the value of incremental development remains valid, other aspects of the presentation have not aged as well. Their book pre-dates Test Driven Development, with large increments by modern standards (including a ground-up re-write) and free use of global variables. By now, the STL provides the required data structures and safe memory management.

We’ll start by applying TDD to the initial development of the four-function calculator, and review the code which results. I’ll then describe one onward route, circumventing K&P’s rewrite and applying Parsing Expression Grammars to allow for fully dynamic flexibility in the language represented.

It sounds so much up my street I’m going to do everything I can to be there. You can sign up here and maybe I’ll see you there.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print, but new copies are, frankly, just ridiculously expensive. Happily, there are plenty of second-hand copies floating round, or you can borrow them from the Internet Archive using the links above.

For this project I’ve been using JetBrains’ CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++