Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Sunday 03 November 2019 The Forest Road Read, No 2.35 : STinC++ - compress

PROGRAM

compress compress input by encoding repeated characters

USAGE

compress

FUNCTION

compress copies its input to its output, replacing strings of four or more identical characters by a code sequence so that the output generally contains fewer characters that the input. A run of x’s is encoded as ~nx, where the count n is a character: 'A' calls for a repetition of one x, 'B' a repetition of two x’s, and so on. Runs longer than 26 are broken into several shorter ones. Runs of ~'s of any length are encoded.

EXAMPLE

compress
Item    Name         Value
Item~D Name~I Value
1       car          ~$7,000.00
1~G car~J ~A~$7,000.00
<ENDFILE>

BUGS

The implementation assumes 26 legal characters beginning with an A.

Implementation

compress.cpp
#include "compress.h"
#include "../../lib/filter.h"

class compressor {
public:
  compressor(std::ostream& out)
    : out_(out) {
  }
  ~compressor() {
    out_ << repeated(repeat);
  }

  std::string operator()(char c) {
    if (lastChar == c) { (1)
      ++repeat;
      return empty;
    }

    std::string output = repeated(repeat); (2)
    repeat = 0;

    lastChar = c;

    return output;
  }

private:
  std::string repeated(size_t count) {
    if (!lastChar)
      return empty;

    if (count > maxCount) { (3)
      return repeated(maxCount)
          + repeated(count - (maxCount+1));
    }
    if ((count < 3) && (lastChar != marker)) (4)
      return std::string(count+1, lastChar);

    std::string r; (5)
    r += marker;
    r += countEncoding[count];
    r += lastChar;
    return r;
  }

  std::string const countEncoding =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  size_t const maxCount = countEncoding.size() - 1;
  std::string const empty;
  char const marker = '~';

  std::ostream& out_;
  char lastChar = 0;
  size_t repeat = 0;
};

namespace stiX {
  void compress(std::istream &in, std::ostream &out) {
    filter(in, out, compressor(out));
  }
}

The general shape of this code is probably pretty familiar now. It reads the input a character at a time, and spits outs a string in response - the string might be empty, a single character, or several characters.

Given the description in the little man page above, I hope the implementation is straightforward.

  1. If the newly arrived character is the same as the previous character, well, that’s a repetition, so keep a note.

  2. If the new character is different to the previous, then I need to output what ever the previous character was. The repeat member function handles that.

  3. If the number of repetitions is more than the encoding scheme accommodates, encode the longest run possible then recurse to deal with the remaining repetitions.

  4. If the encoded output would be longer than the original - ie there were only one, two, or three repetitions - simply output the original characters. For any kind of normal text, this is the most common case.

  5. After all that, when we reach here there must be encoding a repetition of at least length four.

Kernighan and Plauger’s compress isn’t, by their own lengthy admission, particularly effective, and perhaps the window in which it was ever of use has passed. It is a nice little introduction to run-length encoding schemes though. One feature streaming compression schemes like RLE is that you can’t start to generate output until you’ve consumed at least some of the input. In this case, compress needs to see at least two input characters before any output can be generated. If fed a stream of, say, X after X after X, it will generate no output at all until a different character arrives. There is, of necessity, a certain amount of delay, a degree of implicit buffering. When the end of the input is reached, therefore, there may be a little bit of remaining output that has to be flushed. This is why, for the first time, my function object has a constructor and destructor. It’s great when that happens. Makes me feel like I do have some idea of what I’m doing.

The line

    filter(in, out, compressor(out));

creates a temporary compressor object. It’s constructed immediately before the call to the filter function, and destroyed immediately after the call to filter completes. (In reality, it’s slightly more complex than that, but that’s the essence of it.) The destructor is, therefore, the place to do any clean up needed at the end of the processing, in this case a final call to repeated to flush the last bit of output.

Deterministic destruction is much overlooked feature of C++, at least by those less familiar with the language, and this is an example of quite how powerful they are. In this example, the compressor constructor and destructor provide customisation points either side of the call to filter. I’ve been able to use my existing library code, and my existing approach, but extend it to include a new piece of functionality. Without that ability, I’d have had to discard my existing approach and find something new. When I started working on https://www.jezuk.co.uk/blog/2019/09/sticpp-detab.html`detab` back in August I wasn’t looking ahead to what I might need in the future, so this isn’t the result of some long term plan. It was just there when I needed it. My tests were telling me the last bit of output was missing, and the destructor was just the thing to solve the problem.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. compress is program 3 of Chapter 2.

Library Reference

In a similiar vein

In a similar spirit to this undertaking, here’s a rapid-fire screencast reimplementing the BSD’s cat command in 1994-style C.


Tagged code, and software-tools-in-c++

Friday 25 October 2019 The Forest Road Reader, No 2.34 : Well, I guess you had to be there

A programmer and a systems analyst are going hunting in the woods together. After several hours of rugged hiking, they arrive at the cabin where they’re staying.

The systems analyst say to the programmer "You stay here while I go and find a bear", and then heads off into the trees.

The programmer sits down on a tree stump to wait. Some time later the programmer hears a commotion. Looking up he sees a bear approaching at high speed, chased by the systems analyst.

"Open the door!" yells the systems analyst, "Open the door!".

The programmer opens the cabin door, the systems analyst chases the bear inside, then slams the door shut.

"Right," says the systems analyst, "you kill it and skin it, and I’ll go and find another bear."

I read this joke in "Jack Schofield’s Computer Joke Book" column in, I think, Personal Computer World in 1984-85 while I was still at school. At the time it made absolutely no sense to me at all, until I actually met a real-life systems analyst at a open evening at the local college.

During our conversation he made programming sound like the most tedious and boring job in the world. By his telling, systems analysts did all the thinking and problem solving, then the programmers typed out the answer.

To kids like us, who sat up all night programming out Spectrums (or BBC Micros if you were a posh kid), that made no sense. That bloke, who ever he was, is partially responsible for me doing electronics engineering at college and university instead.

Still, worked out ok in the end.

Thursday 24 October 2019 The Forest Road Reader, No 2.33 : STinC++ - overstrike

Printing looms large in Brian Kernighan’s professional life and so it’s no surprise our next software tool, overstrike is, like entab, a print preprocessor.

You can overstrike characters on a typewriter be backspacing over what is already typed. This is how you underline words, for one thing; …​ If you send your output to a line printer, however, the result may be a hash, because a typical printer doesn’t know what to do with backspace characters.

Many printers do, however, provide for overstriking entire lines. [The convention] is to provide an extra carriage control character at the beginning of each line: a blank means "space before printing," and a plus sign means "do not space before printing," i.e. overstrike what has gone before.

The overstrike filter looks for backspaces in the incoming text and generates a series of lines with carriage control to reproduce the effect of the backspaces. Given, say,

Let's underline "Hello World!^H^H^H^H^H^H^H^H^H^H^H^H____________"

the filter will produce

 Let's underline "Hello World!
+                 ____________"

As Kernighan and Plauger note, this is not the only way to do it, but it is one of the least complicated. They continue

It is often better to get on with something that does most of the job well enough, then improve and add things as they prove to be worthwhile.

Did someone just say MVP?

A line printer prints a whole line of output at time, hence why they’re not much interested in backspace characters and why overstrike is looking to generate a multiple lines instead. Line printers were high-end bits of kit and pretty quick - some could bash out over 1000 lines a minute which makes that little inkjet in the corner look positively pedestrian. If you’re of a similar vintage to me, the existence of line printers is why your DOS printer port was called LPT1.

For younger readers, ^H is how a backspace character would sometimes display on a terminal that didn’t know any better. ^ means the Ctrl key while H is, well, H and hitting Ctrl and H generated ASCII code 08 aka Backspace. ASCII was the cool kids' character encoding scheme of choice back in 1981. If you use some variety of Unix derived operating system, there’s a high likelihood Ctrl+H will work as Backspace in your terminal window. Go on, try it.

The implementation

Because a single character on the input might produce no output, one character, or many characters of output, for the overstrike implementation I used the same skeleton as detab and entab - using std::transform to read char from the input stream and write std::stream the output.

overstrike.cpp
class overstriker {
public:
    std::string operator()(char c) {
        if (stiX::isbackspace(c)) { (1)
                ++backspaced_;
                return empty;
        }

        std::string output;

        if (backspaced_) {
            position_ = (backspaced_ < position_) ? position_ - backspaced_ : 0; (2)
            output += noskip;  (3)
            output += std::string(position_, ' '); (4)
        } else if (position_ == 0) (5)
            output += skip;

        output += c; (6)

        if (stiX::isnewline(c)) (7)
            position_ = 0;
        else
            ++position_;
        backspaced_ = 0;

        return output; (8)
    }
private:
    std::string const empty;
    std::string const skip = " ";
    std::string const noskip = "\n+";

    size_t position_;
    size_t backspaced_;
};

namespace stiX {
    void overstrike(std::istream &in, std::ostream &out) {
        filter(in, out, overstriker());
    }
}
  1. Is that a backspace? If so, make a note, chuck it away, and we’re done for now.

  2. Wind back the position_ counter for as many backspaces as there were. For well-formed input, the conditional part of this expression guarding against excessive backspacing shouldn’t be necessary but Kernighan and Plauger quietly implement Postel’s Law (as it wasn’t yet known) throughout the book.

  3. We’ve been backspacing, so prefix the line with the + control character.

  4. Pad out the line with appropriate number of spaces. Obviously this only works if the printer is monospaced, but that was absolutely the case when Software Tools In Pascal was written.

  5. If we haven’t been backspacing, are we perhaps at the start of normal row?

  6. Having setup the output with whatever prefix (possibly none) we need, actually output the character we started with.

  7. This is familiar from detab and entab. If we’re at the end of the current line, then reset the counter.

  8. Boom! We’re done. Phew.

This was quite a fun little program to do, although sometimes tricky to visualise the output - especially in the presence of more than one bout of backspacing. It TDDed out very nicely though. I’m aware that I’m in a bit of filter groove at the moment, and am well prepared to think about the kind of little state machines that are implicit in this sort of processing. I do think, though, that what I’ve come up with is a tidy little solution. If you think otherwise (or even if you don’t), I’d be delighted to hear from you. Feel free to email me or get in touch on Twitter.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. overstrike is program 2 of Chapter 2.

Wait, I arrived late! What’s going on?

Software Tools In Pascal by Brian Kernighan and PJ Plauger is a book that I love. I’m working through the book, reimplementing their tools in C++.


Tagged code, and software-tools-in-c++

Thursday 10 October 2019 The Forest Road Reader, No 2.32: Typesetting With Brian W Kernighan

Brian Kernighan has been typesetting his own books for well over forty years. Herein an almost complete history.

This book was set in Times Roman and News Gothic Condensed by the authors, using a Graphic Systems phototypesetter driven by a PDP 11/45 running under the UNIX operating system.

— Kernighan and Plauger
The Elements of Programming Style, 1974

This book was set in Times Roman and Helvetica Regular by the authors, using a Graphic Systems phototypesetter driven by a PDP-11/45 running under the UNIX operating system.

— Kernighan and Plauger
Software Tools, 1976

This book was set in Times Roman and Courier 12 by the authors, using a Graphic Systems phototypesetter driven by a PDP-11/70 running under the UNIX operating system.

— Kernighan and Plauger
The Elements of Programming Style 2nd Ed., 1978

This book was set in Times Roman and Courier by the authors, using a Mergenthaler Linotron 202 phototypesetter driven by a PDP-11/70 running under the UNIX operating system.

— Kernighan and Plauger
Software Tools in Pascal, 1981

This book was typeset in Times Roman and Courier by the authors, using a Mergenthaler Linotron 202 phototypesetter driven by a VAX-11/750 running the 8th Edition of the UNIX operating system.

— Kernighan and Pike
The Unix Programming Environment, 1984

This book was typeset in Times Roman and Courier by the authors, using an Autologic APS-5 phototypesetter and a DEC VAX 8550 running the 9th Edition of the UNIX operating system.

— Aho, Kernighan, Weinberger
The AWK Programming Language, 1988

The book was typeset (pic|tbl|eqn|troff -ms) in Times Roman and Courier by the authors, using an Autologic APS-5 phototypesetter and a DEC VAX 8550 running the 9th Edition of the UNIX operating system.

— Kernighan and Richie
The C Programming Language 2nd Ed., 1988

This book was typeset (grap|pic|tbl|eqn|troff -mpm) in Times New Roman and Lucida Sans Typewriter by the authors.

— Kernighan and Pike
The Practice of Programming, 1999

Typeset by the authors in Minion Pro, Lato, and Consolas, using Go, groff, ghostscript, and a host of other open-source Unix tools. Figures were created in Google Drawings.

— Donovan and Kernighan
The Go Programming Language, 2015

Camera-ready copy for this book was produced by the author in Times Roman and Helvetica, using groff, ghostscript, and other open source Unix tools.

— Kernighan
Millions, Billions, Zillions: Defending Yourself in a World of Too Many Numbers, 2018

Perplexing Footnote

Ossanna and Kernighan’s Troff User’s Manual does not say how it was typeset.


Thanks to Ben Deane for details of The Elements of Programming Style, and to Chris Oldwood for details of The Elements of Programming Style 2nd Edition.

Monday 07 October 2019 The Forest Road Reader, No 2.31 : STinC++ - entab

Software Tools In Pascal’s first chapter races you through five programs, each a little bit more complex than the last, giving you a nice little handful of tools and a tour of the Pascal language. With that extended preamble out of the way, chapter two focuses on filters, programs which read from standard input and write to standard output, making some useful change to the stream of data on the way through. Filters are unremarkable today, so widely has the Unix philosophy become the default model of computing. At the time Kernighan and Plauger were writing it did warrant a brief explanation and this remarkably understated bibliographic note.

The idea of pipelines and filters - a uniform mechanism for connecting the input of one program to the output of another - was conceived by MD McIlroy …​ Pipelines have an especially clean implementation in the UNIX operating system developer at Bell Labs by KL Thompson and DM Richie. Here the processes in the pipeline run concurrently, with sufficient interlocks to ensure that the receiving process never reads information that isn’t there, and that the sending process never creates too much information before the receiver has read some. The vertical bar is the UNIX syntax for a pipeline and is due to Thompson.

When I first read this, I didn’t really register the enormity of what was being described in this little footnote. Now, many years later, I realise that sentence about concurrent processes must have been pretty jolting for people raised on batch processing and job control languages.

Anyway, to the program. Having replaced tabs with spaces, our authors to take no side in that eternal debate and present entab, a program to replace tabs with spaces. They give some reason for wanting to do this, again relating to formatting for a particular type of printer that seems incredibly arcane in these days of wireless printer/scanner combos (like the miraculous the Canon MG5750 which required absolutely no configuration at all to work with all the various computers scattered round the JezUK High Performance Computing Lab).

Kernighan and Plauger foreshadow the further complexity of the upcoming tasks while providing specific advice for implementing entab

The trick to getting most filters right is to find an orderly way of recognizing the components of the input stream, so that the order can be reflected in the flow of control of the program rather than in a collection of switches and flags …​ If we think of the input to entab as a repetition of the pattern: zero or more blanks, followed by a non-blank character, then this determines the control structure of the program.

I described the operation of detab in terms of a sequences, and Kerninghan and Plauger describe entab in a similar way here. As with detab, I’m using std::transform at the core of my program, plugging in a function object. Writing entab was actually a little more subtle than I’d expected, although with the benefit of hindsight if I’d put together a better set of initial test cases I’d have found it easier.

entab.cpp
class entabber {
public:
    std::string operator()(char c) {
        if (stiX::isspace(c)) { (2)
            ++spaces_;

            if (stiX::is_tab_stop(spaces_)) {
                position_ = spaces_;
                return tab; (3)
            }

            return empty; (4)
        }

        std::string output = (spaces_ != position_)
                ? std::string(spaces_ - position_, ' ')
                : empty; (5)
        output += c;

        if (stiX::isnewline(c)) (6)
            position_ = 0;
        else
            ++position_;
        spaces_ = position_; (7)

        return output;
    }
private:
    size_t position_;
    size_t spaces_;
    std::string const tab = "\t";
    std::string const empty;
};

namespace stiX {
    void entab(std::istream &in, std::ostream &out) {
        filter(in, out, entabber());  (1)
    }
}
  1. filter is a new utility function that wraps up the whole std::tranform/std::istreambuf_iterator<char>(in)/std::ostream_iterator<std::string>(out) mouthful into something a little easier on the eye, and gives it a nice descriptive name. As with detab, we need to track some state as we go, so I need function object. entabber has two member variables - position_ which tracks where we are in the current line, and spaces_ which runs ahead of position_ as new spaces arrive. Initially, both these variables are 0.

  2. So here’s our function object, considering each character in the sequence in turn. First question - is it a space? If it is, bump spaces_ forward one.

  3. Have we now accumulated enough spaces to arrive at a tab stop? If so, advance position_ and return a tab.

  4. We didn’t hit a tab stop, so we’re somewhere in the middle. Return nothing.

  5. We’ll reach here only when we’ve got something other than a space. If spaces_ has advanced ahead of position_ we’ve got some spaces buffered up we need to print first, so pop those into our output, before adding the character itself.

  6. At the end of the line? Reset our position_ to the start of the row.

  7. Whatever we’ve done up to now, we’ve got nothing buffered up so reset spaces_ to the current position.

I chose to advance spaces_ ahead of the current position, rather then simply keeping count of the buffered spaces. It makes little difference either way - at some point in the code there will be an ugly point where you have to do a bit of arithmetic involving position_ and spaces_.

Those with long memories will notice I’ve declared entabber as a class, when previously for detabber I used a struct. This little video from Jonathan Boccara was instrumental in changing my mind.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. entab is program 1 of Chapter 2.

Even older than me?

If you are even longer in the tooth than me and were working on the kind of chunky machines that loom silently in the background of Software Tools, I’d love to hear about that. You can email me or get in touch on Twitter.

Endnotes

This whole endeavour relies Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print, but new copies are, frankly, just ridiculously expensive. Happily, here are plenty of second-hand copies floating round, or you can borrow them from the The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software generalist
software created
extended or repaired

Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About