Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Sunday 03 November 2019 The Forest Road Reader, No 2.35 : STinC++ - compress

PROGRAM

compress compress input by encoding repeated characters

USAGE

compress

FUNCTION

compress copies its input to its output, replacing strings of four or more identical characters by a code sequence so that the output generally contains fewer characters that the input. A run of x’s is encoded as ~nx, where the count n is a character: 'A' calls for a repetition of one x, 'B' a repetition of two x’s, and so on. Runs longer than 26 are broken into several shorter ones. Runs of ~'s of any length are encoded.

EXAMPLE

compress
Item    Name         Value
Item~D Name~I Value
1       car          ~$7,000.00
1~G car~J ~A~$7,000.00
<ENDFILE>

BUGS

The implementation assumes 26 legal characters beginning with an A.

Implementation

compress.cpp
#include "compress.h"
#include "../../lib/filter.h"

class compressor {
public:
  compressor(std::ostream& out)
    : out_(out) {
  }
  ~compressor() {
    out_ << repeated(repeat);
  }

  std::string operator()(char c) {
    if (lastChar == c) { (1)
      ++repeat;
      return empty;
    }

    std::string output = repeated(repeat); (2)
    repeat = 0;

    lastChar = c;

    return output;
  }

private:
  std::string repeated(size_t count) {
    if (!lastChar)
      return empty;

    if (count > maxCount) { (3)
      return repeated(maxCount)
          + repeated(count - (maxCount+1));
    }
    if ((count < 3) && (lastChar != marker)) (4)
      return std::string(count+1, lastChar);

    std::string r; (5)
    r += marker;
    r += countEncoding[count];
    r += lastChar;
    return r;
  }

  std::string const countEncoding =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  size_t const maxCount = countEncoding.size() - 1;
  std::string const empty;
  char const marker = '~';

  std::ostream& out_;
  char lastChar = 0;
  size_t repeat = 0;
};

namespace stiX {
  void compress(std::istream &in, std::ostream &out) {
    filter(in, out, compressor(out));
  }
}

The general shape of this code is probably pretty familiar now. It reads the input a character at a time, and spits outs a string in response - the string might be empty, a single character, or several characters.

Given the description in the little man page above, I hope the implementation is straightforward.

  1. If the newly arrived character is the same as the previous character, well, that’s a repetition, so keep a note.

  2. If the new character is different to the previous, then I need to output what ever the previous character was. The repeat member function handles that.

  3. If the number of repetitions is more than the encoding scheme accommodates, encode the longest run possible then recurse to deal with the remaining repetitions.

  4. If the encoded output would be longer than the original - ie there were only one, two, or three repetitions - simply output the original characters. For any kind of normal text, this is the most common case.

  5. After all that, when we reach here there must be encoding a repetition of at least length four.

Kernighan and Plauger’s compress isn’t, by their own lengthy admission, particularly effective, and perhaps the window in which it was ever of use has passed. It is a nice little introduction to run-length encoding schemes though. One feature streaming compression schemes like RLE is that you can’t start to generate output until you’ve consumed at least some of the input. In this case, compress needs to see at least two input characters before any output can be generated. If fed a stream of, say, X after X after X, it will generate no output at all until a different character arrives. There is, of necessity, a certain amount of delay, a degree of implicit buffering. When the end of the input is reached, therefore, there may be a little bit of remaining output that has to be flushed. This is why, for the first time, my function object has a constructor and destructor. It’s great when that happens. Makes me feel like I do have some idea of what I’m doing.

The line

    filter(in, out, compressor(out));

creates a temporary compressor object. It’s constructed immediately before the call to the filter function, and destroyed immediately after the call to filter completes. (In reality, it’s slightly more complex than that, but that’s the essence of it.) The destructor is, therefore, the place to do any clean up needed at the end of the processing, in this case a final call to repeated to flush the last bit of output.

Deterministic destruction is much overlooked feature of C++, at least by those less familiar with the language, and this is an example of quite how powerful they are. In this example, the compressor constructor and destructor provide customisation points either side of the call to filter. I’ve been able to use my existing library code, and my existing approach, but extend it to include a new piece of functionality. Without that ability, I’d have had to discard my existing approach and find something new. When I started working on https://www.jezuk.co.uk/blog/2019/09/sticpp-detab.html`detab` back in August I wasn’t looking ahead to what I might need in the future, so this isn’t the result of some long term plan. It was just there when I needed it. My tests were telling me the last bit of output was missing, and the destructor was just the thing to solve the problem.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. compress is program 3 of Chapter 2.

Library Reference

In a similiar vein

In a similar spirit to this undertaking, here’s a rapid-fire screencast reimplementing the BSD’s cat command in 1994-style C.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software generalist
software created
extended or repaired

Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About