Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Monday 28 December 2020 The Forest Road Reader, No 2.58 : STinC++ - in-memory text sort

STinC++ rewrites the programs in Software Tools in Pascal using C++

After a couple of warm-ups (from which, if we’re honest, I’ve probably cooled back down from), the first proper program in Chapter 4 of Software Tools In Pascal is sort, a program that sorts a text file line by line into increasing lexicographic order. Kernighan and Plauger present this as a filter even though, they point out, it isn’t really a filter because it can’t output anything until its input is complete. (That little note, along with phrases like increasing lexicographic order, is the kind of precise language that just fills me with reading pleasure.)

The first iteration of sort assumes the data fits in memory. We’ll read the whole of the input in one pass, sort it, then spit it out again.

Data Representation

Pascal’s lack of a proper string type has hampered Kernighan and Plauger throughout, and here it presents a real issue. Using fixed sized character arrays as a substitute string type just isn’t going to cut it. For short strings they’ll waste memory, which is a scarce resource, and for longer strings present a further problem they would need to resolve somehow.

For them, then, their first task is come up with a suitable data representation. They choose to read the input into a single large array, then keep a second array of indexes pointing the start of each line.

Here in the future, I can just write:

auto lines = std::vector<std::string> { };
Sort implementation

Having already written a Shell sort earlier in the chapter, Kernighan and Plauger now have to specialise it for their new data structure. It needs to be extended to extract each line under consideration, then further to compare the lines, and finally how to swap lines. This touches almost every part of the sort routine. They do have one small win - swapping two lines only involves switching the indexes rather than fiddling inside the large array.

I need to specialise my sort too …​

std::sort(lines.begin(), lines.end());
Input and Output

Having considered their data structure and sort, Kernighan and Plauger need to write procedures to read standard in and populate their data structures, and also to print it back out again.

Let’s have a go at that too…​

// read
while (std::cin)
  lines.emplace_back(stiX::getline(std::cin));

// write
std::copy(
  lines.begin(),
  lines.end(),
  std::ostream_iterator(std::cout, "\n");
);
  • stiX::getline is a tiny convenience wrapper around std::getline

  • I’ve went back and forth on emplace_back and push_back here. I went with emplace_back seems to better express intention, but I’d be hard pressed to properly explain why.

In memory sort filter

Arranging the snippets above in the proper order gives

std::vector<std::string> read_lines();
void write_lines(std::vector<std::string> const &lines);

int main() {
  auto lines = read_lines();

  std::sort(lines.begin(), lines.end());

  write_lines(lines);
} // main

std::vector<std::string> read_lines() {
  auto lines = std::vector<std::string> { };

  while (std::cin)
    lines.emplace_back(stiX::getline(std::cin));

  return lines;
} // read_lines

void write_lines(std::vector<std::string> const &lines) {
  std::copy(
    lines.begin(),
    lines.end(),
    std::ostream_iterator<std::string>(std::cout, "\n")
  );
} // write_lines

If we ignore function declarations and pretty printing, this entire program is five lines of code.

The main body of Kernighan and Plauger’s code is equally concise

begin
    if (gtext(linpos, nlines, linbuf, STDIN) = ENDFILE) then begin
        shell(linpos, nlines, linbuf);
        ptext(linpos, nlines, linbuf, STDOUT)
    end
    else
        error('sort: input too big to sort')
end;

Unfortunately, they had to supply all the code behind the gtext, quick, and ptext calls too, while I leaned on standard library components. Expanded out, their full program is somewhat heftier.

{ Copyright (C) 1981 by Bell Laboratories, Inc., and Whitesmiths Ltd. }
{ sort -- sort text lines in memory }
procedure inmemsort;
const
    MAXCHARS = 10000;    { maximum # of text characters }
    MAXLINES = 300;        { maximum # of lines }
type
    charbuf = array [1..MAXCHARS] of character;
    charpos = 1..MAXCHARS;
    posbuf = array [1..MAXLINES] of charpos;
    pos = 0..MAXLINES;
var
    linebuf : charbuf;
    linepos : posbuf;
    nlines : pos;

{ gtext -- get text lines into linebuf }
function gtext (var linepos : posbuf; var nlines : pos;
        var linebuf : charbuf; infile : filedesc) : boolean;
var
    i, len, nextpos : integer;
    temp : string;
    done : boolean;
begin
    nlines := 0;
    nextpos := 1;
    repeat
        done := (getline(temp, infile, MAXSTR) = false);
        if (not done) then begin
            nlines := nlines + 1;
            linepos[nlines] := nextpos;
            len := length(temp);
            for i := 1 to len do
                linebuf[nextpos+i-1] := temp[i];
            linebuf[nextpos+len] := ENDSTR;
            nextpos := nextpos + len + 1  { 1 for ENDSTR }
        end
    until (done) or (nextpos >= MAXCHARS-MAXSTR)
            or (nlines >= MAXLINES);
    gtext := done
end;

{ shell -- ascending Shell sort for lines }
procedure shell (var linepos : posbuf; nlines : integer;
        var linebuf : charbuf);
var
    gap, i, j, jg : integer;
{ cmp -- compare linebuf[i] with linebuf[j] }
function cmp (i, j : charpos; var linebuf : charbuf)
        : integer;
begin
    while (linebuf[i] = linebuf[j])
      and (linebuf[i] <> ENDSTR) do begin
        i := i + 1;
        j := j + 1
    end;
    if (linebuf[i] = linebuf[j]) then
        cmp := 0
    else if (linebuf[i] = ENDSTR) then    { 1st is shorter }
        cmp := -1
    else if (linebuf[j] = ENDSTR) then    { 2nd is shorter }
        cmp := +1
    else if (linebuf[i] < linebuf[j]) then
        cmp := -1
    else
        cmp := +1
end;

{ exchange -- exchange linebuf[lp1] with linebuf[lp2] }
procedure exchange (var lp1, lp2 : charpos);
var
    temp : charpos;
begin
    temp := lp1;
    lp1 := lp2;
    lp2 := temp
end;
begin
    gap := nlines div 2;
    while (gap > 0) do begin
        for i := gap+1 to nlines do begin
            j := i - gap;
            while (j > 0) do begin
                jg := j + gap;
                if (cmp(linepos[j],linepos[jg],linebuf)<=0) then
                    j := 0    { force loop termination }
                else
                    exchange(linepos[j], linepos[jg]);
                j := j - gap
            end
        end;
        gap := gap div 2
    end
end;

{ ptext -- output text lines from linebuf }
procedure ptext (var linepos : posbuf; nlines : integer;
        var linebuf : charbuf; outfile : filedesc);
var
    i, j : integer;
begin
    for i := 1 to nlines do begin
        j := linepos[i];
        while (linebuf[j] <> ENDSTR) do begin
            putcf(linebuf[j], outfile);
            j := j + 1
        end
    end
end;

begin
    if (gtext(linepos, nlines, linebuf, STDIN)) then begin
        shell(linepos, nlines, linebuf);
        ptext(linepos, nlines, linebuf, STDOUT)
    end
    else
        error('sort: input too big to sort')
end;

At around 100 lines it’s not a massive program, but it’s still an undertaking. I’d be feeling pretty pleased with myself to have written something of this size and be confident in it inside an intense day.

Joy To The World

What I actually did was write my program directly into the text of this article. I then cut and pasted it across to my IDE, and it turned out I was correct pretty much first go. The only change I needed to make was to add the newline separator into the ostream_iterator.

Taking pieces you have, which you know to be correct and efficient, and plugging them together is almost the platonic ideal of programming. Join me now in sending good thoughts to the language designers, compiler writers, library implementors, and all those involved in providing the fantastic programming tools we have at our fingertips. Thanks friends!

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, just ridiculously expensive. Happily, there are plenty of second-hand copies floating round, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and renew a license for, because it’s gone a year already.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it’ll be my first choice in the future.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed