Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Thursday 05 March 2020 The Forest Road Reader, No 2.43 : STinC++ - compare

STinC++ rewrites the programs in Software Tools in Pascal using C++

Here we are at the start of Software Tools in Pascal’s third chapter, entitled Files. It’s all very well reading one input and writing one output, but eventually we’re going to want to access “some kind of permanent file system where information can be kept for extended periods on secondary storage”.

In their opening paragraphs Kernighan and Plauger outline the territory - each operating system has its own jargon for describing file system operations, each has its own capabilities and limitations, some operations are easier on this system than that. This is still true, perhaps even more so given that there isn’t a one-to-one mapping of operating systems to filesystems, although happily most of us are able to ignore that in our daily programming lives.

System variations, while a pain, were not Kernighan and Plauger’s biggest obstacle - file system access was just not a thing Pascal concerned itself with. When I was taught Pascal some years after Software Tools in Pascal was written, the language support for file access could best be described as grudging. K&P handle the problem in the same way as they handled console i/o in earlier chapters, presenting a library of file access primitives. The primitives are, unsurprisingly, very similar to those provided in C, and much of the chapter is spent on describing their operation and the operation of file systems in general.

In keeping with their step-by-step approach, Chapter 3’s first program is compare, a file comparison program. Our program needs to read two files, but we can reuse the console output we’re familiar with.

Comparing two files

Because this is Software Tools in Pascal and not, say, The Art of Computer Programming, and the point of the chapter is file handling not algorithm development, the approach to file comparison is not sophisticated.

As long as the two lines are identical, no output will be produced; if [the files] end together, all is well, even if both are empty. If one file is a prefix of the other, then a message to that effect is printed. Finally, if the files differ anywhere in some line, the differing lines will be printed.

compare.cpp
void compare(
    std::string left_file_name,
    std::istream& left,
    std::string right_file_name,
    std::istream& right,
    std::ostream& out
) {
    size_t line_count = 1;
    do {
        auto left_line = getline(left);
        auto right_line = getline(right);

        if (left.eof() && !right.eof()) {
            out << end_of_file << left_file_name << '\n';
            return;
        }
        if (!left.eof() && right.eof()) {
            out << end_of_file << right_file_name << '\n';
            return;
        }

        if (left_line != right_line) {
            out << line_count << ':' << '\n'
                << left_line << '\n'
                << right_line << '\n';
        }

        ++line_count;
    } while (left && right);
} // compare

std::string getline(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
} // getline

It’s not often you get to write a do while loop. Exciting times! You can see this lines up exactly with Kernighan and Plauger’s description:

  • read a line from each input stream

  • check to see if one or other stream is at the end

  • compare the two lines, and output a message if they differ

  • repeat while there is still input to read

You’ll also note, file names aside, there’s no mention of files anywhere in the code. The input stream abstract, familiar from out earlier endeavours shields us from it. If we really needed to get down into the details of a file - how big it was, or where are we in the file say - then, sure, we’d need to be dealing with a some kind of handle to the file itself, but here we’re not. We just need to be able to read each file until there’s nothing left, and the stream abstraction gives us exactly that.

I used the non-member std::getline in preference to the std::istream::getline member function because it deals in std::string rather than a char* and a count, because we’re living in the 21st century and there’s no need to be messing around with raw memory pointers in a program like this. I then threw a little wrapper around std::getline because the return value by reference feels ugly.

To connect the compare function up to a couple of files, we just need a little bit of extra set up.

1_compare.cpp
int main(int argc, char const* argv[]) {
    auto arguments = stiX::make_arguments(argc, argv);
    if (arguments.size() != 2) {
        std::cout << argv[0] << " file1 file2" << std::endl;
        return 0;
    }

    auto left = openfile(arguments[0]);
    auto right = openfile(arguments[1]);

    if (!left || !right)
        return 0;

    stiX::compare( (3)
        arguments[0],
        left,
        arguments[1],
        right,
        std::cout
    );
}

std::ifstream openfile(std::string const& filename) {
    std::ifstream f(filename); (1)
    if (!f) (2)
        std::cout << "Could not open '" << filename << "'\n";
    return f;
}
  1. An std::ifstream is an input file stream. Passing a filename to the constructor attempts to open the file, ready for reading.

  2. An istream is good if it is readable and implicitly convertible to true. If our newly created stream is not good, we just give up and display an error message. (Note, however, that an istream that is not good is not necessarily bad.)

  3. With two open files, hand off to our compare function.

It does a job, but it’s no diff

The comparison algorithm used here is rudimentary at best. The program compare is mimicking this time is, of course, diff, which is another of Doug McIlroy’s many contributions to programmers the world over. The algorithm McIlroy and his colleague William Hunt developed, later described in (the short and very readable) An Algorithm for Differential File Comparison, is the basis of pretty much all file differencing and source code control tools, at least as far as I’m aware, as well as finding applications in funky things like analysing DNA sequences.

I once developed a three-way XML document diff and merge tool and it too used the Hunt-McIlroy algorithm, although I went the long way round to get it. I found a diff implementation written in Perl, which itself might have been a port from Smalltalk, and without any real understanding of how it worked translated that into Java. I then jabbed at for several days until it produced reasonable outputs. The end result was, I thought, a pretty decent bit of kit, although it absolutely failed to make my fortune.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. compare is first program of Chapter 3.

Library Reference

Endnotes

This whole endeavour relies Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print, but new copies are, frankly, just ridiculously expensive. Happily, here are plenty of second-hand copies floating round, or you can borrow them from the The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++

Wednesday 01 January 2020 The Forest Road Reader, No 2.42 : First Foot

At the beginning of last year, my humble missive was Tharg’s Letter of the Week in Prog 2114 of the Galaxy’s Greatest Comic, 2000AD. It was a real thrill, which continues to bring me joy.

Towards the end of the year, I was called "a total moron" by an anonymous Redditor, in what was their first post in five years and only their second ever. I’m delighted my talk at the ACCU Conference moved them so deeply.

I hope you found similar little personal high points in an otherwise globally gloomy year. Hey, let me know if you like.

Eyes on the horizon. Look after yourselves.

Friday 27 December 2019 The Forest Road Reader, No 2.41 : In Clown’s Clothing

Fluffy-haired and youthful at forty-eight, and with a vocabulary peppered with archaic expostulations - Balderdash! Tommy-rot!! Oh my giddy aunt!!! - Peter Judd has long established himself as the unthreatening face of the old-school right, popular enough with the Great British Public, which thought him an amiable idiot, to make a second living outside Parliament as a rent-a-quote-media-whore-cum-quiz-show-panel-favourite, and to get away with minor peccadilloes like dicking his kids' nanny, robbing the taxman blind, and giving his party leader conniptions with off-script flourishes. ('Damn fine city,' he’d remarked on a trip to Paris. 'Probably worth defending next time.') Not everyone who’d worked with him thought him a total buffoon, and some who’d witnessed him lose his temper suspected him of political savvy, but by and large PJ seemed happy with the image he’d either fostered or been born with: a loose cannon with a floppy haircut and bicycle.

That’s from Mick Herron’s Slow Horses, a thriller about washed-up spies. Peter Judd, clearly a Boris Johnson analogue, plays a small but key role, as a go-between able to tip off an extreme nationalist cell that they’ve been infiltrated by the security services.

'Because we both know the tide’s turning. The decent people of this country are sick to death of being held hostage by mad liberals in Brussels, and the sooner we take control over our own future, our own borders …​'

'Are you seriously lecturing me?'

'It’ll happen, and within the lifetime of your government. We both know that. Not this Parliament, but probably the next. By which time we both know where you expect to be living, and it won’t be Islington, will it? …​ It’ll be Downing Street.'

'Yes. Well.' The effing and blinding PJ of ten minutes ago …​ left the room; in his place was the bumbly figure familiar from countless broadcasts and not a few YouTube moments. 'Obviously, if called upon to serve, I’ll leave my plough.'

'And you’ll want to take your party further to the right …​'

Despite being too toxic to even stand as leader of the Conservative Party just three years ago, Boris Johnson, of course, become Prime Minister earlier this year and he and his party have just won the General Election by a first-past-the-post distorted distance. The current Conservative Party seems be as predicted by Herron, with a number of new MPs following Johnson’s what-the-papers-call-controversial lead, and far-right nutjobs queuing up to try and get a party card.

Slow Horses was written in 2010, so full marks to Herron for his grim prognostication. Look, I’m not saying it’s his fault the country’s being driving towards the cesspit by a bunch of deeply unpleasant shitsacks or anything, but fingers-bloody-crossed PJ gets his comeuppance in one of Herron’s subsequent books.

Hold on tight.

Thursday 28 November 2019 The Forest Road Reader, No 2.40 : STinC++ - Chapter Two Wrap Up

STinC++ rewrites the programs in Software Tools in Pascal using C++

  • entab - replacing spaces with tabs

  • overstrike - crazy old school print preprocessing

  • compress - simple run length encoding

  • expand - and simple run length decoding

  • echo - command line argument handling

  • translit - a positive hydra of character substitution and deletion

Each program in chapter one was progressively a little more complex than the one before. This is not the case in Chapter Two, where the various programs form more of a loose map of the territory. Here we have a program that replaces many characters with a single character, while over there we have a program that might produce many lines of output for a single line of input, and so on.

This chapter is, without directly saying so, pretty explicitly about the Unix philosophy. While Unix systems (or at least Unix-ish systems) are basically the substrate of most modern computing (yes, even if you’re a Windows, Android, or iOS user) in 1981, when Software Tools in Pascal was published, there were “800 systems …​ in use outside [Bell Labs]”. Kernighan and Pike have to take time to describe what’s now commonplace - combining small programs with pipelines, text as common currency, indeed the whole idea of software tools and of the programmer as tool user.

When I first read Software Tools in Pascal, I’d just started my first substantial job on a Unix system (in fact four Unix systems, all slightly different) and I was only passingly familiar with a handleful of command line tools. The idea that these small programs not only did simple things, but could be combined like some kind of giant tokusatsu robot to do very powerful things was really quite startling. It certainly changed, and continues to influence, how I think about programming and I have no doubt I’m a better programmer for that. I’m not sure I’ll ever touch the heights achieved by my friend Thomas in his masterful He Sells Shell Scripts To Intersect Sets, in which he combines standard Unix tools to do full on computer science, but I’m nearer than I otherwise would be.

Next up, files!

Endnotes

This whole endeavour relies Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They’re both still in print but are ruinously expense. Happily, there are plenty of second-hand editions floating round.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license. CLion uses CMake to build projects. My previous flirtations with CMake, admittedly many years ago, weren’t a huge success. Not so this time - it’s easy to use and works a treat.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this is my first time really using it. I like it and will use it again.


Tagged code, and software-tools-in-c++

Tuesday 26 November 2019 The Forest Road Reader, No 2.39 : STinC++ - translit

STinC++ rewrites the programs in Software Tools in Pascal using C++

After the relative calm of echo, it all kicks off (or, if you prefer, all comes together) in translit, the finale of Chapter Two. As its name suggests translit transliterates certain characters on their way through. “We would like”, say Kernighan and Plauger, “to have a program translit so that we can write”

  translit x y

“and have all occurrences of x in the standard input be replaced by y on the standard output.”

Ah, so this is why we took that detour into command line arguments. Given what we’ve been through to get here, though, this task is pretty trivial. But hold on, they’re just getting started.

“Multiple translations are also handy:”

  translit xyx XYZ

“would change all lower case x, y and z to the corresponding upper case letter.”

Ok, that’s slightly more involved but not particularly tricky.

“It would be nice”, they continue breezily, “to have a shorthand for alphabets, so”

  translit a-z A-Z

“would translate all lower case letters to upper case.”

All right, this still isn’t too difficult. We can deal with the arguments easily enough, and then the transliteration is routine

    std::string replace ...
    std::string with ...

    std::transform(
        in,
        out,
        [&replace, &with](char c) {
            auto replacement = replace.find(c);
            return replacement != std::string::npos
                ? with[replacement]
                : c;
        }
    );

We don’t even need to write a function - we do all it all in line with a simple lambda. This seems a touch anticlimactic. But Kernighan and Plauger haven’t finished. Not by a long shot. “There are times when we would like to translate a whole class of characters into just one character, and even to squash runs of that translation into just one instance …​ We can specify this squashing operation by giving a second argument that is shorter than the first …​ we could also take [the case when the second argument is not given] as a request to delete all occurrences of characters in the first string …​” Easy there fellows “…​ One final capability is worth adding: sometimes we would like to be able to translate and compress all but a set of characters. For instance”

  translit ^a-z -

“would replace strings of non-letters with a dash. And”

  translit ^a-z

“would delete all but lower case letters.”

Wait, wait, wait.

When I write my C++ versions of the programs in Software Tools, I don’t translate from the Pascal. Of course, I do read the code along with the commentary, but then I start by writing unit tests to capture the behaviour. Leaving aside argument expansion, for translit I finished up with five distinct sets of tests -

  • replacement - where the two inputs, the from-set and the to-set, are the same size

  • replacement and squashing - where the from-set is longer that the to-set

  • deletion - where everything in the from-set is removed

  • negated squashing - where the everything not in the from-set is matched to the last item in the to-set

  • negated deletion - everything not in the from-set is removed

Obviously the two squashing cases and the two deletions cases are pretty similar, but that still leaves, as far as I was concerned, with three distinct operations. Kernighan and Plauger implement them in one loop

extract from translit.pas
    lastto := length(toset)
    squash := (length(fromset) > lastto) or (allbut);
    repeat
        i := xindex(fromset, getc(c), allbut, lastto);
        if (squash) and (i>=lastto) and (lastto>0) then begin
            putc(toset[lastto]);
            repeat
                i := xindex(fromset, getc(c), allbut, lastto);
            until (i < lastto)
        end;
        if (c <> ENDFILE) then begin
            if (i > 0) and (lastto > 0) then  { translate }
                putc(toset[i])
            else if (i = 0) then { copy }
                putc(c)
            { else delete }
        end
    until (c = ENDFILE)

As you read this section of the book, you see the program grow a little bit at a time from an initial half a dozen lines to the code above, along with all the preamble to handle the command line arguments, the inner works of xindex and all the rest of it. By the time you get to it, this piece of code isn’t difficult at all because you know it inside out. Coming to it cold, it’s pretty daunting. There are two different flags, nested loops, an if-ladder nested within another if, and assignment through out parameters. It’s compact and dense and not entirely straightforward.

Partly because I was starting from a different place, partly because I wanted to keep the std::transform model I’d adopted for the preceding filters, and partly because I wanted to try and hoist as many of the conditions up and away from the main body of the code, I finished up with something rather different.

extract from translit.cpp
    void stiX::translit(
        const std::string& replace,
        const std::string& with,
        stiX::Selection selectionMode,
        std::istream& in,
        std::ostream& out
    ) {
        auto matcher = (selectionMode == stiX::Selection::Normal)
            ? in_range
            : out_of_range;

        if (with.empty())
            removeCharacters(replace, matcher, in, out);
        else if (replace.size() > with.size())
            transliterateWithSquash(replace, with, matcher, in, out);
        else
            transliterate(replace, with, in, out);
    }

The argument expansion populates the replace and with strings. It’s also sets selectionMode based on the presence of the leading ^ character. Not shown here are the implementations of the in_range and out_of_range functions, which determine whether a particular character is within or not within a given string. The variable matcher then is a function pointer, and I implement the normal or negated search by pointing it to the appropriate function.

The if-ladder then determines which mode translit has been invoked in - deletion, squashing, or simple replacement. In the Pascal implementation, this if-ladder is at the heart of the read loop, traversed for every character. I’ve hoisted it right up and out - we decide here, then never have to worry about it again.

The three modes are then implemented in three separate functions. This turned out pretty well. Two of the cases - replacing and removing - can be implemented using functions provided by the standard library. Replacement uses std::transform, as shown above, while removing uses the catchily named std::remove_copy_if.

extract from translit.cpp
    void removeCharacters(
        const std::string& remove,
        Matcher matcher,
        std::istream& in, std::ostream& out
    ) {
        std::remove_copy_if(
            std::istreambuf_iterator(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator(out),
            [&remove, &matcher](char c) {
                return matcher(remove.find(c));
            }
        );
    }

std::remove_copy_if copies the elements in the range, except those elements for which predicate returns true. In this case, the predicate is a little lambda wrapped around the matcher function pointer which we set up earlier.

That leaves squashing. The implementation is similar to that of entab. When entab encounters a space in the input, it switches into entabbing mode until it encounters a non-space or hits a tab stop. For squashing, when we encounter one of our from range, we flip into squash mode until we see a character that isn’t in the range. There’s a bit of house keeping to do then, but it’s straightforward enough.

extract from translit.cpp
    std::string operator()(char c) {
        auto replacement = replace_.find(c);

        if (!matcher(replacement)) {
            squashing_ = false;
            return std::string(1, c);
        } // nothing to do

        if (
            (replacement >= boundary_)
            && squashing_
        ) {
            return std::string();
        } // already squashing - do nothing

        squashing_ = (replacement >= boundary_);
        return std::string(
            1,
            squashing_ ? squash_ : with_[replacement]
        );
    } // operator

I Don’t Know About You, But I’m Exhausted

translit was, by a distance, the most difficult program so far. Even with a reference program to read, I found it hard work. For a while, I doubted my dogmatic approach to filters, to sticking to a core built around std::transform, but I think in the end it paid off.

Kernighan and Plauger acknowledge that translit is based on “a program written originally by M.D. McIlroy” (that program is tr and M.D. McIlroy is better known as Doug), and there’s an argument to be made that translit suffers a little from feature envy, that it pulls in too much, and that its various modes of operation are not quite as overlapping as they’re presented. On the other hand, translit does bring together the various different types of filters examined in the chapter and uses them in a single program.

By splitting out those different modes into distinct functions, I feel the code is clearer and simpler, and the cognitive load on the reader is reduced. The simple cases are simple, and the complicated case is made simpler because it’s not competing with the other two. And should Kernighan and Plauger produce a third revision of Software Tools adding another mode of operation to translit (after all, tr has four!), I’ll be in a good place to safely implement it.

Source Code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. translit is program 6 of Chapter 2.


Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software generalist
software created
extended or repaired

Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About