translit x y
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations
STinC++ rewrites the programs in Software Tools in Pascal using C++
After the relative calm of echo
, it all kicks off (or, if you prefer, all comes together) in translit
, the finale of Chapter Two. As its name suggests translit
transliterates certain characters on their way through. “We would like”, say Kernighan and Plauger, “to have a program translit
so that we can write”
translit x y
“and have all occurrences of x
in the standard input be replaced by y
on the standard output.”
Ah, so this is why we took that detour into command line arguments. Given what we’ve been through to get here, though, this task is pretty trivial. But hold on, they’re just getting started.
“Multiple translations are also handy:”
translit xyx XYZ
“would change all lower case x
, y
and z
to the corresponding upper case letter.”
Ok, that’s slightly more involved but not particularly tricky.
“It would be nice”, they continue breezily, “to have a shorthand for alphabets, so”
translit a-z A-Z
“would translate all lower case letters to upper case.”
All right, this still isn’t too difficult. We can deal with the arguments easily enough, and then the transliteration is routine
std::string replace ...
std::string with ...
std::transform(
in,
out,
[&replace, &with](char c) {
auto replacement = replace.find(c);
return replacement != std::string::npos
? with[replacement]
: c;
}
);
We don’t even need to write a function - we do all it all in line with a simple lambda. This seems a touch anticlimactic. But Kernighan and Plauger haven’t finished. Not by a long shot. “There are times when we would like to translate a whole class of characters into just one character, and even to squash runs of that translation into just one instance … We can specify this squashing operation by giving a second argument that is shorter than the first … we could also take [the case when the second argument is not given] as a request to delete all occurrences of characters in the first string …” Easy there fellows “… One final capability is worth adding: sometimes we would like to be able to translate and compress all but a set of characters. For instance”
translit ^a-z -
“would replace strings of non-letters with a dash. And”
translit ^a-z
“would delete all but lower case letters.”
Wait, wait, wait.
When I write my C++ versions of the programs in Software Tools, I don’t translate from the Pascal. Of course, I do read the code along with the commentary, but then I start by writing unit tests to capture the behaviour. Leaving aside argument expansion, for translit
I finished up with five distinct sets of tests -
replacement - where the two inputs, the from-set and the to-set, are the same size
replacement and squashing - where the from-set is longer that the to-set
deletion - where everything in the from-set is removed
negated squashing - where the everything not in the from-set is matched to the last item in the to-set
negated deletion - everything not in the from-set is removed
Obviously the two squashing cases and the two deletions cases are pretty similar, but that still leaves, as far as I was concerned, with three distinct operations. Kernighan and Plauger implement them in one loop
lastto := length(toset)
squash := (length(fromset) > lastto) or (allbut);
repeat
i := xindex(fromset, getc(c), allbut, lastto);
if (squash) and (i>=lastto) and (lastto>0) then begin
putc(toset[lastto]);
repeat
i := xindex(fromset, getc(c), allbut, lastto);
until (i < lastto)
end;
if (c <> ENDFILE) then begin
if (i > 0) and (lastto > 0) then { translate }
putc(toset[i])
else if (i = 0) then { copy }
putc(c)
{ else delete }
end
until (c = ENDFILE)
As you read this section of the book, you see the program grow a little bit at a time from an initial half a dozen lines to the code above, along with all the preamble to handle the command line arguments, the inner works of xindex
and all the rest of it. By the time you get to it, this piece of code isn’t difficult at all because you know it inside out. Coming to it cold, it’s pretty daunting. There are two different flags, nested loops, an if-ladder nested within another if, and assignment through out parameters. It’s compact and dense and not entirely straightforward.
Partly because I was starting from a different place, partly because I wanted to keep the std::transform
model I’d adopted for the preceding filters, and partly because I wanted to try and hoist as many of the conditions up and away from the main body of the code, I finished up with something rather different.
void stiX::translit(
const std::string& replace,
const std::string& with,
stiX::Selection selectionMode,
std::istream& in,
std::ostream& out
) {
auto matcher = (selectionMode == stiX::Selection::Normal)
? in_range
: out_of_range;
if (with.empty())
removeCharacters(replace, matcher, in, out);
else if (replace.size() > with.size())
transliterateWithSquash(replace, with, matcher, in, out);
else
transliterate(replace, with, in, out);
}
The argument expansion populates the replace
and with
strings. It’s also sets selectionMode
based on the presence of the leading ^
character. Not shown here are the implementations of the in_range
and out_of_range
functions, which determine whether a particular character is within or not within a given string. The variable matcher
then is a function pointer, and I implement the normal or negated search by pointing it to the appropriate function.
The if-ladder then determines which mode translit
has been invoked in - deletion, squashing, or simple replacement. In the Pascal implementation, this if-ladder is at the heart of the read loop, traversed for every character. I’ve hoisted it right up and out - we decide here, then never have to worry about it again.
The three modes are then implemented in three separate functions. This turned out pretty well. Two of the cases - replacing and removing - can be implemented using functions provided by the standard library. Replacement uses std::transform
, as shown above, while removing uses the catchily named std::remove_copy_if
.
void removeCharacters(
const std::string& remove,
Matcher matcher,
std::istream& in, std::ostream& out
) {
std::remove_copy_if(
std::istreambuf_iterator(in),
std::istreambuf_iterator<char>(),
std::ostreambuf_iterator(out),
[&remove, &matcher](char c) {
return matcher(remove.find(c));
}
);
}
std::remove_copy_if
copies the elements in the range, except those elements for which predicate returns true. In this case, the predicate is a little lambda wrapped around the matcher
function pointer which we set up earlier.
That leaves squashing. The implementation is similar to that of entab
. When entab
encounters a space in the input, it switches into entabbing mode until it encounters a non-space or hits a tab stop. For squashing, when we encounter one of our from range, we flip into squash mode until we see a character that isn’t in the range. There’s a bit of house keeping to do then, but it’s straightforward enough.
std::string operator()(char c) {
auto replacement = replace_.find(c);
if (!matcher(replacement)) {
squashing_ = false;
return std::string(1, c);
} // nothing to do
if (
(replacement >= boundary_)
&& squashing_
) {
return std::string();
} // already squashing - do nothing
squashing_ = (replacement >= boundary_);
return std::string(
1,
squashing_ ? squash_ : with_[replacement]
);
} // operator
translit
was, by a distance, the most difficult program so far. Even with a reference program to read, I found it hard work. For a while, I doubted my dogmatic approach to filters, to sticking to a core built around std::transform
, but I think in the end it paid off.
Kernighan and Plauger acknowledge that translit
is based on “a program written originally by M.D. McIlroy” (that program is tr
and M.D. McIlroy is better known as Doug), and there’s an argument to be made that translit
suffers a little from feature envy, that it pulls in too much, and that its various modes of operation are not quite as overlapping as they’re presented. On the other hand, translit
does bring together the various different types of filters examined in the chapter and uses them in a single program.
By splitting out those different modes into distinct functions, I feel the code is clearer and simpler, and the cognitive load on the reader is reduced. The simple cases are simple, and the complicated case is made simpler because it’s not competing with the other two. And should Kernighan and Plauger produce a third revision of Software Tools adding another mode of operation to translit
(after all, tr
has four!), I’ll be in a good place to safely implement it.
Source code for this program, indeed the whole project, is available in the stiX GitHub repository. translit
is program 6 of Chapter 2.
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations