String previousLine = null;
while ( /* there are more lines */ ) {
String currentLine = readLine(...);
if (!currentLine.equals(previousLine)) {
/* print out */
}
previousLine = currentLine;
}
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations
STinC++ rewrites the programs in Software Tools in Pascal using C++
One common reason for sorting is to bring together all occurrences of a particular item so they can be treated as a group. Sometimes this is done just to discard all but one occurrence in a group … It’s certainly easy to add an option to
sort
which discards multiple occurrences of an input line.
This is true …
But should this code be part of
sort
at all?
Kernighan and Plauger argue, quite strongly, that in this case the answer is no. They don’t quite go to premature optimisation is the route of all evil but they do say
Combining functions too early is a mistake.
They also lean heavily on the possibility of reuse.
By separating the function of stripping duplicates from that of sorting, we can do things which are not possible when they are combined.
I’m often wary of arguments about enabling reuse, but in this case it’s really relevant and what’s more actually likely to happen. We’re building tools, and tools can be combined in ways we can’t necessarily anticipate. By keeping our duplicate stripping functionality separate from sorting, we allow people to do things which might not be possible were they squished together in one package.
PROGRAM |
|
USAGE |
|
FUNCTION |
|
EXAMPLE |
To eliminate duplicate lines in the output of a program: |
program | sort | unique |
We’re back in familiar filtering ground here. We’re reading a sequence of lines of text, doing some processing on each line, and then spitting the result back out. The processing here is removing duplicates, and that’s straightforward:
Read a line
If this line is different from the previous line, print it out
Loop back around until there’s nothing more to read
We do have small bootstrapping problem though - what do we do for the very first line? What do we compare the first line we read with? In a language like, say, Java we can lean on null
and do something like this
String previousLine = null;
while ( /* there are more lines */ ) {
String currentLine = readLine(...);
if (!currentLine.equals(previousLine)) {
/* print out */
}
previousLine = currentLine;
}
In Java, we’re never going to read a null
string from the input, so we can safely use that as our initial value. For the first line we read currentLine.equals(previousLine)
will never be true, so it gets printed out and the program proceeds properly.
In C++ strings are values rather than references. If I declare a string
std::string str;
then str
has a value. The value happens to be the empty string, but it’s still a value. What’s more, it’s perfectly possible that we might read empty strings from our input.
So what to do? Do we need to apply some special processing to the very first line, perhaps using a flag to indicate this is first time through?
auto const eof = std::char_traits<char>::eof();
void stiX::unique(std::istream& in, std::ostream& out) {
auto previous_line = std::string { };
bool first_line = true;
while(in.peek() != eof) {
auto current_line = stiX::getline(in);
if (previous_line != current_line || first_line)
out << current_line << "\n";
previous_line = current_line;
first_line = false;
};
}
This works as we want, but that first_line
flag really grates on me. It’s a wart.
When we write something like
auto str = std::string { "I am a string" };
we’re reaching back into C++'s deep past. C has no real string type, instead kind of bodging through with arrays of characters. A sequence of characters in double quotes, like "I am a string"
, is stored by C as an array of characters containing the characters of the string and, crucially, terminated with an extra '\0'
character as an end of string marker. This behaviour is baked into the C language and library.
C++ preserves this behaviour and these functions - we couldn’t have C interop without that - and a C++ std::string
can happily be initialised with a C style character array. However, a std::string
isn’t just a simple wrapper around char*
, it’s a little more sophisticated than that. Significantly, once it’s initialised and knows its own length, it does not rely on the '\0'
character as a terminator. Even more than that, std::string
is happy to contain multiple '\0'
within its value.
Generally, when we create a std::string
we’re initialising from a char*
, copying another std::string
, or just bringing up a default empty string. But std::string
has a several other constructors, including
std::string(size_type count, char ch);
This constructs a string containing count
copies of the character ch
. std::string(5, 'a')
would be, therefore, "aaaaa"
. Excitingly std::string(1, '\0')
constructs a string containing a single '\0'
character. This is something you wouldn’t ever ordinarily encounter, especially when, for example, reading text from an istream
.
That’s a rather round-the-houses way of saying we can eliminate the first_line
flag using our weird string of '\0'
.
auto const eof = std::char_traits<char>::eof();
void stiX::unique(std::istream& in, std::ostream& out) {
auto previous_line = std::string { 1, '\0' };
while(in.peek() != eof) {
auto current_line = stiX::getline(in);
if (previous_line != current_line)
out << current_line << "\n";
previous_line = current_line;
};
}
I can’t quite decide if this is a neat use of an intended feature, or a crime against code. Possibly both. It’s definitely well-formed C++, but it does feel kind of funky. If it moves you, in a pleasant or uncomfortable direction, I’d be delighted to hear from you.
Back in 1982, AT&T produced a video to promote Unix. A few minutes in Brian Kernighan, his feet up on his desk, demonstrates the power of pipelines by combining various programs, including sort
and unique
, to build a spell checker. It’s really quite a remarkable demo, and he explains it brilliantly. The whole film, which runs just shy of 30 minutes, is a fascinating historical artefact.
The full source code, with tests and what-not, for unique, and indeed the whole project, is available in the stiX GitHub repository.
This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.
For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license for. I use it all the time now.
The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.
Freelance software grandad
software created
extended or repaired
Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations