Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Friday 23 April 2021 The Forest Road Reader, No 2.62 : STinC++ - unique

STinC++ rewrites the programs in Software Tools in Pascal using C++

One common reason for sorting is to bring together all occurrences of a particular item so they can be treated as a group. Sometimes this is done just to discard all but one occurrence in a group …​ It’s certainly easy to add an option to sort which discards multiple occurrences of an input line.

This is true …​

But should this code be part of sort at all?

Kernighan and Plauger argue, quite strongly, that in this case the answer is no. They don’t quite go to premature optimisation is the route of all evil but they do say

Combining functions too early is a mistake.

They also lean heavily on the possibility of reuse.

By separating the function of stripping duplicates from that of sorting, we can do things which are not possible when they are combined.

I’m often wary of arguments about enabling reuse, but in this case it’s really relevant and what’s more actually likely to happen. We’re building tools, and tools can be combined in ways we can’t necessarily anticipate. By keeping our duplicate stripping functionality separate from sorting, we allow people to do things which might not be possible were they squished together in one package.

unique

PROGRAM

unique delete adjacent duplicate lines

USAGE

unique

FUNCTION

unique writes to its output only the first line from each group of adjacent identical input lines. It is most useful for text that has been sorted to bring identical lines together; in this case it passes through only unique instances of input lines.

EXAMPLE

To eliminate duplicate lines in the output of a program:

program | sort | unique
Stripping duplicates

We’re back in familiar filtering ground here. We’re reading a sequence of lines of text, doing some processing on each line, and then spitting the result back out. The processing here is removing duplicates, and that’s straightforward:

  • Read a line

  • If this line is different from the previous line, print it out

  • Loop back around until there’s nothing more to read

We do have small bootstrapping problem though - what do we do for the very first line? What do we compare the first line we read with? In a language like, say, Java we can lean on null and do something like this

String previousLine = null;

while ( /* there are more lines */ ) {
  String currentLine = readLine(...);

  if (!currentLine.equals(previousLine)) {
    /* print out */
  }

  previousLine = currentLine;
}

In Java, we’re never going to read a null string from the input, so we can safely use that as our initial value. For the first line we read currentLine.equals(previousLine) will never be true, so it gets printed out and the program proceeds properly.

In C++ strings are values rather than references. If I declare a string

std::string str;

then str has a value. The value happens to be the empty string, but it’s still a value. What’s more, it’s perfectly possible that we might read empty strings from our input.

So what to do? Do we need to apply some special processing to the very first line, perhaps using a flag to indicate this is first time through?

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string { };
  bool first_line = true;

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line || first_line)
      out << current_line << "\n";

    previous_line = current_line;
    first_line = false;
  };
}

This works as we want, but that first_line flag really grates on me. It’s a wart.

Filthy but I’m ashamed to say I like it

When we write something like

auto str = std::string { "I am a string" };

we’re reaching back into C++'s deep past. C has no real string type, instead kind of bodging through with arrays of characters. A sequence of characters in double quotes, like "I am a string", is stored by C as an array of characters containing the characters of the string and, crucially, terminated with an extra '\0' character as an end of string marker. This behaviour is baked into the C language and library.

C++ preserves this behaviour and these functions - we couldn’t have C interop without that - and a C++ std::string can happily be initialised with a C style character array. However, a std::string isn’t just a simple wrapper around char*, it’s a little more sophisticated than that. Significantly, once it’s initialised and knows its own length, it does not rely on the '\0' character as a terminator. Even more than that, std::string is happy to contain multiple '\0' within its value.

Generally, when we create a std::string we’re initialising from a char*, copying another std::string, or just bringing up a default empty string. But std::string has a several other constructors, including

std::string(size_type count, char ch);

This constructs a string containing count copies of the character ch. std::string(5, 'a') would be, therefore, "aaaaa". Excitingly std::string(1, '\0') constructs a string containing a single '\0' character. This is something you wouldn’t ever ordinarily encounter, especially when, for example, reading text from an istream.

That’s a rather round-the-houses way of saying we can eliminate the first_line flag using our weird string of '\0'.

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string { 1, '\0' };

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line)
      out << current_line << "\n";

    previous_line = current_line;
  };
}

I can’t quite decide if this is a neat use of an intended feature, or a crime against code. Possibly both. It’s definitely well-formed C++, but it does feel kind of funky. If it moves you, in a pleasant or uncomfortable direction, I’d be delighted to hear from you.

References

  • The many constructors of std::string

  • Back in 1982, AT&T produced a video to promote Unix. A few minutes in Brian Kernighan, his feet up on his desk, demonstrates the power of pipelines by combining various programs, including sort and unique, to build a spell checker. It’s really quite a remarkable demo, and he explains it brilliantly. The whole film, which runs just shy of 30 minutes, is a fascinating historical artefact.

Source Code

The full source code, with tests and what-not, for unique, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed