Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Friday 23 April 2021 The Forest Road Reader, No 2.62 : STinC++ - unique

STinC++ rewrites the programs in Software Tools in Pascal using C++

One common reason for sorting is to bring together all occurrences of a particular item so they can be treated as a group. Sometimes this is done just to discard all but one occurrence in a group …​ It’s certainly easy to add an option to sort which discards multiple occurrences of an input line.

This is true …​

But should this code be part of sort at all?

Kernighan and Plauger argue, quite strongly, that in this case the answer is no. They don’t quite go as far as premature optimisation is the root of all evil, but they do say

Combining functions too early is a mistake.

They also lean heavily on the possibility of reuse.

By separating the function of stripping duplicates from that of sorting, we can do things which are not possible when they are combined.

I’m often wary of arguments about enabling reuse, but in this case it’s really relevant and what’s more actually likely to happen. We’re building tools, and tools can be combined in ways we can’t necessarily anticipate. By keeping our duplicate stripping functionality separate from sorting, we allow people to do things which might not be possible were they squished together in one package.

unique

PROGRAM

unique delete adjacent duplicate lines

USAGE

unique

FUNCTION

unique writes to its output only the first line from each group of adjacent identical input lines. It is most useful for text that has been sorted to bring identical lines together; in this case it passes through only unique instances of input lines.

EXAMPLE

To eliminate duplicate lines in the output of a program:

program | sort | unique
Stripping duplicates

We’re back in familiar filtering ground here. We’re reading a sequence of lines of text, doing some processing on each line, and then spitting the result back out. The processing here is removing duplicates, and that’s straightforward:

  • Read a line

  • If this line is different from the previous line, print it out

  • Loop back around until there’s nothing more to read

We do have a small bootstrapping problem though - what do we do for the very first line? What do we compare the first line we read with? In a language like, say, Java we can lean on null and do something like this

String previousLine = null;

while ( /* there are more lines */ ) {
  String currentLine = readLine(...);

  if (!currentLine.equals(previousLine)) {
    /* print out */
  }

  previousLine = currentLine;
}

In Java, we’re never going to read a null string from the input, so we can safely use that as our initial value. For the first line we read currentLine.equals(previousLine) will never be true, so it gets printed out and the program proceeds properly.

In C++ strings are values rather than references. If I declare a string

std::string str;

then str has a value. The value happens to be the empty string, but it’s still a value. What’s more, it’s perfectly possible that we might read empty strings from our input.

So what to do? Do we need to apply some special processing to the very first line, perhaps using a flag to indicate this is first time through?

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string { };
  bool first_line = true;

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line || first_line)
      out << current_line << "\n";

    previous_line = current_line;
    first_line = false;
  }
}

This works as we want, but that first_line flag really grates on me. It’s a wart.

Filthy but I’m ashamed to say I like it

When we write something like

auto str = std::string { "I am a string" };

we’re reaching back into C++'s deep past. C has no real string type, instead kind of bodging through with arrays of characters. A sequence of characters in double quotes, like "I am a string", is stored by C as an array of characters containing the characters of the string and, crucially, terminated with an extra '\0' character as an end of string marker. This behaviour is baked into the C language and library.

C++ preserves this behaviour and these functions - we couldn’t have C interop without that - and a C++ std::string can happily be initialised with a C style character array. However, a std::string isn’t just a simple wrapper around char*, it’s a little more sophisticated than that. Significantly, once it’s initialised and knows its own length, it does not rely on the '\0' character as a terminator. Even more than that, std::string is happy to contain multiple '\0' within its value.

Generally, when we create a std::string we’re initialising from a char*, copying another std::string, or just bringing up a default empty string. But std::string has several other constructors, including

std::string(size_type count, char ch);

This constructs a string containing count copies of the character ch. std::string(5, 'a') would be, therefore, "aaaaa". Excitingly std::string(1, '\0') constructs a string containing a single '\0' character. This is something you wouldn’t ever ordinarily encounter, especially when, for example, reading text from an istream.

That’s a rather round-the-houses way of saying we can eliminate the first_line flag using our weird string of '\0'.

auto const eof = std::char_traits<char>::eof();

void stiX::unique(std::istream& in, std::ostream& out) {
  auto previous_line = std::string(1, '\0');

  while(in.peek() != eof) {
    auto current_line = stiX::getline(in);

    if (previous_line != current_line)
      out << current_line << "\n";

    previous_line = current_line;
  }
}

I can’t quite decide if this is a neat use of an intended feature, or a crime against code. Possibly both. It’s definitely well-formed C++, but it does feel kind of funky. If it moves you, in a pleasant or uncomfortable direction, I’d be delighted to hear from you.

References

  • The many constructors of std::string

  • Back in 1982, AT&T produced a video to promote Unix. A few minutes in, Brian Kernighan, his feet up on his desk, demonstrates the power of pipelines by combining various programs, including sort and unique, to build a spell checker. It’s really quite a remarkable demo, and he explains it brilliantly. The whole film, which runs just shy of 30 minutes, is a fascinating historical artefact.

Source Code

The full source code, with tests and what-not, for unique, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrains' CLion, which I liked enough to buy a license for. I use it all the time now.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++

Wednesday 31 March 2021 The Forest Road Reader, No 2.61 : STinC++ - sorting large files

STinC++ rewrites the programs in Software Tools in Pascal using C++

After our little diversion through quick sort, we’re back again to tools, writing a sort utility for big files.

'Big' means more data than will fit in memory all at once; this is where life gets complicated. This kind of sorting is often called external sorting, because some of the data has to reside in temporary intermediate files. What we did in the previous section is by contrast called internal sorting.

The passage of time, Moore’s Law, and all that jazz means the big of 1981 is rather different to the big of today. The PDP-11/70 that Software Tools in Pascal was typeset on (and is, I assume, representative of what they wrote the programs for) had a maximum memory of 4Mb of RAM. The machine I’m typing this on now has 32Gb of RAM plus at least the same in swap, some 16,000 times more. I’m pretty sure a process on the PDP-11 couldn’t get anywhere near having 4Mb available, and of course the whole of my 32Gb isn’t available to a single process, but we’re clearly operating at different orders of magnitude and I’ll need a fairly large file to put this machine in any distress.

Perhaps the need for an external sort routine isn’t as pressing, but we can still write the code. The essence of external sorting is

  • Read as many lines as you can

  • Sort them

  • Write them to a new temporary file

  • Repeat until the input is exhausted

We now have a number of files, each of whose contents are sorted. Now we merge those files back together.

  • Read the first line of each file, remembering where each line came from

  • Sort those lines

  • Write out the first line

  • Read a replacement line from the file that line came from

  • Continue until every file is exhausted

Reading the input, sorting into intermediate files

Reading the input a chunk at a time is just a slight tweak on our previous code. Rather than reading until the input is exhausted, we read until it’s exhausted or we’ve read our desired number of lines.

using lines_t = std::vector<std::string>;

lines_t read_lines(std::istream& in, int max_lines) {
  auto lines = lines_t { };

  auto count = 0;
  while (in && count++ != max_lines)
    lines.emplace_back(stiX::getline(in));

  return lines;
}

Writing a bunch of lines back out feels like something we’ve mastered over the course of these articles.

std::filesystem::path write_lines(lines_t const& lines) {
  auto temp_file = new_working_filepath();
  write_lines(temp_file, lines);
  return temp_file;
}

void write_lines(std::filesystem::path const& filename, lines_t const& lines) {
  auto out = std::ofstream(filename);
  std::copy(
    lines.begin(),
    lines.end(),
    std::ostream_iterator<std::string>(out, "\n")
  );
}
Generating the sorted output

Once we have our list of files, they need to be merged together in the proper order to produce the sorted output. That’s a little more of a challenge.

Kernighan and Plauger, forever hampered by Pascal’s lack of a string type, read each file’s lines into a known part of a fixed-size buffer, which they index into to sort. If their lowest-ordered string is at index 3, say, they know to read the replacement line from their third file.

Here and now, we have an actual string type which has served us perfectly well so far. If only it could carry along the file it came from. Well, a std::string can’t, of course, but one of C++'s raisons d’être is letting us define our own composite types.

Having stashed the input into files, let’s start by opening all those files for reading.

  auto working_files = read_to_files(std::cin);
  auto file_sources = std::vector<std::ifstream> { };

  for (auto const& working_file : working_files)
    file_sources.emplace_back(working_file); (1)
  1. emplace_back constructs an object in place. It takes as its arguments the constructor arguments of the object being created. Here, we’re constructing std::ifstream objects, so we can just pass in the name of the file we want to read from.

Now let’s read the first line of each of those files.

  auto merge_sources = std::deque<merge_source> { }; (1)
  for (auto& file_source: file_sources)
    merge_sources.emplace_back(file_source);
  1. merge_source pairs up our lines with the files they come from. We’ll get to it in a moment.

And merge them together …​

  while(!merge_sources.empty()) {
    std::sort(merge_sources.begin(), merge_sources.end());

    auto first = merge_sources.begin();
    std::cout << first->line() << "\n";

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

We don’t need to close the files, as that will be done for us when the file_sources collection goes out of scope, but we do need to delete those files now we know longer need them.

  for (auto const& working_file : working_files)
    std::filesystem::remove(working_file);
The merge_source class

As is often the case with software, a lot of the work is hidden from view. The work of reading the input files is wrapped up in the merge_source class.

class merge_source {
public:
  explicit merge_source(std::istream& str) : stream_(&str) {
    next();
  }

  std::string const& line() const { return line_; }
  bool next() { (1)
    std::getline(*stream_, line_);
    return stream_->good();
  }

private:
  std::istream* stream_; (2)
  std::string line_;
};

merge_source pairs up the line we’ve read with the file it came from.

  1. When we call next() it reads a line from the file, returning true if the read was successful. For a general file reader this code wouldn’t be sufficient, but because we’ve just written the files we’re reading back, we know the last line of the file is empty. Hence if the stream is exhausted, we must be at the end of the file and stream_->good() will be false.

  2. We hold a pointer to the stream because we need to be able to swap merge_source objects around when we sort them.

We make merge_source sortable by providing an operator <.

bool operator <(const merge_source &a, const merge_source &b) {
  return a.line() < b.line();
}

In the merge loop, when a source is finished we discard it.

  auto merge_sources = std::deque<merge_source> { };

  while(!merge_sources.empty()) {
    // ...

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

Eventually, all the sources will be discarded, causing the loop to terminate. The merge_sources collection is a deque because it allows us to pop objects off the front. I can’t remember the last time I used a deque, but it’s good to have it there when you need it.

You can stress it if you try

Read a bunch of lines, sort them, stuff them in a file, carry on until you can’t read any more, merge the files back together. Would that life were so simple.

The code presented above will create as many intermediate files as it needs. It turns out if you try hard enough, say by creating a foolishly large input and configuring your external sort to place only a single line in each intermediate file, you can create so many intermediate files that you aren’t able to open them all at the same time when you want to read them back.

So what to do? As Kernighan and Plauger point out, if you want to limit the number of intermediate files to, say, N, well, you’ve just written code to merge a bunch of files together. If you interleave merging with reading and creating the intermediate files, you’ll never need more than N+1 files for the duration of the sort.

working_files_t read_to_files(std::istream& in) {
  auto working_files = working_files_t { };

  while (in) {
    while (in && working_files.size() < merge_order) { (1)
      auto lines = read_lines(in, max_lines_to_read);

      std::sort(lines.begin(), lines.end());

      auto new_file = write_lines(lines);
      working_files.push_back(new_file);
    }

    if (in) { (2)
      auto merge_filename = new_working_filepath();
      auto merge_file = std::ofstream{merge_filename};
      merge_files(merge_file, working_files);

      working_files.clear(); (3)
      working_files.push_back(merge_filename);
    }
  }

  return working_files;
}
  1. merge_order is the maximum number of intermediate files we want to create.

  2. If there’s more input available, we merge the intermediate files. The if here isn’t strictly necessary as it wouldn’t really matter if we always merged, but this check is a small speed optimization that avoids merging everything only to immediately read back the file and write it to the console.

  3. The call to merge_files has deleted the intermediate working files, leaving us with the merged file. It becomes the first of our new set of working files.

It’s not really about sorting

Despite this chapter’s diversions into different types of sort, Software Tools in Pascal isn’t a book about algorithms, it’s a book about software development. The key lesson here is the technique of using disk space to compensate for lack of memory. The sorting aspect, while a relatable enough application, is kind of by-the-by and it’s best if we don’t examine that too deeply.

Sorting text can be incredibly complicated, and byte-by-byte comparisons are adequate at best and absolutely useless at worst. For many applications, we need to consider whole words and ideally what those words represent. When we sort names, for example, we might consider Mc and Mac prefixes to be equivalent. Assuming we’re working in English.

Different written languages have different rules for alphabetical order. Even without stepping outside Latin script, the idea of what constitutes a letter changes between languages - the Welsh ll and the Dutch ij are single letters even though they consist of two characters. It might even depend on when you’re sorting the text. Alphabets change, and perhaps more often than you think. For example, the Spanish alphabet changed in 2010 to remove ch and ll as letters in their own right. To make things even more complex, how you sorted ch and ll as distinct from c and l was changed in 1994, even though they were still letters in their own right at the time.

Text is complicated. If at all possible, let someone else handle it for you.

Source Code

Source code for external sort, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrains' CLion, which I liked enough to buy and renew a license for.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++

Monday 08 March 2021 First ice cream van of the year …​

A longitudinal study

First confirmed ice cream van of 2021 was, for a remarkable third year in a row, Tony’s Ices on Grove Avenue at 16:52 on the 8th of March. No customers, despite the lovely sunny weather, but would you buy from a bloke without a mask on at this particular point in human history?

8th March sits bang in the middle of the arrival curve, strengthening the second week of March as the ideal time to start hunting out that first van of spring.

Field observations

Analysis

First Ice Cream Van Of The Year 2004-2021 [charts]

The full ice cream van data is available as a spreadsheet.

Previous Years


Tagged icecream