Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Wednesday 31 March 2021 The Forest Road Reader, No 2.61 : STinC++ - sorting large files

STinC++ rewrites the programs in Software Tools in Pascal using C++

After our little diversion through quick sort, we’re back again to tools, writing a sort utility for big files.

'Big' means more data than will fit in memory all at once; this is where life gets complicated. This kind of sorting is often called external sorting, because some of the data has to reside in temporary intermediate files. What we did in the previous section is by contrast called internal sorting.

The passage of time, Moore’s Law, and all that jazz means the big of 1981 is rather different to the big of today. The PDP-11/70 that Software Tools in Pascal was typeset on (and is I assume representative of what they wrote the programs for) had a maximum memory of 4Mb of RAM. The machine I’m typing this on now has 32Gb of RAM plus at least the same in swap, some 16,000 times more. I’m pretty sure a process on the PDP-11 could get anywhere near having 4Mb available, and of course the whole of my 32Gb isn’t available to a single process, but we’re clearly operating at different orders of magnitude and I’ll need a fairly large file to put this machine in any distress.

Perhaps the need for an external sort routine isn’t as pressing, but we can still write the code. The essence of external sorting is

  • Read as many lines as you can

  • Sort them

  • Write them to a new temporary file

  • Repeat until the input is exhausted

We now have a number of files, whose contents is sorted. Now we merge those files back together.

  • Read the first line of each file, remembering where each line came from

  • Sort those lines

  • Write out the first line

  • Read a replacement line from the file that line came from

  • Continue until we’ve read the contents of every file are empty

Reading the input, sorting into intermediate files

Reading the input a chunk at a time is just a slight tweak on our previous code. Rather than reading until the input is exhausted, we read until it’s exhausted or we’ve read our desired number of lines.

using lines_t = std::vector<std::string>;

lines_t read_lines(std::istream& in, int max_lines) {
  auto lines = lines_t { };

  auto count = 0;
  while (in && count++ != max_lines)
    lines.emplace_back(stiX::getline(in));

  return lines;
}

Writing to a bunch of lines back out feels like something we’ve mastered over the course of these articles.

std::filesystem::path write_lines(lines_t const& lines) {
  auto temp_file = new_working_filepath();
  write_lines(temp_file, lines);
  return temp_file;
}

void write_lines(std::filesystem::path const& filename, lines_t const& lines) {
  auto out = std::ofstream(filename);
  std::copy(
    lines.begin(),
    lines.end(),
    std::ostream_iterator<std::string>(out, "\n")
  );
}
Generating the sorted output

Once we have our list of files, they need to be merged together in the proper order to produce the sorted output. That’s a little more of a challenge.

Kernighan and Plauger, forever hampered by Pascal’s lack of a string class read each file’s lines into a known part of a fixed sized buffer, which they index into to sort. If their lowest order string is at index 3, say, they know to read the replacement lien from their third file.

Here and now, we have an actual string type which has served us perfectly well so far. If only it could carry along the file it came from. Well, a std::string can’t, of course, but one of C++'s raison d’etre is defining our own composite types.

Having stashed the input into files, let’s start by opening all those files for reading.

  auto working_files = read_to_files(std::cin);
  auto file_sources = std::vector<std::ifstream> { };

  for (auto const& working_file : working_files)
    file_sources.emplace_back(working_file); (1)
  1. emplace_back constructs an object in place. It takes as its arguments the constructor arguments of the object being created. Here, we’re constructing std::ifstream objects, so we can just pass in the name of the file to we want to read from.

Now let’s read the first line of each of those files.

  auto merge_sources = std::deque<merge_source> { }; (1)
  for (auto& file_source: file_sources)
    merge_sources.emplace_back(file_source);
  1. merge_source pairs up our lines with the files the come from. We’ll get to it in a moment.

And merge them together …​

  while(!merge_sources.empty()) {
    std::sort(merge_sources.begin(), merge_sources.end());

    auto first = merge_sources.begin();
    std::cout << first->line() << "\n";

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

We don’t need to close the files, as that will be done for us when the file_sources collection goes out of scope, but we do need to delete those files now we know longer need them.

  for (auto const& working_file : working_files)
    std::filesystem::remove(working_file);
The merge_source class

As is often the case with software, a lot of the work is hidden from view. The work of reading the input files is wrapped up in the merge_source class.

class merge_source {
public:
  explicit merge_source(std::istream& str) : stream_(&str) {
    next();
  }

  std::string const& line() const { return line_; }
  bool next() { (1)
    std::getline(*stream_, line_);
    return stream_->good();
  }

private:
  std::istream* stream_; (2)
  std::string line_;
};

merge_source pairs up the line we’ve read with the file it came from.

  1. When we call next() it reads a line from the file, returning true if the read was successful. For a general file reader this code wouldn’t be sufficient, but because we’ve just written the files we’re reading back, we know the last line of the file is empty. Hence if the stream is exhausted, we must be at the end of the file and stream_→good() will be false.

  2. We hold a pointer to the stream because we need to be able to swap merge_source objects around when we sort them.

We make merge_source sortable by providing an operator <

bool operator <(const merge_source &a, const merge_source &b) {
  return a.line() < b.line();
}

In the merge loop, when a source is finished we discard it.

  auto merge_sources = std::deque<merge_source> { };

  while(!merge_sources.empty()) {
    // ...

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

Eventually, all the sources will be discarded causing the loop to terminate. The merge_sources collection is a deque because it allows us to pop objects off the front. I can’t remember the last time I used a deque, but it’s good to have it there when you need it.

You can stress it if you try

Read a bunch lines, sort them, stuff in a file, carry on until you can’t read any more, merge the files back together. Would that life were so simple.

The code presented above will create as many intermediate files as it needs. It turns out if you try hard enough, say by creating a foolishly large input and configuring your external sort to place only a single line in each intermediate, you can create so many intermediate files that you aren’t able to open them all at the same time when you want to read them back.

So what do to? As Kernighan and Plauger point out, if you want to limit the number of intermediate files to, say, N, well you’ve just written code to merge a bunch of files together. If you interleave merging with reading and creating the intermediate files, you’ll never need more than N+1 files for the duration of the sort.

working_files_t read_to_files(std::istream& in) {
  auto working_files = working_files_t { };

  while (in) {
    while (in && working_files.size() < merge_order) { (1)
      auto lines = read_lines(in, max_lines_to_read);

      std::sort(lines.begin(), lines.end());

      auto new_file = write_lines(lines);
      working_files.push_back(new_file);
    };

    if (in) { (2)
      auto merge_filename = new_working_filepath();
      auto merge_file = std::ofstream{merge_filename};
      merge_files(merge_file, working_files);

      working_files.clear(); (3)
      working_files.push_back(merge_filename);
    }
  }

  return working_files;
}
  1. merge_order is the maximum number of intermediate files we want to create.

  2. If there’s more input available, we merge the intermediate files. The if here isn’t strictly necessary as it wouldn’t really matter if we always merged, but this check is a small speed optimization that avoids merging everything only to immediately read back the file and write it to the console.

  3. The call to merge_files has deleted the intermediate working files, leaving us with the merged file. It becomes the first of our new set of working files.

It’s not really about sorting

Despite this chapter’s diversions into different types of sort, Software Tools in Pascal isn’t a book about algorithms, it’s a book about software development. The key lesson here is the technique of using disk space to compensate for lack of memory. The sorting aspect, while a relatable enough application, is kind of by-the-by and it’s best if we don’t examine that too deeply.

Sorting text can be incredibly complicated, and byte by byte comparisions are adequate at best while at worst are absolutely useless. For many applications, we need to consider whole words and ideally what those words represent. When we sort names, for example, we might consider Mc or Mac prefixes to be equivalent. Assuming we’re working English.

Different written languages have different rules for alphabetical orders. Even without stepping outside Latin script, the idea of what constitutes a letter changes between languages - the Welsh ll and the Dutch ij are single letters even though they consist of two characters. It might even depend on when you’re sorting the text. Alphabets change, and perhaps more often that you think. For example, the Spanish alphabet changed in 2010 to remove ch and ll as letters in their own right. To make things even more complex, how you sorted ch and ll as distinct from c and l was changed in 1994 even though they were still letters in their own right at the time.

Text is complicated. If at all possible, let someone else handle it for you.

References

Source Code

Source code for external sort, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and renew a license for.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed