Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Wednesday 31 March 2021 The Forest Road Reader, No 2.61 : STinC++ - sorting large files

STinC++ rewrites the programs in Software Tools in Pascal using C++

After our little diversion through quick sort, we’re back again to tools, writing a sort utility for big files.

'Big' means more data than will fit in memory all at once; this is where life gets complicated. This kind of sorting is often called external sorting, because some of the data has to reside in temporary intermediate files. What we did in the previous section is by contrast called internal sorting.

The passage of time, Moore’s Law, and all that jazz means the big of 1981 is rather different to the big of today. The PDP-11/70 that Software Tools in Pascal was typeset on (and is I assume representative of what they wrote the programs for) had a maximum memory of 4Mb of RAM. The machine I’m typing this on now has 32Gb of RAM plus at least the same in swap, some 16,000 times more. I’m pretty sure a process on the PDP-11 could get anywhere near having 4Mb available, and of course the whole of my 32Gb isn’t available to a single process, but we’re clearly operating at different orders of magnitude and I’ll need a fairly large file to put this machine in any distress.

Perhaps the need for an external sort routine isn’t as pressing, but we can still write the code. The essence of external sorting is

  • Read as many lines as you can

  • Sort them

  • Write them to a new temporary file

  • Repeat until the input is exhausted

We now have a number of files, whose contents is sorted. Now we merge those files back together.

  • Read the first line of each file, remembering where each line came from

  • Sort those lines

  • Write out the first line

  • Read a replacement line from the file that line came from

  • Continue until we’ve read the contents of every file are empty

Reading the input, sorting into intermediate files

Reading the input a chunk at a time is just a slight tweak on our previous code. Rather than reading until the input is exhausted, we read until it’s exhausted or we’ve read our desired number of lines.

using lines_t = std::vector<std::string>;

lines_t read_lines(std::istream& in, int max_lines) {
  auto lines = lines_t { };

  auto count = 0;
  while (in && count++ != max_lines)
    lines.emplace_back(stiX::getline(in));

  return lines;
}

Writing to a bunch of lines back out feels like something we’ve mastered over the course of these articles.

std::filesystem::path write_lines(lines_t const& lines) {
  auto temp_file = new_working_filepath();
  write_lines(temp_file, lines);
  return temp_file;
}

void write_lines(std::filesystem::path const& filename, lines_t const& lines) {
  auto out = std::ofstream(filename);
  std::copy(
    lines.begin(),
    lines.end(),
    std::ostream_iterator<std::string>(out, "\n")
  );
}
Generating the sorted output

Once we have our list of files, they need to be merged together in the proper order to produce the sorted output. That’s a little more of a challenge.

Kernighan and Plauger, forever hampered by Pascal’s lack of a string class read each file’s lines into a known part of a fixed sized buffer, which they index into to sort. If their lowest order string is at index 3, say, they know to read the replacement lien from their third file.

Here and now, we have an actual string type which has served us perfectly well so far. If only it could carry along the file it came from. Well, a std::string can’t, of course, but one of C++'s raison d’etre is defining our own composite types.

Having stashed the input into files, let’s start by opening all those files for reading.

  auto working_files = read_to_files(std::cin);
  auto file_sources = std::vector<std::ifstream> { };

  for (auto const& working_file : working_files)
    file_sources.emplace_back(working_file); (1)
  1. emplace_back constructs an object in place. It takes as its arguments the constructor arguments of the object being created. Here, we’re constructing std::ifstream objects, so we can just pass in the name of the file to we want to read from.

Now let’s read the first line of each of those files.

  auto merge_sources = std::deque<merge_source> { }; (1)
  for (auto& file_source: file_sources)
    merge_sources.emplace_back(file_source);
  1. merge_source pairs up our lines with the files the come from. We’ll get to it in a moment.

And merge them together …​

  while(!merge_sources.empty()) {
    std::sort(merge_sources.begin(), merge_sources.end());

    auto first = merge_sources.begin();
    std::cout << first->line() << "\n";

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

We don’t need to close the files, as that will be done for us when the file_sources collection goes out of scope, but we do need to delete those files now we know longer need them.

  for (auto const& working_file : working_files)
    std::filesystem::remove(working_file);
The merge_source class

As is often the case with software, a lot of the work is hidden from view. The work of reading the input files is wrapped up in the merge_source class.

class merge_source {
public:
  explicit merge_source(std::istream& str) : stream_(&str) {
    next();
  }

  std::string const& line() const { return line_; }
  bool next() { (1)
    std::getline(*stream_, line_);
    return stream_->good();
  }

private:
  std::istream* stream_; (2)
  std::string line_;
};

merge_source pairs up the line we’ve read with the file it came from.

  1. When we call next() it reads a line from the file, returning true if the read was successful. For a general file reader this code wouldn’t be sufficient, but because we’ve just written the files we’re reading back, we know the last line of the file is empty. Hence if the stream is exhausted, we must be at the end of the file and stream_→good() will be false.

  2. We hold a pointer to the stream because we need to be able to swap merge_source objects around when we sort them.

We make merge_source sortable by providing an operator <

bool operator <(const merge_source &a, const merge_source &b) {
  return a.line() < b.line();
}

In the merge loop, when a source is finished we discard it.

  auto merge_sources = std::deque<merge_source> { };

  while(!merge_sources.empty()) {
    // ...

    if (!first->next())
      merge_sources.pop_front();  // file is empty
  }

Eventually, all the sources will be discarded causing the loop to terminate. The merge_sources collection is a deque because it allows us to pop objects off the front. I can’t remember the last time I used a deque, but it’s good to have it there when you need it.

You can stress it if you try

Read a bunch lines, sort them, stuff in a file, carry on until you can’t read any more, merge the files back together. Would that life were so simple.

The code presented above will create as many intermediate files as it needs. It turns out if you try hard enough, say by creating a foolishly large input and configuring your external sort to place only a single line in each intermediate, you can create so many intermediate files that you aren’t able to open them all at the same time when you want to read them back.

So what do to? As Kernighan and Plauger point out, if you want to limit the number of intermediate files to, say, N, well you’ve just written code to merge a bunch of files together. If you interleave merging with reading and creating the intermediate files, you’ll never need more than N+1 files for the duration of the sort.

working_files_t read_to_files(std::istream& in) {
  auto working_files = working_files_t { };

  while (in) {
    while (in && working_files.size() < merge_order) { (1)
      auto lines = read_lines(in, max_lines_to_read);

      std::sort(lines.begin(), lines.end());

      auto new_file = write_lines(lines);
      working_files.push_back(new_file);
    };

    if (in) { (2)
      auto merge_filename = new_working_filepath();
      auto merge_file = std::ofstream{merge_filename};
      merge_files(merge_file, working_files);

      working_files.clear(); (3)
      working_files.push_back(merge_filename);
    }
  }

  return working_files;
}
  1. merge_order is the maximum number of intermediate files we want to create.

  2. If there’s more input available, we merge the intermediate files. The if here isn’t strictly necessary as it wouldn’t really matter if we always merged, but this check is a small speed optimization that avoids merging everything only to immediately read back the file and write it to the console.

  3. The call to merge_files has deleted the intermediate working files, leaving us with the merged file. It becomes the first of our new set of working files.

It’s not really about sorting

Despite this chapter’s diversions into different types of sort, Software Tools in Pascal isn’t a book about algorithms, it’s a book about software development. The key lesson here is the technique of using disk space to compensate for lack of memory. The sorting aspect, while a relatable enough application, is kind of by-the-by and it’s best if we don’t examine that too deeply.

Sorting text can be incredibly complicated, and byte by byte comparisions are adequate at best while at worst are absolutely useless. For many applications, we need to consider whole words and ideally what those words represent. When we sort names, for example, we might consider Mc or Mac prefixes to be equivalent. Assuming we’re working English.

Different written languages have different rules for alphabetical orders. Even without stepping outside Latin script, the idea of what constitutes a letter changes between languages - the Welsh ll and the Dutch ij are single letters even though they consist of two characters. It might even depend on when you’re sorting the text. Alphabets change, and perhaps more often that you think. For example, the Spanish alphabet changed in 2010 to remove ch and ll as letters in their own right. To make things even more complex, how you sorted ch and ll as distinct from c and l was changed in 1994 even though they were still letters in their own right at the time.

Text is complicated. If at all possible, let someone else handle it for you.

References

Source Code

Source code for external sort, and indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and renew a license for.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it is my first choice for future work.


Tagged code, and software-tools-in-c++

Monday 08 March 2021 First ice cream van of the year …​

A longitudinal study

First confirmed ice cream van of 2021 was, for a remarkable third year in a row, Tony’s Ices on Grove Avenue at 16:52 on the 8th of March. No customers, despite the lovely sunny weather, but would you buy from a bloke without a mask on at this particular point in human history?

8th March sits bang in the middle of the arrival curve, strengthing the second week of March as the ideal time to start hunting out that first van of spring.

Field observations

Analysis

First Ice Cream Van Of The Year 2004-2021
First Ice Cream Van Of The Year 2004-2021
First Ice Cream Van Of The Year 2004-2021

The full ice cream van data is available as a spreadsheet.

Previous Years


Tagged icecream

Thursday 31 December 2020 The Forest Road Reader, No 2.60 : STinC++ - quick sort, part 2

STinC++ rewrites the programs in Software Tools in Pascal using C++

When I headed off to bed last night, our quick sort implementation looked pretty good.

template<class Iterator>
void quick_sort(Iterator begin, Iterator end) {
  if(std::distance(begin, end) <= 1)
    return;

  auto pivot = *begin;
  auto left = std::next(begin);
  auto right = std::prev(end);

  while (left != right) {
    while ((left != right) && (*left < pivot))
      left = std::next(left);

    while ((left != right) && (*right >= pivot))
      right = std::prev(right);

    if (left != right)
      std::iter_swap(left, right);
  }

  if (*left >= pivot)
    left = std::prev(left);
  std::iter_swap(left, begin);

  quick_sort(begin, left);
  quick_sort(right, end);
} // quick_sort

Today I want to fill that out so we can pass in a custom comparator, extending the interface to match std::sort [1].

template<class RandomIt, class Compare>
void sort(RandomIt first, RandomIt last, Compare comp);

I’m still on my laptop, but at the dining table this morning. I have a cup of coffee on my right, and a teenager who might ask me to help with his A-level physics on my left. I’m all set.

Let’s start by getting the default comparator, std::less, into place. If the tests all still run - and, you know, they should - that’s a happy start.

Add it to the function definition first.

template<class Iterator, class Comparator = std::less<>>
void quick_sort(Iterator begin, Iterator end, Comparator comparator = Comparator()) {
  ...

And then hook it up in the body of the function, where

  while ((left != right) && (*left < pivot))
    left = std::next(left);

  while ((left != right) && (*right >= pivot))
    right = std::prev(right);

becomes

  while ((left != right) && comparator(*left, pivot))
    left = std::next(left);

  while ((left != right) && !comparator(*right, pivot))
    right = std::prev(right);

Tests pass. No A-level questions. Looking good so far.

The tests I dropped in yesterday included a set with a custom comparator. I’ll enable those and see what happens.

Sorts up, but not down

Sigh

Oh wait. I’m a twit.

  if (*left >= pivot)
    left = std::prev(left);
  std::iter_swap(left, begin);

The pivot point shuffling is no longer correct now we’ve got a comparator.

  if (!comparator(*left, pivot))
    left = std::prev(left);
  std::iter_swap(left, begin);

That helps a bit, but it wasn’t a miracle cure.

Sorts a bit more down. 10 of 17 tests now passing."
Look, Then Look Again

When I added the comparator parameter, I defaulted the value. I wanted to make the smallest possible change I could, because if I’ve learned anything over the past thirty years it’s the less I type the fewer bugs I put in the code.

Sometimes though, not typing enough will also put bugs in. Here’s one of the test failures.

---------------------------------------------------------------------------
Chapter 4 - quick sort
  reverse sort { 3, 10, 9, 12, 8, 4, 1, 6, 5, 13, 2, 7, 11 }
---------------------------------------------------------------------------
/home/jez/work/jezuk/stiX/c++/chapter_4/0_sorts/quick_sort.cpp:81
...........................................................................

/home/jez/work/jezuk/stiX/c++/chapter_4/0_sorts/quick_sort.cpp:88: FAILED:
  REQUIRE( under_test == expected )
with expansion:
  { 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 3, 1, 2 }
  ==
  { 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 }

If we momentarily set aside the last three values, this sequence is beautifully sorted. In ascending order. Where’s that less-than comparator coming from? It’s coming from exactly where I asked the compiler to add it for me.

  quick_sort(begin, left);
  quick_sort(right, end);

Quick sort is a recursive algorithm. If there’s a custom comparator passed in at the top, I need to pass that on down as we recurse.

  quick_sort(begin, left, comparator);
  quick_sort(right, end, comparator);
/home/jez/work/jezuk/stiX/c++/cmake-build-debug/0_quick_sort
===========================================================================
All tests passed (28 assertions in 1 test case)

Process finished with exit code 0

Hurrah. Tests - they save you from yourself.

Quick Sort

template<class Iterator, class Comparator = std::less<>>
void quick_sort(Iterator begin, Iterator end, Comparator comparator = Comparator()) {
  if(std::distance(begin, end) <= 1)
    return;

  auto pivot = *begin;
  auto left = std::next(begin);
  auto right = std::prev(end);

  while (left != right) {
    while ((left != right) && comparator(*left, pivot))
      left = std::next(left);

    while ((left != right) && !comparator(*right, pivot))
      right = std::prev(right);

    if (left != right)
      std::iter_swap(left, right);
  }

  if (!comparator(*left, pivot))
    left = std::prev(left);
  std::iter_swap(left, begin);

  quick_sort(begin, left, comparator);
  quick_sort(right, end, comparator);
} // quick_sort

This is, as far as my tests are telling me, a complete quick sort implementation. There’s more that could be done, especially around selecting a good pivot value. Just grabbing the first value works, but for sequences which are already sorted it’s the worst case, and it’s not great for partially sorted sequences. Robert Sedgewick, who wrote his PhD on quick sort and whose Algorithms is pretty much the standard text, proposes the median of the first, middle, and last element.

I am leaving all that good stuff to the library implementers. This is an exercise, and I am exercised. See you in the New Year, friends.

Software Tools in Pascal’s code

The code presented in the book was contributed by Doug McIlroy, inventor of the Unix pipeline, author of any number of Unix utilities of precisely the kind we’re exploring here, and general all-around computing big brain.

{ quick -- quicksort for lines }
procedure quick (var linepos : posbuf; nlines : pos; var linebuf : charbuf);
{ rquick -- recursive quicksort }
procedure rquick (lo, hi: integer);
var
  i, j : integer;
  pivline : charpos;
begin
  if (lo < hi) then begin
    i := lo;
    j := hi;
    pivline := linepos[j];  { pivot line }
    repeat
      while (i < j)
        and (cmp(linepos[i],pivline,linebuf) <= 0) do
          i := i + 1;
      while (j > i)
        and (cmp(linepos[j],pivline,linebuf) >= 0) do
          j := j - 1
      if (i < j) then     { out of order pair }
        exchange(linepos[i], linepos[j])
    until (i >= j)
    exchange(linepos[i], linepos[hi]); { move pivot to i }
    if (i - lo < hi -i) then begin
      rquick(lo, i-1);
      rquick(i+1, hi)
    end
    else begin
      rquick(i+1, hi);
      rquick(lo, i-1)
    end
  end
end;
begin
  rquick(1, nlines)
end;

Now I’ve written my own implementation, this is reassuring familiar.

I think the reason McIlroy can swap the pivot into place without adjustment is because they use the rightmost value as the pivot [2].

Kernighan and Plauger don’t explain the little if at the bottom selecting which order to recurse into the two subsequences, but I believe it’s to minimise the call stack. By recursing into the smaller subsequence first, the depth of the call tree is kept to its minimum. This isn’t something we would think about now, except in exceptionally constrained environments. However, as I was reminded by John Cowan about this time last year, by modern standards the machines Kernighan and Plauger used were constrained environments. There wasn’t a lot of memory to chuck around, so code size was important and stack space was limited. Working within those boundaries was a natural and normal thing to do.

Source Code

Source code for this little routine, indeed the whole project, is available in the stiX GitHub repository.

Endnotes

This whole endeavour relies on Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are, frankly, just ridiculously expensive. Happily, there are plenty of second-hand copies floating round, or you can borrow them from The Internet Archive using the links above.

For this project I’ve been using JetBrain’s CLion, which I liked enough to buy and then renew a license for.

The test harness I’m using is Catch. I’ve been aware of Catch pretty much since it was first released, but this project is the first time really getting in and using it. I like it, and it’s what I use by default now.


2. Worked a couple of examples by hand, and this is the reason. Index i ends up pointing to a value equal to or greater than the pivot. The pivot in on the right, so the value being swapped ends up in the proper subsequence

Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Twitter
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed