Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

STinC++ - Format, a typesetter of my own : Yr glygfa o Hescwm Uchaf, rhif 14

STinC++ rewrites the programs in Software Tools in Pascal using C++

If you’ve been here for any length of time, you’ll be aware of my mild obsession with Brian Kernighan’s typesetting. In addition to the various his various books, including, of course, Software Tools in Pascal, he even typeset his own PhD thesis using a program he wrote in Fortran. Indeed, Software Tools in Pascal's precursor, simply Software Tools, used Fortran as its implementation language. More precisely, it used Ratfor, a language which layered structured programming control structures (eg if/else, for loops, etc) over Fortran. The original version of Ratfor was developed by Kernighan in order to write the typesetter with which he produced his thesis. Even in the 60s, people knew to solve problems with an extra level of indirection.

Having spend chapter 6 (and in my own case nearly two years) developing a text editor, chapter 7 of Software Tools in Pascal opens with a brisk “Our next task is to write a text formatter”. I feel like at this point in the book, Brian Kernighan is very firmly at the keyboard. A text formatter is “a program for neatly formatting a document on a suitable printer”.

Today we would tend to think of formatting on screen as the default, and would perhaps talk about markup rather than formatting or typesetting, with print formatting as something that comes for free as a bonus[1] Often we even format by going from markup language to another, which is in turned rendered on screen. This blog, for example, is marked up using AsciiDoc, from which I generate the HTML served up to your browser, which has done a lovely job of displaying it on screen.

The format program Kernighan and Plauger describe is a very lightweight version of nroff, the “new runoff” text formatter. It still exists, as GNU roff groff, a piece of software that traces its lineage back beyond even those early days of UNIX at Bell Labs. It’s entirely likely you’ve never directly used groff, but if you’ve looked at a man page, then it was probably there in the background. That just scratches the surface of its capabilities, and it’s entirely possible to produce camera-ready copy for entire books using groff.[2]

Kernighan and Plauger, as I’ve noted before, were working at a time when business was paper-based. Typesetting patent applications was a key part of the business case for continuing work on UNIX.

Machine formatting eases the typing job, since margin alignment, centering, underlining and similar tedious operations are handled by the computer, not by the typist. It also permits drastic format changes in a document without altering any text. But perhaps most important, it seems to encourage writers to improve their product, since the overhead of making an improvement is small and there is an esthetic satisfaction in having a clean copy.

Plus ça change, plus c’est la meme chose and all that.

Kernighan and Plauger describe format's markup as “quite conventional”

It accepts text to be formatted, interspersed with formatting commands telling format what the output is to look like. A command consists of a period, a two-letter name, and perhaps some optional information. Each command must appear at the beginning of a line, with nothing on the like but the command and its arguments. For instance,

.ce

centers the next line of output, and

.sp 3

generates three spaces (blank lines).

format allows for changing the page width and length, setting headers and footers with optional page numbers, line and page breaks, centering text, indenting, underlining, and whether the text should be right justified or not.

PROGRAM

format produce formatted output

USAGE

edit

FUNCTION

format reads its input a line at a time and writes a neatly formatted version of the input text, with page headers and footers and with output lines filled to a uniform right margin. Input text line may have interspersed among them command lines that alter the default mode of formatting. A command line consists of a leading period, followed by a two letter code, possibly with optional arguments following the first sequence of blanks and tabs.

Certain commands cause a "break" in the processing of the input text lines, i.e., any partially filled line is output and a new line is begun. In the following command summary, the letter n stands for an optional numeric argument. If a numeric argument is preceded by a + or -, the current value is changed by this amount; otherwise the argument represents the new value. If no argument is given, the default value is used.

command

break?

default

function

.bp n

yes

n=+1

begin page numbered n

.br

yes

cause break

.ce n

yes

n=1

center next n lines

.fi

yes

start filling

.fo str

no

empty

footer title

.he str

no

empty

header title

.in n

no

n=0

indent n spaces

.ls n

no

n=1

line spacing is n

.nf

yes

stop filling

.pl n

no

n=66

set page length to n

.rm n

no

n=60

set right margin to n

.sp n

yes

n=1

space down n lines or to bottom of page

.ti n

yes

n=0

temporary indent of n

.ul n

no

n=1

underline words from next n lines

A blank input line causes a break and is passed to the output unchanged. Similarly, an input line that begins with blanks causes a break and is written to the output with the leading blanks preserved. Thus a document formatted in the conventional manner by hand will retain its original paragraph breaks and indentation.

Writing format

From the description, you’ve probably already got the rough shape of the code in your head. Read the input a line at a time, see if it’s a command or text, and react accordingly. And, yes, that parts pretty straightforward.

format()
void stiX::screen_formatter::format() {
  while (in_.peek() != eof) {
    auto line = stiX::getline(in_);

    if (is_command(line))
      handle_command(line);
    else
      handle_text(line);
  }

  page_end();
}

In a way, we’ve already been here before with print. print doesn’t apply any formatting, but it does break text files up into pages, adding headers and footers at the top and bottom of each page. I think it would have been entirely valid to start with print and work in from pages, down to the line formatting.

I didn’t start there, though. I worked inside out, initially just taking plain text in and producing format's "filled text". This is actually quite a large step. We need to buffer the input, checking to see if we should wrap, then apply any justifying we might need before outputting.

buffer_line
void stiX::screen_formatter::buffer_line(std::string const& line) {
  for (auto const& word: split_into_words(line)) {
    if (count_width(buffer_) + word.width >= fillable_width())
      fill_and_flush();

    buffer_ += buffer_.empty() ? null : space;
    buffer_ += word.word;
  }
}

From there, I worked backwards to unfilled text, and being able to switch from one to the other on demand.

switching to fill and no fill modes
void stiX::screen_formatter::nf_no_fill() {
  set_fill_mode(false);
}
void stiX::screen_formatter::fi_fill_on() {
  set_fill_mode(true);
}
void stiX::screen_formatter::set_fill_mode(bool on) {
  flush();
  fill_ = on;
  set_output_mode();
}
void stiX::screen_formatter::set_output_mode() {
  output_mode_ = fill_
    ? &screen_formatter::buffer_line
    : &screen_formatter::print_line;
}

From there, I progressively enhanced, setting page length, line spacing, and the break commands, followed by centred lines and underlining. I actually came to pagination, headers & footers last.

While the process of formatting the text is easy enough to understand - you can read the input and picture the formatted output without difficulty - it’s a little fiddlier to implement than you’d hope. There’s a lot of potentially overlapping state and boundaries. You might encounter a page break, requiring you to output a page footer and then header, when processing line spacing, for example, while the output should also be indented and underlined.

As a consequence there’s a lot of state to keep hold off, and this is reflected in the code. There are lots of flags and counters to track the various modes, and far fewer pure functions than I’d usually like to write.

formatter member variables
    size_t right_margin_;
    size_t page_length_;

    size_t line_space_;
    size_t centering_;
    size_t underline_;
    size_t bold_;
    size_t italic_;
    size_t strikethrough_;

    std::string header_;
    std::string footer_;

    bool fill_;
    size_t indent_;
    std::optional<size_t> next_indent_;

    size_t current_line_;
    size_t current_page_;
    std::string buffer_;

    using output_fn = void(screen_formatter::*)(std::string const&);
    output_fn output_mode_;

Within the code, I’ve made a conscious effort to try and reduce the number of explicit if/else statements, for example by using a std::optional to return a default value. I’m sure there’s more I could do in this direction, but if so can’t see it at the moment.

Tracking the current output line, and handling the page starts and ends is the most tricky thing to track. At one point, I inadvertently ended up making the whole of the output process recursive, which could have been quite unfortunate had it not been flagged by my IDE before it did too much damage. I was able to rework things so that, ultimately, there’s only one place where the current line counter is incremented, and one place where we check for page start and end.

I can’t say I’m entirely happy, and I might well worry away at it another time, but it’s good enough for now[3]. And hey, it works!

Swerve!

In contrast to all my previous work on STinC++, I didn’t stick strictly to the program behaviour described in Software Tools in Pascal.

We no longer live in an age of big clunky line printers. We’ve even gone beyond glass teletypes. We have pseudoterminals!

Our programs can talk to pseudoterminals and ask it questions like how many characters wide and tall are you?” and it will tell you. Rather than defaulting to a page size of 66 characters wide and 60 lines long, my format queries the terminal it’s running in and, assuming it gets a sensible answer[4], defaults to that.

You can also issue instructions to the terminal like underline text until I tell you to stop. Kernighan and Plauger’s backspacing and overstriking to underline has gone the way of the line printers it worked on, so it would be absurdly anachronistic to not use the control codes.

The control codes also include bolding and italicising, so I added extensions for those, although I drew the line at blinking text.

And then, so I could properly enjoy the fruits of my formatting labour, I added a little callback function so the program waits for you to hit enter at the end of each page.

format in action

Going Heavyweight

Kernighan and Plauger suggest a number of extension exercises. Some are straightforward - different titles for odd and even pages, an include mechanism, bold text - and others are, well, I’d say challenging.

While noting the next chapter covers a macro processor, they raise the idea of incorporating a macro capability. The first step, they suggest, is a simple text expansion, where for example

.de pp
.sp
.ti +5
.en

defines a new command .pp. Whereever that occurred, it would be replaced with the space and temporary indent. “And of course it should be possible to redefine any of the built-in names like .sp.” they continue breezily, “This much is easy”, before going onto propose a macro capability that takes parameters.

Another extension, which they refrain from describing as easy, is hyphenation. They do provide a strong implementation suggestion, and also refer to TeX[5]'s hyphenation algorithm.

Their final extension, which a “truly powerful formatter needs”, is conditional formatting.

There is no limit to how extensive this can be made, but as a bare minimum, you will want to be able to test parameter values like output page number, line and page lengths, current position on the page, and whether or not you are in fill mode. You will also need arithmetic, variables to hold text and numbers, and string comparison operations. For example, both the section headings and exercises in this book were numbered automatically by the formatter, using numeric variables and operations.

Wow. I can see where hyphenation would fit, quite neatly, into the program. A macro capability would be a relatively significant piece of work, but the disruption to the existing code would be comparatively small. But this, wow, it’s big. Plenty of new code, and feels like lots of touch points in the existing code.

One of the best tests of whether you have enough tools in your formatter is whether you can construct with them a general footnote mechanism, where there can be multiple footnotes per pages, and where footnores are carried forward onto as many pages as are needed. Another test is whether you can do multi-column output. If you can do these cases well, you have enough power for most formatting situations.

I’ll say. The destination we’re being steered towards is, I guess, full-blown nroff.

I don’t see myself heading there anytime in the immediate future, but you never know. As my friend Andy said while in the midst of squashing version-update-induced compile errors on a project of his own, there is an attraction in working on something that’s completely specified.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. format is the only program in Chapter 7.

Endnotes

This whole business springs from Software Tools in Pascal and Software Tools, both by Brian W Kernighan and PJ Plauger. I love these books and commend them to you. They are both still in print, but new copies are ridiculously expensive. Happily, there are plenty of second-hand copies around, or you can borrow them from The Internet Archive using the links above.

I started using JetBrain’s CLion IDE back at the start of all this, and am currently using the new CLion Nova. It certainly feels a quicker than the CLion Classic which, since I was happy enough to spend money on that, is more than enough for me.


1. I mean it more or less does from a user point of view, but trust me when I say there’s a whole pile of software doing a ton of work in the background.
2. The AWK Programming Language, 2nd Edition, for example, was typeset using groff by, erm, Brian Kernighan
4. Which isn’t the case when you’re running as part of a pipeline, for instance
5. First released in 1978, TeX was the new kid on the typesetting block at the time Software Tools in Pascal was written

Tagged code, software-tools-in-c++, and typesetting

Make The Call : Yr glygfa o Hescwm Uchaf, rhif 13

At the back end of last year, I was talking with GeePaw about Friday Geek’s Night Out. FGNO is a regular call GeePaw and his friends have to just talk programming nonsense. I sounded like a lot of fun, and although I didn’t say so I was a teensy bit jealous. He just said I should start my own. “You know people, right?” he said, turning his innocent gaze upon me.

So I did, and it’s unbelievably great, and encourage you to do the same.


I, like I hope all of us, have friends that I know through conferences, email lists, meetups, etc, etc (although generally not through work - not sure what’s going on there) that I’ve known for years, if not decades. In our case, there’s a conference that most of us would attend most years.

Well, obviously that hasn’t been true quite so much recently

And I missed them.

So, prompted by GeePaw, I emailed them and laid out what I was thinking …​

That’s is a regular Zoom call (every Tuesday!) that GeePaw and a group of his software friends have every week. There’s no particular agenda - sometimes they look at piece of code someone brings, or talk about a work situation, or “sometimes we just grouch about stuff”. Not everyone in the group is there every time, but every week there’s a group and they talk for an hour, or two, or until everyone’s had enough. It sounds good.

… You can maybe see where this is going. …

The “just the bar part” of a convention is familiar to most of us, even if in public people prefer to say “hallway track”. It’s been the case for some time that I go to software conferences primarily because of the people I’m going to see there, most of whom I originally met through a mailing list or Twitter or some other modern equivalent of the small ads in the back of Comics International magazine. You people, the ones I’ve sent this email to. I’d like to see you more often than I do and much as I would love to invite you all to the pub on a regular basis, the realities of geography militate against it - that was true even before I moved to Pembrokeshire 🙂

But we all have computers, and webcams, and all that good modern stuff, so a Zoom call is entirely possible. So, that’s what I’d like to try - a regular call just to get together and talk about stuff. I don’t know how often - weekly feels a bit too often, and monthly not quite enough, so I’m leaning towards every couple of weeks, on a weekday evening, late enough that you can have had your tea, put the bins out, and all that

I stressed that there wasn’t any obligation, even to reply to this initial email. You could say yes, but weren’t committing to being there every time, you didn’t have to arrive at the start and stay until the end. You could come in and out, however frequently or infrequently, nor did you have to explain absence or presence.

But, I hoped, we could establish that cadence so if you wanted to come, you could.

If that sounds appealing, or even just intriguing, please let me know, perhaps with any scheduling preferences you have. If not, I’d be grateful if you let me know, but don’t feel obligated. You needn’t explain, and I won’t be at all offended - I’m aware there’s quite a lot of projection in what I’ve written above, and indeed the whole urge might just be some kind of crisis brought on by my kids leaving home, because I got hit on the head by a hockey ball on Tuesday[1], or something like that.

To my delight, everybody did like the sound of it. We tried, and have settled, on every second Wednesday, kick off at about 8:30pm and finish whenever.

And it’s just been so fantastic.

So, if you think you have a little space something like this could fill, think about who’d you’d like there with you and ask them, because they might well have a little space too.


1. Yes, I had indeed been hit on the head with a hockey ball, and was sporting an enormous black eye at the time. I don’t recommend it, but on the scale of things to be hit on the head with it’s at the lower end

Tagged the-trade

STinC++ - Ed! Ed! Hooray! : Yr glygfa o Hescwm Uchaf, rhif 12

STinC++ rewrites the programs in Software Tools in Pascal using C++

Close on two years, on 29 August 2022, I started work on edit, the program described in Chapter 6 of Software Tools In Pascal. I was, I declared, excited to be getting in it. I meant it too, and so it might be reasonable to think I’d get on with things with a shade more urgency. Life is what it is, though, and so work proceeded rather in fits and starts. Indeed, at one point I thought I might just poke away it kind of forever but I’m only 55[1] so I’m a little young to be thinking like that. Twenty one months later I wrapped up edit a thousand lines of code and 270 commits after I started. The world absolutely doesn’t need another line-oriented text editor, but nonetheless I’ve really enjoyed writing one.

A Refresher

Even if you’re the most dedicated reader, you’ve probably forgotten what I was setting out to do. Hell, I forgot several times along the way and I was actually writing the code. Chapter 6 of Software Tools in Pascal opens with brief rational for building an editor - since we have pattern matching and a simple stream editor, we’ve got the building foundations of a more general text editor.

The editor we present here is modeled after the latest in a long family of conversational text editors that have achieved wide acceptance …​ Because edit is primarily intended for interactive use, it is streamlined and terse, but easy to use. This is especially important for a text editor: for most users it is the primary interface to the system …​ edit is not confined to conversational editing, however. It can be driven from prepared scripts and from other programs …​ It is a tool.

Kernighan and Plauger briefly discuss their approach to error handling. Since edit is used to maintain files, it can’t just core dump if something goes wrong, it needs to be robust, indeed graceful, to avoid the loss of valuable work. Since an editor is unavoidably a big program (their implementation runs to more than 1200 lines of Pascal) it’s important that program is well structured or it will all get “utterly out of hand”.

There follows a few pages describing how you would use edit. If you cross out edit and write in vim, it stands up as a nice concise introduction to line-oriented editing commands. I certainly learned things which had eluded me during 30+ years of exposure to vi. They finish this section with the manual.

PROGRAM

edit edit text files

USAGE

edit [file]

FUNCTION

edit is an interactive text editor that reads command lines from its input and writes display information, upon command, to its output. It works by reading text files on command into an internal "buffer" (which may be quite large), displaying and modifying the buffer contents by other commands, then writing all or part of the buffer to text files, also on command. The buffer is organisated as a sequence of lines, numbered from 1; lines are implicitly renumbered as text is added or deleted.

Context searches and substitutions are specified by writing text patterns, following the same rules for building patterns as used by find. Substitutions specify replacement text following the same rules as used by the program change.

Line numbers are formed by the following components:

n a decimal number

. the current line ("dot")

$ the last line

/pattern/ a forward context search

\pattern\ a backward context search

Components may be combined with + or -, as in, for example

.+1 sum of . and 1

$-5 five lines before $

Line numbers are separated by commas or semicolons; a semicolon sets the current line to the most recent line number preceding.

Commands may be preceded by an arbitrary number of lines numbers (except for e, f and q, which require none be present). The last one or two are used as needed. If two line numbers are needed and only one is specified, it is used for both. If no line numbers are specified, a default rule is applied:

(.) use the current line

(.+1) use the next line

(.,.) use the current line for both line numbers

(1,$) use all lines

In alphabetical order, the command and their default line numbers are

(.) a append text after line (text follows)

(.,.) c change text (text follows)

(.,.) dp delete text

e file edit file after discarding all previous text, remember file name

f file print file name, remember file name

(.) i insert text before line (text follows)

(.,.) m line3 p move text to after line3

(.,.) p print text

q quit

(.) r file read file, appending after line

(.,.) s/pat/new/gp substitute new for occurrence of pat (g implies for each occurrence across line)

(1,$) w file write file (leaves current state unaltered)

(.) =p print line number

(.,+1) newline print one line

The trailing p, which is optional, causes the last affected line to be printed. Dot is set to the last affected line, except for f, w, and =, for which it is unchanged.

Text entered with a, c and i is terminated with a line containing just a ..

The global prefixes cause repeated execution of a command, one for each line that matches (g) or does not match (x) a specified text pattern:

(1,$) g/pattern/command

(1,$) x/pattern/command

command can be anything but a, c, i or q, and may be preceded by line numbers as usual. Dot is set to the matched line before command is done.

If the command line argument file is present, then the editor behaves as if its input began with the command e file. The first filename used is remembered, so that a subsequent e, f, r, or w command can be written with no filename to refer to the remembered filename. A filename given with e or f replaces any remembered filename.

EXAMPLE

Don’t be silly.

Let’s take a moment

The remaining 45 pages of the chapter describe the edit program itself. I won’t, I hope, take that quite that long, but equally I’m not going to dig quite as deeply into it.

Reading the manual, it’s pretty clear, I think, that the core of the editor isn’t particular complex. We can add new lines into the editor and we can remove existing lines. The difference between inserting text before a particular line or appending it after is, in the grand scheme of things, barely worth worrying about. Moving lines about or changing existing lines are a fancy ways of writing a delete followed by an insert. The basic operations of edit are entirely straightforward and no danger lurks there. Even for a very large document, a standard container is going to be entirely fine.

For me, all the interest, all the excitement is in the conversation, in the commands the user enters to manipulate the text, particularly in line numbers. We’ve got integer literals, . as shorthand for the current line, $ for the last line, forwards or backwards search, along with basic arithmetic on any or all of those. Line indexes are optional, and the defaults differ from command to command. In addition, some commands have additional parameters. There’s lots going on there.

There is one final consideration - interpret or compile. Kernighan and Plauger’s edit interprets. It processes line numbers and commands immediately as it parses. Nothing wrong with that at all, and with the oft-discussed contraints of Pascal[2] an entirely reasonable one. However, this does mean you might get half way through processing some user input and then hit an error, leaving the program in a valid but unknown state. In the case of edit the worst that can happen is that current line or remembered search pattern is no longer what you expect, so it’s hardly catastrophic, but, particularly as a long time C++ programmer[3], it’s something I’d like to avoid.

Even before I put finger to keyboard therefore, I’d been leaning (because TDDing doesn’t mean you’re not allowed to think) towards a compilation model - parse out the command to some intermediate state, compile it into something that will do the work, and only then, if everything’s gone ok, pull the trigger.

It’s right there in the code.

void editor::handle_command(
    std::string_view line,
    std::istream&in,
    std::ostream&out) {
  auto parsed_command = parse_command(line);
  auto command = parsed_command.compile(buffer_);

  command(in, out, buffer_);
}

"But you said only pull the trigger if everything was ok" you might reasonably respond "looks like you’re just carrying on regardless". Well, yes and also no. In the event of an error, edit does nothing except produce some output, a single question mark.

$ ./edit
huh
?

In the code above, if parsing or compilation fails, compile returns an error command. When we run the error command, it’s responsible for popping out the ?. More successful parses produce more useful commands like inserting or deleting.

Three Steps To Ed-en

Step One: Parsing

Parsing can be something of dark art, but the edit commands are pretty straightforward and so we don’t need anything sophisticated. The commands have the general form of 0, 1, or N line indexes followed by a single letter command, and then, potentially, a bit extra on the end. Honestly, I put off dealing with the extra bits for quite some time, before just knuckling down.

  auto parsed_command = parse_command(line);

The parse_command is a free function and hands back the parsed result by value, so we don’t have any lifetime or memory management issues to worry about. parse_command itself simply hands over the work to a command_parser object. The sole purpose of command_parser is the bundle up all the moving parts - how much of the input have we consumed, what have we found so far, have we finished - so that we don’t have to pass half-a-dozen parameters down through the call chain.

parse_command(std::string_view const input) {
  auto parser = command_parser(input);

  return parser.parse();
}
parsed_command parse() {
  parse_line_numbers();

  parse_command_code();

  has_line_numbers_when_forbidden();

  line_number_defaults();

  return parse_result();
}

parse_line_numbers builds up a list of line_expression objects. A literal index, say 1 becomes an int_index(1), while something like /next/+1 becomes add_offset(forward_search('next'), 1). I’ve said objects, but there’s no class hierarchy here. Everything is captured in little lambda functions. Some, like dot_index_fn just return constants, int_index captures the integer value. More complex operations, like addition and subtraction add_offset capture other line expressions within themselves. It’s actual function composition in action.

We’re building up the little expression functions rather than working out what, for example, .;$ means straightaway for a couple of reasons. First, the value of line expressions depends on the state of the editor, and secondly, evaluating the line expressions has side effects. The current line can be updated, and any search patterns will be remembered. We don’t want to be changing the editor state when we still don’t know if the whole command makes sense. So, at this stage, build the line expressions, and we’ll work them out later.

Once the line index parsing is taken care off, everything else is pretty straightforward. We expect to next see a command letter, and that letter tells us what to look for after that. If we see p, then we’re looking for the end of the input. On the other hand if we have an s we look for a search pattern, replacement text, and perhaps the g flag.

stiX::parsed_command parse_result() const {
  if (is_error())
    return parse_error();

  return { indicies, code, extras };
}

Once everything’s parsed out, we’re ready to compile.

Step Two: Compiling

Compiling the command is, as the parsing result might suggest, a two phase process. First the line expressions are evaluated against the edit buffer. This is slightly more involved that might first appear as we need to propagate updates to the current line and the saved search pattern through the evaluation chain. It is the case, although I think most people would never use it, that one line expression might affect another through the side effect of updating the current line. I suspect this kind of manoeuvre was perhaps more useful when working with line printers, noisily carving output into fanfold paper[4]. In this modern age of when everyone has multiple virtual glass teletypes and screen editors, we’ve rather lost the art.

std::tuple<size_t, size_t, size_t, std::string>
eval_line_expressions(
  std::vector<stiX::line_expression_step> const& line_expressions,
  stiX::edit_buffer const& buffer
) {
  auto dot = buffer.dot();
  auto last_pattern = std::string { buffer.pattern() };
  auto line_numbers = std::vector<size_t> { };

  for (auto const& [expr, separator] : line_expressions) {
    auto [index, pattern] = index_or_error(expr, buffer, dot, last_pattern);
    dot = (separator == stiX::expression_separator::update) ? index : dot;
    last_pattern = !pattern.empty() ? pattern : last_pattern;
    line_numbers.push_back(index);
  }

  if (line_numbers.size() > 2)
    line_numbers.erase(line_numbers.begin(), line_numbers.end()-2);

  return { line_numbers.front(), line_numbers.back(), dot, last_pattern };
}

Using a tuple to return several values from a function still feels like slightly stinky C++, even though I wouldn’t think twice about it in a language like Python or JavaScript. I’d still be unlikely to use it in a function that formed part of an interface or API, a public member function say, but in this context it’s perfectly fine[5].

Line expressions evaluated, yielding actual numbers, we can now compile the main command.

stiX::action command_for_code(
  char const code,
  size_t const from_index,
  size_t const to_index,
  size_t const destination,
  stiX::command_extras const& extras) {

  auto const fn = command_map.find(code);

  return fn != command_map.cend()
    ? fn->second(from_index, to_index, destination, extras)
    : stiX::command::error;
}

The fn->second(from_index, to_index, destination, extras) call does the work of compiling the command. These compilation functions do any extra setup work that might be needed, then returns, as you’ve probably guessed, a lambda function wrapping everything we’ll need when we come to run the command. Again, everything the lambda needs is captured by value, the lambda function itself is a value, so yet again we’ve avoided any lifetime issues.

using action =
    std::function<void(std::istream&, std::ostream&, edit_buffer&)>;

action stiX::make_insert_action(
    size_t const,
    size_t const to_index,
    size_t const,
    command_extras const&) {
  return [to_index](std::istream& in, std::ostream&, edit_buffer& buffer) {
    insert_action(in, to_index, buffer);
  };
}

Under the covers, there are actually more actions available beyond those implementing the commands described in the edit manual. Updating the current line and storing a search pattern are both implemented as "hidden" actions. A compiled command is, in fact, a sequence of actions, all wrapped up together.

stiX::commands stiX::parsed_command::compile(
    stiX::edit_buffer const& buffer) const {
  auto const [from, to, updated_dot, last_pattern] =
      eval_line_expressions(line_expressions, buffer);

  if (is_error(from, to, code))
    return { command::error };

  auto destination = eval_destination_expression(
    extras.destination_expression,
    buffer,
    updated_dot,
    from,
    to,
    last_pattern);

  if (is_error(destination))
    return { command:: error };

  auto set_dot = command::update_dot(updated_dot);
  auto set_pattern = command::update_pattern(last_pattern);

  auto command = command_for_code(
      code,
      from,
      to,
      destination,
      extras);

  auto and_then = extras.and_print
    ? command::and_print
    : command::noop;

  return { set_dot, set_pattern, command, and_then } ;
}

This multiple action approach only occurred to me quite late on, as I was thinking about implementing g, the global command. I’d actually not quite understood what g does, and imagined that I’d be able to do most of the work up front, and then return several commands, one for each line the global search matched. You can’t, I quickly realised, but I’d made the switch from single action to multiple action. Once the change was in place though, even though it didn’t serve my original purpose, the possibilities just opened up. Suddenly everything started getting a lot simpler, things I’d put off because they seemed awkwardEp just started dropping into place, and I realised I could discard a whole layer of indirection. I’ve got faith in myself and in the process - I’m comfortable I can form the code I write into the shape it needs to have - but every now and again something almost magical happens and some small change just cracks everything open.

Step Three: Action

We’ve parsed out the command string entered by the user, evaluated our line expressions, built our sequence of lambda functions, what now?

Well, we just apply those lambdas to the edit buffer. parsed_command::compile returns a commands object, which provides the function call operator operator(). This simply calls down into each lambda in sequence, which themselves call the functions that do actual work.

void stiX::commands::operator()(
    std::istream& in,
    std::ostream& out,
    edit_buffer& buffer) const {
  for(auto const& c : commands_)
    c(in, out, buffer);
}

The action functions are all pretty straightforward. Indeed, many of them share implementations or are implemented in terms of each other actions. Append, insert and read file, for example, all share their implementation, while change is a combination of delete and insert.

void stiX::append_action(
    std::istream& in,
    size_t const after,
    edit_buffer& buffer) {
  read_lines(in, after, true, buffer);
}

void stiX::insert_action(
    std::istream& in,
    size_t const before,
    edit_buffer& buffer) {
  auto const adjust = (!buffer.empty() && before != 0) ? 1 : 0;
  read_lines(in, before-adjust, true, buffer);
}

void stiX::read_from_file_action(
    std::ostream& out,
    size_t after,
    std::string_view filename,
    edit_buffer& buffer) {
  auto f = std::string { !filename.empty() ? filename : buffer.filename() };
  if (filename_is_good(f, out)) {
    auto source = std::ifstream { f };
    read_lines(source, after, false, buffer);
  }
}

void read_lines(
    std::istream& source,
    size_t index,
    bool stop_on_stop,
    edit_buffer& buffer) {
  while(source.peek() != eof) {
    auto line = stiX::getline(source);

    if (stop_on_stop && line == ".")
      return;

    buffer.insert(index, line);
    ++index;
  }
}

There really is very little code involved in implementing the editor commands, perhaps only 150 lines or so. Even the global command, which involves calling back into the parser, and then recompiling commands is under 20 lines or so spread over three functions. Honestly, the thing that tripped me up the most, even quite late on, is that edit line indexes are 1-based rather 0-based, except that sometimes a 0 line index is ok, depending on the command, and whether the edit buffer is empty or not.

Getting to the point where we’re actually call those functions - the parsing and compiling - is the bulk of rest of the program, 650 lines or so. That’s the eternal truth of software - working out what people want done is much more complicated that actually doing it.

A Long Way To A Small Program

While I’m coming up on two years wall clock time writing edit, I have no real idea how much human CPU time I’ve spent on it. I usually work on this little project on a weekday evening, generally for an hour or so[6], sometimes while watching a bit of FIH Pro League hockey. I’m at the end of the day, the sport might get exciting at any moment, but I’m still trying to do good work, so I do the smallest little step I can that moves the code forward, and commit it. Even if it’s only a single line. Even if it’s a single character. Now I can go again with another little step, and commit. And, if I’m able, again, and again, until it’s time to take the dogs for a wee, put the breadmaker on, or any of the other closing down for the evening chores.

You can see this in the commit history. There are gaps where weeks pass with nothing, interspersed with periods where I work on the code for several evenings in a row, committing five or six or however many times, some times with only a couple of minutes between commits.

The more often I commit, the less I have to hold in my head. The code itself contains everything I know about design and implementation. When I come back, even after an extended period, I can pick it up, find the shape, and start working again.

Maybe I could do some maths on the commit timestamps and come up with an estimate for time spent, but does it really matter? STinC++ is a pleasure project and I have really enjoyed this chapter of it.

Where next?

Kernighan and Plauger suggest a number of exercises for extending edit beyond the code they present. They include small things like a variation on print to represent "invisible" characters like tabs and trailing spaces, or more helpful error messages than ?. They work an exercise for editing with files larger than can be accommodated in memory. This is a historical curiosity, and not something I’m especially worried about today. Even my unsophisticated use of std::vector should stand-up to the rigours of everyday use in 2024[7].

More interesting exercises include implementing shelling out to a process, catching interrupts without data loss, and then cannibalising edit to make a "stream editor". I’d not really thought about it before, but I’d sort of assumed sed predated ed but it seems not.

I don’t propose to leap into any of these straightaway, but I might loop back round to them at some point. I’ve also poked at the idea of making my edit even more ed-like - ^ instead of % in the regexes, filling out the command set, that kind of thing. We’ll see. Something else that might be fun is grappling a screen editor on top of edit, in the same way vim was evolved from ex.

We’ll see.

Kernighan and Plauger conclude by describing how they use edit to pull together the manuscript of Software Tools in Pascal. This involves using edit to build scripts to feed back into edit. Of course they do. This is a book of tools, and they’re putting the tools to use.

Ed-roboros

Throughout this I’ve described edit as ed-like. When Kernighan and Plauger wrote Software Tools and then Software Tools in Pascal, ed already existed on the Bell Labs UNIX systems and the program they presented provided a subset, albeit a large subset, of ed functionality, hence ed-like. Curiously though, the man page for both the BSD ed and GNU ed both reference Software Tools in Pascal. Indeed, the GNU ed info pages go so far as to say

GNU ed is based on the editor algorithm described in Brian W. Kernighan
and P. J. Plauger's book "Software Tools in Pascal", Addison-Wesley, 1981.

Perhaps edit isn’t ed-like, perhaps ed is edit-like.

Source code

Source code for this program, indeed the whole project, is available in the stiX GitHub repository. edit is the only program in Chapter 6.

Wait, I arrived late! What’s going on?

Software Tools In Pascal by Brian Kernighan and PJ Plauger is a book that I love. Since August 2019, I’ve been gradually working through the book, reimplementing their tools in C++. The book really is a delight and I commend it to you. It’s still in print, although wildly expensive so get yourself a second hand copy or borrow it from The Internet Archive.

Tools-wise, I’ve been using JetBrain’s CLion, which I like a lot, with CMake for build and Catch2 as my test framework.


1. As of 29 May, birthday fans!
4. I’d love to know the history of that particular feature. I’d put money on it being an case book study of iterative development in action.
5. Although if you look through the commit history you’ll see it took me some time to realise this
6. A development episode can be as short as 10 minutes, and only very rarely much longer than two hours
7. By which I mean of course, that it won’t be used at all when so many better alternatives exist.

Tagged code, and software-tools-in-c++
Older posts are available in the archive or through tags.


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed