Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

A Lightning Dash Through Typesetting With Brian W Kernighan : Yr olygfa o Hescwm Uchaf, rhif 11

The lightning talk[1] sessions are always a really fun part of the ACCU Conference. I gave one this year on a favourite topic of mine - an (abbreviated) history of Brian Kernighan typesetting his own books.

Typesetting with Brian Kernighan

Brian Kernighan, looking friendly and wise and Canadian

Here’s Brian Kernighan, looking all kind and wise and Canadian.

The C Programming Language

When you hear the name Brian Kernighan, you probably think immediately of this book, The C Programming Language.

The C Programming Language, 2nd Edition

Of course if you’re a little bit younger, like me, you probably think of this book, The C Programming Language, 2nd Edition.

It really is a terrific book, almost the platonic ideal of a programming language reference. While the C language was all Dennis Ritchie, I think I’m right in saying the text of the book is all Brian Kernighan.

In addition to this book, with which he has become synonymous, Brian Kernighan has written or co-written over a dozen other books. He’s not just written them though. He has typeset them. He has typeset all of them.

It’s often forgotten that typesetting was the excuse for the Unix[2] project in the first place. Bell Labs had a spare machine sitting in a corner, and Ken Thompson, Dennis Ritchie, et al, were able to secure it by undertaking to develop a computer typesetting system. Bell Telephone produced an awful lot of documentation, both internal and external, and a reliable typesetting system would be a huge benefit to the organisation.

The Elements of Programming Style

This book was set in Times Roman and News Gothic Condensed by the authors, using a Graphic Systems phototypesetter driven by a PDP 11/45 running under the UNIX operating system.

The Elements of Programming Style, Kernighan’s first book, was published in 1974. He cowrote it with Bill Plauger, who’s a fine author in his own right. It’s a book about code review and refactoring, both terms some decades away from being coined.

The book was typeset using a Graphic Systems phototypesetter and a PDP 11/45. Now, this is a bit of a brag, but at the same time it’s also a promise. We typeset our own book, and you, too, can do your own typesetting.

A big old PDP-11

Well, you can do your own typesetting if you’ve got a computer like this.

The C Programming Language

The book was set in Times Roman and Courier 12 by the authors, using a Graphic Systems phototypesetter driven by a PDP-11/70 running under the UNIX operating system.

Stepping forward four years to 1978, we have the first edition of The C Programming Language. They’ve had an upgrade, and are now using a PDP-11/70. I don’t know how computers were evolving in the mid-70s, so I’ve no idea if that was physically a bigger or smaller machine.

Software Tools in Pascal

This book was set in Times Roman and Courier by the authors, using a Mergenthaler Linotron 202 phototypesetter driven by a PDP-11/70 running under the UNIX operating system.

Stepping forward another three years we arrive at Software Tools in Pascal, published in 1981. If you’re reading this you probably already know I’m a colossal fan of this book, and I think every programmer should read it. It’s still in print, but ruinously expensive so get yourself a second hand copy. It’s really, really good.

Now, they’re using this Mergenthaler Linotron 202, which was an interesting machine. Very high quality typesetter, apparently, but rather lacking in documentation. It had a proprietary font format, that Kernighan and colleagues reverse engineered. Ken Thompson, a pioneer of computer chess, contributed a chess board and pieces to the font set, which is kind of lovely.

The C Programming Language, 2nd Edition

The book was typeset (pic|tbl|eqn|troff -ms) in Times Roman and Courier by the authors, using an Autologic APS-5 phototypesetter and a DEC VAX 8550 running the 9th Edition of the UNIX operating system.

1988 now, and look, now we have a shell pipeline - pic|tbl|eqn|troff -ms. troff is the typesetting program, originally written by Joe Ossanna in PDP-11 assembly and then in C. Ossanna unfortunately died quite young, and development of troff was taken over by Kernighan, Lorinda Cherry, Mike Lesk, and others. pic, tbl, and eqn are troff preprocessors, providing markup for line drawings, tabular data, and equations & formulae respectively.

The Practice of Programming

This book was typeset (grap|pic|tbl|eqn|troff -mpm) in Times New Roman and Lucida Sans Typewriter by the authors.

Big jump forward now to 1998. Kernighan’s about to head off to teach at Princeton. The Practice of Programming, cowritten with Rob Pike, is another terrific book. Now, look, we can do this on commodity hardware. We don’t need to say what the typesetter was - it’s not important anymore. Kernighan has popped another little troff preprocessor into his pipeline, grap, which provides a language for typesetting graphs.

UNIX - A History and a Memoir

Camera-ready copy for this book was produced by the author in Times Roman and Helvetica, using groff, ghostscript, and other open source Unix tools.

Leap forward another twenty years, to 2020, and this book UNIX - A History and a Memoir, which is just delightful. Kernighan did this on his laptop. He’s now using groff, which is the GNU troff replacement originally developed by James Clark[3]. He’s using Ghostscript, the free PostScript renderer, and "other open source Unix tools". Hmm, for "other open source Unix tools" we can read "Linux". He’s travelled that journey from proprietary Unix back then, to free software now.

The AWK Programming Language

This book was formatted by the authors in Times Roman, Courier and Helvetica, using Groff, Ghostscript and other open source Unix tools.

Just at the end of last year, Kernighan published The AWK Programming Language 2nd Edition. This is a big update to a book first published in 1988, about a language first developed in 1977, prompted, in part, by the fact that a man in his 80s decided to put full Unicode support into a 40 year old programming language.

What a gangster.

Brian Kernighan, with his feet up on the desk

Ossanna and Kernighan’s Troff User’s Manual does not say how it was typeset.

A blot on an otherwise exemplary career.


1. ACCU’s lightning talks are 5 minutes long, with a hard deadline. Overrun, and the mic gets cut and you’re gonged off the stage. Some lightning talks are highly technical, some are responses to conference sessions, quite a lot of them are just plain fun.
2. A name, if not necessarily a spelling, coined by Kernighan.
3. Clark is another titan of text processing who’s worked on SGML, XML, the Expat parser, and more.

Tagged brian-kernighan, typesetting, and accu-conference

ACCU Conference, It’s Good To Be Back : Yr glygfa o Hescwm Uchaf, rhif 10

In many, many ways the ACCU Conference has helped shape me into the programmer I am today. I’ve learned so, so much, it’s allowed me to meet people I would never have otherwise, I’ve made so many friends, it’s helped me find work (though I never imagined it would nor sought for that to happen), and I’ve had a whole load of fun. It’s important to me, I care about it, and that’s why it was just a joy to head over to Bristol last week for the 2024 edition.

I had forgotten just how much fun it is. I’ve been to, and enjoyed, a whole load of different conferences but they’re not the same. The ACCU Conference is my home conference. It’s the first conference I ever went to. It’s the conference I’ve attended most often, missing only a handful in the past 25 odd years. It’s where I did my first conference presentation. It’s where I’ve done my best talks, and had my biggest (and thus most nerve-wracking) audience. The ecumenical nature of the programme meshes with my own meandering working life. It’s where I’ve attended the deepest, the funniest, the most bit-twiddly, the most useful, just the best talks I’ve ever seen. I know it and it knows me. It’s comfortable, but it’s new each time.

It was exhausting and invigorating and I had a great time. It was wonderful to be there, and I’m already looking forward to next year.

(I’d also forgotten what a falafel-tastic city Bristol is. There are at least three falafel stalls within five minutes’ walk of the conference hotel, and they’re all, in their distinct ways, really good. I had a falafel wrap every day, and I loved every mouthful.)

I won’t recap all the sessions I went to, but I would like to particularly point out Laura Savino’s keynote Coping With Other People’s Code, which I thought was really strong (and I’m not just saying that because she also likes a falafel wrap). She touched on a lot of areas I’ve been thinking about recently with respect to what I laughingly call my own career, and she’s given me plenty more to chew on. She gave a version of the talk at CppCon back in October, but if you get the chance to see her in person then do think about doing so.

And, yea, come to Bristol with me next April. It’ll change your life.


Tagged accu-conference

If You’re So Smart…​ : Yr glygfa o Hescwm Uchaf, rhif 9

Long time readers (you know who you are) will recall that a few days ago I was moaning out loud about how disappointed I was for all the interview candidates we’d seen that had been ill-served by their previous employers. I then chanced upon a piece of code I felt was indicative of what I was unhappy about, pointed at it, and made all kinds of inferences about the organisations that enabled this poor bit of code to stand.

But if I’m so clever, what would I have done?

Strap in, this is going to be a long one …​

One Step At A Time

This is what set me off, a single Python function pulled from a larger body of code.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        components = re.findall(r"\d+", reference_match)
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

It’s just not great. It’s long, for a start, and it’s long because it’s repetitious. The line

components = re.findall(r"\d+", reference_match)

appears in every branch of the if/else. Let’s start by hoisting that up.

hoisting re.findall out of the if/elif bodies
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2
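
To see why that one line can happily sit above every branch, it helps to look at what it produces. A minimal sketch, using a made-up reference string (the real inputs aren't shown here): re.findall with \d+ returns every run of digits, in order, as a list of strings, whatever the reference type turns out to be.

```python
import re

# Hypothetical reference text; the real data isn't shown in this post.
reference_match = "[2019] ABC 123"

# Every maximal run of digits, left to right, as strings.
components = re.findall(r"\d+", reference_match)

print(components)  # ['2019', '123']
```

Because the components are strings rather than ints, they can be passed straight to str.replace further down.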

Clearing Visual Noise

The unnecessary brackets in the first elif body just jar. They catch the eye and make it appear that something different is happening in the middle there, when in fact they add nothing and are just visual noise.

removing redundant brackets
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

Move the action down

The if/else ladder sets up a load of variables, which are then used to build corrected_reference.

The lines building corrected_reference aren’t the same, but they are pretty similar. We can move them out of the if/else ladder and combine them together.

moving corrected_reference down
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d+", d1)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2
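
Chaining every replace for every reference type only works because str.replace leaves a string untouched when the placeholder isn't there. A small sketch of that, with invented canonical forms standing in for the ones in the data file, and a helper name, fill, that exists only for this example:

```python
# Invented canonical forms; the real ones live in a data file not shown here.
year_form = "[dddd] ABC d+"
pair_form = "ABC d1/d2"

def fill(canonical_form, year, d1, d2):
    # Placeholders missing from the form are simply skipped by replace().
    return (canonical_form.replace("dddd", year)
            .replace("d+", d1)
            .replace("d1", d1)
            .replace("d2", d2))

print(fill(year_form, "2019", "123", ""))  # [2019] ABC 123
print(fill(pair_form, "", "45", "6"))      # ABC 45/6
```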

Looking Up And Out

This is a bit of a meta-change, because you can’t infer it from the code here, but canonical_form is drawn from a data file elsewhere in the source tree. We control that data file.

Examining it, we can see it’s safe to replace d+ with d1 in the canonical forms. As a result, we can eliminate one of the replace calls when constructing corrected_reference.

discarding replace("d+", d1)
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The shape of the code hasn’t wildly changed, but it feels like we’re moving in a good direction.

Typos Must Die

Another meta-fix is correcting the 'typo' in "RefYearAbrrNumStrokeNum". That string comes from the same data file as the canonical forms. Obviously "RefYearAbrrEtcEtc" looks like a load of nonsense, but Abrr is so clearly a typo. It’s an abbreviation for abbreviation! It should be Abbr! Like the brackets I mentioned above, this is a piece of visual noise that needs to go.

typo be gone!
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbbrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Ok, it now says "RefYearAbbrNumStrokeNum" which isn’t a world changing difference, but to me it looks better and my IDE agrees because there isn’t a squiggle underneath.

Constants

Those string literals give me the heebie-jeebies.

replacing string literals with constants.
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == RefYearAbbrNum)
        | (reference_type == RefYearAbbrNumTeam)
        | (reference_type == YearAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2
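
The constant definitions themselves aren't shown above. Something like the following would do, assuming each constant simply holds the string it replaces (the names match the code above, the values are an assumption). The payoff is that a misspelt literal silently never matches, while a misspelt name fails loudly.

```python
# Assumed definitions: each constant simply names the string it replaces.
RefYearAbbrNum = "RefYearAbbrNum"
RefYearAbbrNumTeam = "RefYearAbbrNumTeam"
YearAbbrNum = "YearAbbrNum"
RefYearAbbrNumNumTeam = "RefYearAbbrNumNumTeam"
RefYearAbbrNumStrokeNum = "RefYearAbbrNumStrokeNum"
RefYearNumAbbrNum = "RefYearNumAbbrNum"
AbbrNumAbbrNum = "AbbrNumAbbrNum"
NumAbbrNum = "NumAbbrNum"
EuroRefC = "EuroRefC"
EuroRefT = "EuroRefT"

# A misspelt literal just quietly fails to match...
typo_matches = ("RefYearAbrrNumStrokeNum" == RefYearAbbrNumStrokeNum)
print(typo_matches)  # False

# ...whereas a misspelt name is an error the moment it's evaluated.
try:
    RefYearAbrrNumStrokeNum  # hypothetical slip of the keyboard
    typo_caught = False
except NameError:
    typo_caught = True
print(typo_caught)  # True
```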

Birds of a Feather

By grouping like reference types together, we can slim down each if condition.

grouping like types together in an array, test using in
YearNum_Group = [
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
]

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

I like that. Let’s roll it out to the rest of the types.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif reference_type in YearNumNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif reference_type in NumNum_Group:
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Love it.

I then remembered that Python calls arrays lists, and that it has tuples too. Tuples are immutable, so they’re a better choice for our groups.

swap lists for tuples by switching [] to ()
YearNum_Group = (
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
)
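
What immutability buys us can be shown in a few lines. This is only a sketch: the constants are assumed to hold the type strings themselves, and as_list and as_tuple are names made up for the comparison.

```python
# Assumed: the constants hold the type strings themselves.
RefYearAbbrNum = "RefYearAbbrNum"
RefYearAbbrNumTeam = "RefYearAbbrNumTeam"
YearAbbrNum = "YearAbbrNum"

as_list = [RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum]
as_tuple = (RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum)

# Both support the `in` test used by canonicalise_reference.
print(YearAbbrNum in as_list, YearAbbrNum in as_tuple)  # True True

# A list can be changed after the fact, by accident or otherwise.
as_list.append("Surprise")

# A tuple has no append, so the same slip is an error.
try:
    as_tuple.append("Surprise")
    tuple_changed = True
except AttributeError:
    tuple_changed = False
print(tuple_changed)  # False
```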

Destructure FTW!

We can collapse the

year = ...
d1 = ...
d2 = ...

lines together into a single statement, going from three lines to one.

collapsing assignments
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year, d1, d2 = components[0], components[1], ""
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Much easier on the eye.
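
If the one-line form is unfamiliar, here's a tiny sketch of what it does, using made-up components: the right-hand side is packed into a tuple, which is then unpacked into the three names in order.

```python
# Hypothetical digit runs, as re.findall(r"\d+", ...) might return them.
components = ["2019", "123"]

# The right-hand side is built as a tuple, then unpacked name by name.
year, d1, d2 = components[0], components[1], ""

print(year, d1, repr(d2))  # 2019 123 ''
```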

An extra level of indirection

Bringing the year, d1, d2 assignments together particularly highlights the similarity across each branch of the if ladder.

Let’s pair up a type group with a little function that pulls out the components.

YearNum_Group = {
    "Types": [
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ],
    "Parts": lambda cmpts: (cmpts[0], cmpts[1], "")
}

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group["Types"]:
        year, d1, d2 = YearNum_Group["Parts"](components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Probably did a bit too much in one go here, and it’s ugly as hell. But it works, and it captures something useful.

Let’s introduce a little class to pair up the types and the components lambda function. It’s more setup at the top, but it’s neater in the function body.

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

YearNum_Group = TypeComponents(
    (
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ),
    lambda cmpts: (cmpts[0], cmpts[1], "")
)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Extend that across the two elif branches.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The if conditions and the bodies now all have the same shape. That’s pretty cool. They were similar before, but now they’re the same.
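
Only YearNum_Group is spelled out above. Here's a sketch of how all three groups might look, with the group membership taken from the original if conditions and the lambdas lifted from the one-line assignments. The constant values are assumed to be the type strings themselves, and the sample components are made up.

```python
# Assumed: each constant holds its own type string.
RefYearAbbrNum = "RefYearAbbrNum"
RefYearAbbrNumTeam = "RefYearAbbrNumTeam"
YearAbbrNum = "YearAbbrNum"
RefYearAbbrNumNumTeam = "RefYearAbbrNumNumTeam"
RefYearAbbrNumStrokeNum = "RefYearAbbrNumStrokeNum"
RefYearNumAbbrNum = "RefYearNumAbbrNum"
AbbrNumAbbrNum = "AbbrNumAbbrNum"
NumAbbrNum = "NumAbbrNum"
EuroRefC = "EuroRefC"
EuroRefT = "EuroRefT"

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

YearNum_Group = TypeComponents(
    (RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum),
    lambda cmpts: (cmpts[0], cmpts[1], "")
)
YearNumNum_Group = TypeComponents(
    (RefYearAbbrNumNumTeam, RefYearAbbrNumStrokeNum, RefYearNumAbbrNum),
    lambda cmpts: (cmpts[0], cmpts[1], cmpts[2])
)
NumNum_Group = TypeComponents(
    (AbbrNumAbbrNum, NumAbbrNum, EuroRefC, EuroRefT),
    lambda cmpts: ("", cmpts[0], cmpts[1])
)

# Each group knows how to turn raw components into (year, d1, d2).
print(YearNumNum_Group.Parts(["2019", "12", "34"]))  # ('2019', '12', '34')
print(NumNum_Group.Parts(["12", "34"]))              # ('', '12', '34')
```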

Yoink out the decision making

It’s not really clear in the code, but there are only two things really going on in this function. The first is pulling chunks out of reference_match, and the second is putting those parts back together into corrected_reference. Let’s make that clearer.

Yoink!
def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    return year, d1, d2

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Say What You Mean

There’s no need to assign year, d1, d2 in that new function. We can just return the values directly.

get out
def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        return YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        return YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        return NumNum_Group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

I mentioned the if conditions and the bodies now all have the same shape. We can exploit that now to eliminate the if/else ladder.

check each group in turn, return when there’s a match
TypeGroups = (
    YearNum_Group,
    YearNumNum_Group,
    NumNum_Group
)

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2
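
One thing worth knowing about the loop version: if no group claims the reference type, the loop finishes without returning, so the function hands back None and the destructuring in the caller fails. A cut-down sketch, with a single made-up group standing in for the real TypeGroups and invented input strings:

```python
import re

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

# One made-up group is enough to show the behaviour.
TypeGroups = (
    TypeComponents(("NumAbbrNum",), lambda cmpts: ("", cmpts[0], cmpts[1])),
)

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)
    # No match: falls off the end and implicitly returns None.

print(reference_components("NumAbbrNum", "12 ABC 34"))  # ('', '12', '34')
print(reference_components("NoSuchType", "12 ABC 34"))  # None

try:
    year, d1, d2 = reference_components("NoSuchType", "12 ABC 34")
    unpack_failed = False
except TypeError:
    # None can't be unpacked into three names.
    unpack_failed = True
print(unpack_failed)  # True
```

Whether an unknown type should be an error at all depends on the rest of the code base, which isn't shown here.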

And Rest

I first wrote this on Mastodon because I’m that kind of bear, and this is where I stopped. I felt the code was in a much better place - not perfect by any means, but better.

But then I thought of something else.

You Wouldn’t Let It Lie

Now the types are grouped together, I was inclined to put the string literals back in.

We only use "RefYearAbbrNum", for example, as part of a TypeComponents object. It’s not needed anywhere else, but having it as a constant in its own right, floating around, implies that you might use it elsewhere and suggests that you can. In fact, it’s YearNum_Group that is the constant, so let’s tie things down to that.

YearNum_Group = TypeComponents(
    (
        "RefYearAbbrNum",
        "RefYearAbbrNumTeam",
        "YearAbbrNum"
    ),
    lambda cmpts: (cmpts[0], cmpts[1], ""),
)

I also felt the parameters to

canonicalise_reference(reference_type, reference_match, canonical_form):

are in the wrong order.

reference_type and canonical_form go together. They originate in the same place in the code, from the data file I mentioned earlier, and if they were in a tuple or wrapped in a little object I certainly wouldn’t argue.

The thing we’re working on, that we take apart and reassemble, is reference_match. To me, that means it should be the first parameter we pass.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2
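
Had I gone as far as wrapping reference_type and canonical_form together, it might have looked something like this. ReferenceFormat and its sample values are invented for this sketch; they aren't from the code I was looking at.

```python
from typing import NamedTuple

class ReferenceFormat(NamedTuple):
    # Hypothetical pairing of the two values that come from the data file.
    reference_type: str
    canonical_form: str

fmt = ReferenceFormat("NumAbbrNum", "d1 ABC d2")

# Still a tuple underneath, so it destructures like one.
reference_type, canonical_form = fmt
print(fmt.reference_type, "->", fmt.canonical_form)  # NumAbbrNum -> d1 ABC d2
```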

And that I thought was that. And I went to bed.

It’s a new day

The following morning, I got a nudge from my internet fellow-traveller Barney Dellar who said

I tend to think of for-loops as Primitive Obsession. You aren’t looping to do something n times. You’re actually looking for the correct entry in the array to use. I would make that explicit. I’m not good at Python, but some kind of find or filter. Then invoke your method on the result of that filtering.

He was right and I knew it. Had this code been in C#, for instance, I’d probably have gone straight from the if ladder to a LINQ expression.

He set me off. I knew Python’s list comprehensions were its LINQ-a-like, and I had half an idea I could use one here.

However, I thought list comprehensions only created new lists. If I’d done that here, it would mean I’d still have to extract the first element. That felt at least as clumsy as the for loop.

Turns out I’d only ever half used them, though. Swap the square brackets for round ones and you get a generator expression, which hands back a lazy iterator instead of building a whole list. Combine that with next(), which pulls the next element off the iterator, and well, it’s more pythonic.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    return next(group.Parts(components)
                for group in TypeGroups
                if reference_type in group.Types)

What’s kind of fascinating about this change is that the expression passed to next() has the exact same elements as the for version, but the intent, as Barney suggested, is very different.

At the same time, Barney came up with almost exactly the same thing too. We’d done a weird long-distance almost-synchronous little pairing session. Magic.
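
The difference between building a list and pulling out the first match lazily can be sketched with some made-up numbers. Square brackets build the whole list before you can index into it; round brackets give a generator, which next() consumes one element at a time.

```python
# Made-up data: find the first even number.
numbers = [3, 8, 5, 12]

# Square brackets build the whole list up front; we then take element 0.
first_from_list = [n for n in numbers if n % 2 == 0][0]

# Round brackets give a lazy generator; next() stops at the first hit.
first_from_gen = next(n for n in numbers if n % 2 == 0)

print(first_from_list, first_from_gen)  # 8 8

# With nothing to find, next() raises StopIteration unless given a default.
nothing = next((n for n in numbers if n > 100), None)
print(nothing)  # None
```

Without a default, an unmatched reference type in reference_components would surface as a StopIteration.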

Reflecting

This is contrived, obviously, because it’s a single function I’ve pulled out of a larger code base.

But, but, but, I do believe that now I’ve shoved it about that it’s better code.

If I was able to work my way out from here, I’m confident I could make the whole thing better. It’d be smaller, it would be easier to read, easier to change.

(This isn’t hypothetical - I found this code because I was talking about working on it. It’s right up a number of my alleys.)

The Big Finish

I’m sure I have made the code better, and I’m just as sure that I’d make the people I was working with better programmers too. I’d be better from working with them - I’ve learned from everyone I’ve ever worked with - but I’m old. I’ve been a lot of places, done a lot of stuff, on a lot of different code bases, with busloads of people. I know what I’m doing, and I know I could have helped.

I’m sorry I couldn’t take the job, but it needed more time than I could give. In the future, well, who knows?


PS

I think it’s important to note I didn’t know where I was heading when I started. I just knew that if I nudged things around then a right shape would emerge. When I had that shape, I could be more directed.

Barney's little nudge was important too. He knew there was an improvement in there, even if neither of us was quite sure what it was (until we were!). That was great. A lovely cherry on the top.

PPS

I tried to do the least I could at each stage. In one place I took out two characters, in another I changed a single letter. Didn’t always succeed - some of what I did could have been split - but small is beautiful, and we should all aim for beauty.

This comes, in large part, from my man GeePaw Hill and his Many More Much Smaller Steps. He’s been a big influence on me over the past few years, and I’ve benefited greatly as a result.

PPPS (really, the last one, I promise)

I was proofing this article before pressing publish (which probably means there are only seven spelling and grammatical errors left), when I saw another change I’d make.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def build_canonical_form(canonical_form, year, d1, d2):
    return (canonical_form.replace("dddd", year)
        .replace("d1", d1)
        .replace("d2", d2))

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = build_canonical_form(canonical_form, year, d1, d2)

    return corrected_reference, year, d1, d2

Again, nothing huge but just another little clarification.
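
Pulling the pieces together, the whole thing might end up looking like this. The group membership comes from the original if conditions; the matched text and canonical form at the bottom are hypothetical, made up so the sketch can be run on its own.

```python
import re

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

YearNum_Group = TypeComponents(
    ("RefYearAbbrNum", "RefYearAbbrNumTeam", "YearAbbrNum"),
    lambda cmpts: (cmpts[0], cmpts[1], ""),
)
YearNumNum_Group = TypeComponents(
    ("RefYearAbbrNumNumTeam", "RefYearAbbrNumStrokeNum", "RefYearNumAbbrNum"),
    lambda cmpts: (cmpts[0], cmpts[1], cmpts[2]),
)
NumNum_Group = TypeComponents(
    ("AbbrNumAbbrNum", "NumAbbrNum", "EuroRefC", "EuroRefT"),
    lambda cmpts: ("", cmpts[0], cmpts[1]),
)
TypeGroups = (YearNum_Group, YearNumNum_Group, NumNum_Group)

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def build_canonical_form(canonical_form, year, d1, d2):
    return (canonical_form.replace("dddd", year)
            .replace("d1", d1)
            .replace("d2", d2))

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = build_canonical_form(canonical_form, year, d1, d2)

    return corrected_reference, year, d1, d2

# Hypothetical inputs: neither string comes from the real data file.
result = canonicalise_reference("ref 2019/12 ABC 34",
                                "RefYearNumAbbrNum",
                                "[dddd] d1 ABC d2")
print(result)  # ('[2019] 12 ABC 34', '2019', '12', '34')
```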

That really is it. For now!


Tagged the-trade, and python