Jez Higgins

Freelance software grandad
software created
extended or repaired


Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

If You’re So Smart…​ : Yr glygfa o Hescwm Uchaf, rhif 9

Long time readers (you know who you are) will recall that a few days ago I was moaning out loud about how disappointed I was for all the interview candidates we’d seen that had been ill-served by their previous employers. I then chanced upon a piece of code I felt was indicative of what I was uphappy about, pointed at it, and made all kinds of inferences about the organisations that enabled this poor bit of code to stand.

But if I’m so clever, what would I have done?

Strap in, this is going to be a long one …​

One Step At A Time

This is what set me off, a single Python function pulled from a larger body of code.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        components = re.findall(r"\d+", reference_match)
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

It’s just not great. It’s long, for a start, and it’s long because it’s repetitious. The line

components = re.findall(r"\d+", reference_match)

appears in every branch of the if/else. Let’s start by hoisting that up.

hoisting re.findall out of the if/elif bodies
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

Clearing Visual Noise

The unnecessary brackets in the first elif body just jar. They catch the eye and makes it appear that something different is happening in the middle there, when in fact it adds nothing and is just visual noise.

removing redundant brackets
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

Move the action down

The if/else ladder sets up a load of variables, which are then used to build corrected_reference

The lines building corrected_reference aren’t the same, but they are pretty similar. We can move them out of the if/else ladder and combine them together.

moving corrected_reference down
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d+", d1)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Looking Up And Out

This is a bit of a meta-change, because you can’t infer it from the code here, but canonical_form is drawn from a data file elsewhere in the source tree. We control that data file.

Examining it, we can see it’s safe to replace d+ with d1 in the canonical forms. As a result, we can eliminate one of the replace calls when constructing corrected_reference.

discarding replace("d+", d1)
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The shape of the code hasn’t wildly changed, but feels like we’re moving in a good direction.

Typos Must Die

Another meta-fix is correcting the 'typo' in "RefYearAbrrNumStrokeNum". That string comes from the same data file as the canonical forms. Obviously "RefYearAbrrEtcEtc" looks like a loads of nonsense, but Abrr is so clearly a typo. It’s an abbreviation for abbreviation! It should be Abbr! Like the brackets I mentioned above, this is a piece of visual noise that needs to go.

typo be gone!
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbbrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Ok, it now says "RefYearAbbrNumStrokeNum" which isn’t a world changing difference, but to me it looks better and IDE agrees because there isn’t a squiggle underneath.

Constants

Those string literals give me the heebee-geebies.

replacing string literals with constants.
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == RefYearAbbrNum)
        | (reference_type == RefYearAbbrNumTeam)
        | (reference_type == YearAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Birds of Feather

By grouping like reference types together, we can slim down each if condition.

grouping like types together in an array, test using in
YearAbbrNum_Group = [
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
]

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearAbbrNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

I like that. Let’s roll it out to the rest of the types.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif reference_type in YearNumNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif reference_type in NumNum_Group:
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Love it.

Remembered Python calls arrays lists, but also that it has tuples too. Tuples are immutable, so they’re a better choice for our groups.

swap tuples for lists by switching [] to ()
YearAbbrNum_Group = (
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
)

Destructure FTW!

We can collapse the

year = ...
d1 = ...
d2 = ...

lines together into a single statement, going from three lines into a single line.

collapsing assignments
def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year, d1, d2 = components[0], components[1], ""
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Much easier on the eye.

An extra level of indirection

Bringing the year, d1, d2 assignments together particular highlights the similarity across each branch of the if ladder.

Let’s pair up a type group with a little function that pulls out the components.

YearNum_Group = {
    "Types": [
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ],
    "Parts": lambda cmpts: (cmpts[0], cmpts[1], "")
}

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group["Types"]:
        year, d1, d2 = YearNum_Group["Parts"](components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Probably did a bit too much in one go here, and it’s ugly as hell. But it works, and it captures something useful.

If we introduce a little class to pair up the types and components lambda function. It more setup at the top, but it’s neater in the function body.

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

YearNum_Group = TypeComponents(
    (
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ),
    lambda cmpts: (cmpts[0], cmpts[1], "")
)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Extend that across the two elif branches.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The if conditions and the bodies now all have the same shape. That’s pretty cool. They were similar before, but now they’re the same.

Yoink out the decision making

It’s not really clear in the code, but there are only two things really going on in this function. The first is pulling chunks out of reference_match, and the second is putting those parts back together into canonical_reference. Let’s make that clearer.

Yoink!
def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    return year, d1, d2

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Say What You Mean

There’s no need to assign year, d1, d2 in that new function. We can just return the values directly.

get out
def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if (reference_type in YearNum_Group.Types):
        return YearNum_Group.Parts(components)
    elif (reference_type in YearNumNum_Group.Types):
        return YearNumNum_Group.Parts(components)
    elif (reference_type in NumNum_Group.Types):
        return NumNum_Group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

I mentioned the if conditions and the bodies now all have the same shape. We can exploit that now to eliminate the if/else ladder.

check each group in turn, return when there’s a match
TypeGroups = (
    YearNum_Group,
    YearNumNum_Group,
    NumNum_Group
)

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

And Rest

I first wrote this on Mastodon because I’m that kind of bear, and this where I stopped. I felt the code was in a much better place - not perfect by any means, but better.

But then I thought of something else.

You Wouldn’t Let It Lie

Now the types are grouped together, I was inclinded to put the string literals back in.

We only use "RefYearAbbrNum", for example, as part of a TypeComponents object. It’s not needed anywhere else, but having it as a constants in its own right floating around implies that you might and suggests that you can. In fact, it’s YearNum_Group that is the constant, so lets tie things down to that.

YearNum_Group = TypeComponents(
    (
        "RefYearAbbrNum",
        "RefYearAbbrNumTeam",
        "YearAbbrNum"
    ),
    lambda cmpts: (cmpts[0], cmpts[1], ""),
)

I also felt the parameters to

canonicalise_reference(reference_type, reference_match, canonical_form):

are in the wrong order.

reference_type and canonical_form go together. They originate in the same place in the code, from the data file I mentioned earlier, and if they were in a tuple or wrapped in a little object I certainly wouldn’t argue.

The thing we’re working on, that we take apart and reassemble is reference_match. To me, that means it should be the first parameter we pass.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

And that I thought was that. And I went to bed.

It’s a new day

The following morning, I got a nudge from my internet fellow-traveller Barney Dellar who said

I tend to think of for-loops as Primitive Obsession. You aren’t looping to do something n times. You’re actually looking for the correct entry in the array to use. I would make that explicit. I’m not good at Python, but some kind of find or filter. Then invoke your method on the result of that filtering.

He was right and I knew it. Had this code been in C#, for instance, I’d probably have gone straight from the if ladder to a LINQ expression.

He set me off. I knew Python’s list comprehensions were its LINQ-a-like, and I had half an idea I could use one here.

However, I thought list comprehensions only created new lists. If I’d done that here, it would mean I’d still have to extract the first element. That felt at least as clumsy as the for loop.

Turns out I’d only ever half used them, though. A list comprehension actually returns an iterable. Combined with next(), which pulls the next element off the iterable, and well, it’s more pythonic.

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    return next(group.Parts(components)
                for group in TypeGroups
                if reference_type in group.Types)

What’s kind of fascinating about this change is that the list comprehension has the exact same elements as the for version, but the intent, as Barney suggested, is very different.

At the same time, Barney came up with almost exactly the same thing too. We’d done a weird long-distance almost-synchronous little pairing session. Magic.

Reflecting

This is contrived, obviously, because it’s a single one function I’ve pulled out of larger code base.

But, but, but, I do believe that now I’ve shoved it about that it’s better code.

If I was able to work to my way out from here, I’m confident I could make the whole thing better. It’d be smaller, it would be easier to read, easier to change.

(This isn’t hypothetical - I found this code because I was talking about working on it. It’s right up a number of my alleys.)

The Big Finish

I’m sure I have made the code better, and I’m just as sure that I’d make the people I was working with better programmers too. I’d be better from working with them - I’ve learned from everyone I’ve ever worked with - but I’m old. I’ve been a lot of places, done a lot of stuff, on a lot of different code bases, with busloads of people. I know what I’m doing, and I know I could have helped.

I’m sorry I couldn’t take the job, but it needed more time than I could give. In the future, well, who knows?


PS

I think it’s important to note I didn’t know where I was heading when I started. I just knew that if I nudged things around then a right shape would emerge. When I had that shape, I could be more directed.

Barney's little nudge was important too. He knew there was an improvement in there, even if neither of us was quite sure what it was (until we were!). That was great. A lovely cherry on the top.

PPS

I tried to do the least I could at each stage. In one place I took out two characters, in another I changed a single letter. Didn’t always succeed - some of what I did could have been split - but small is beautiful, and we should all aim for beauty.

This comes, in large part, from my man GeePaw Hill and his Many More Much Smaller Steps. He’s been a big influence on me over the past few years, and I’ve benefited greatly as a result.

PPPS (really, the last one, I promise)

I was proofing this article before pressing publish (which probably means there are only seven spelling and grammatical errors left), when I saw another change I’d make.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def build_canonical_form(canonical_form, year, d1, d2):
    return (canonical_form.replace("dddd", year)
        .replace("d1", d1)
        .replace("d2", d2))

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = build_canonical_form(canonical_form, year, d1, d2)

    return corrected_reference, year, d1, d2

Again, nothing huge but just another little clarification.

That really is it. For now!


Tagged the-trade, and python


Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed