Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Older posts are available in the archive or through tags.

Thursday 07 March 2024 If You’re So Smart… : Yr glygfa o Hescwm Uchaf, rhif 9

Long time readers (you know who you are) will recall that a few days ago I was moaning out loud about how disappointed I was for all the interview candidates we’d seen that had been ill-served by their previous employers. I then chanced upon a piece of code I felt was indicative of what I was uphappy about, pointed at it, and made all kinds of inferences about the organisations that enabled this poor bit of code to stand.

But if I’m so clever, what would I have done?

Strap in, this is going to be a long one …

One Step At A Time

This is what set me off, a single Python function pulled from a larger body of code.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        components = re.findall(r"\d+", reference_match)
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        components = re.findall(r"\d+", reference_match)
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

It’s just not great. It’s long, for a start, and it’s long because it’s repetitious. The line

components = re.findall(r"\d+", reference_match)

appears in every branch of the if/else. Let’s start by hoisting that up.

hoisting re.findall out of the if/elif bodies

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = (
            canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)
        )

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

Clearing Visual Noise

The unnecessary brackets in the first elif body just jar. They catch the eye and makes it appear that something different is happening in the middle there, when in fact it adds nothing and is just visual noise.

removing redundant brackets

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
        corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1)

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
        corrected_reference = canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2)

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]
        corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2)

    return corrected_reference, year, d1, d2

Move the action down

The if/else ladder sets up a load of variables, which are then used to build corrected_reference

The lines building corrected_reference aren’t the same, but they are pretty similar. We can move them out of the if/else ladder and combine them together.

moving corrected_reference down

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d+", d1)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Looking Up And Out

This is a bit of a meta-change, because you can’t infer it from the code here, but canonical_form is drawn from a data file elsewhere in the source tree. We control that data file.

Examining it, we can see it’s safe to replace d+ with d1 in the canonical forms. As a result, we can eliminate one of the replace calls when constructing corrected_reference.

discarding replace("d+", d1)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbrrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The shape of the code hasn’t wildly changed, but feels like we’re moving in a good direction.

Typos Must Die

Another meta-fix is correcting the 'typo' in "RefYearAbrrNumStrokeNum". That string comes from the same data file as the canonical forms. Obviously "RefYearAbrrEtcEtc" looks like a loads of nonsense, but Abrr is so clearly a typo. It’s an abbreviation for abbreviation! It should be Abbr! Like the brackets I mentioned above, this is a piece of visual noise that needs to go.

typo be gone!

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == "RefYearAbbrNum")
        | (reference_type == "RefYearAbbrNumTeam")
        | (reference_type == "YearAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""

    elif (
        (reference_type == "RefYearAbbrNumNumTeam")
        | (reference_type == "RefYearAbbrNumStrokeNum")
        | (reference_type == "RefYearNumAbbrNum")
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]

    elif (
        (reference_type == "AbbrNumAbbrNum")
        | (reference_type == "NumAbbrNum")
        | (reference_type == "EuroRefC")
        | (reference_type == "EuroRefT")
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Ok, it now says "RefYearAbbrNumStrokeNum" which isn’t a world changing difference, but to me it looks better and IDE agrees because there isn’t a squiggle underneath.

Constants

Those string literals give me the heebee-geebies.

replacing string literals with constants.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if (
        (reference_type == RefYearAbbrNum)
        | (reference_type == RefYearAbbrNumTeam)
        | (reference_type == YearAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Birds of Feather

By grouping like reference types together, we can slim down each if condition.

grouping like types together in an array, test using in

YearAbbrNum_Group = [
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
]

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearAbbrNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif (
        (reference_type == RefYearAbbrNumNumTeam)
        | (reference_type == RefYearAbbrNumStrokeNum)
        | (reference_type == RefYearNumAbbrNum)
    ):
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif (
        (reference_type == AbbrNumAbbrNum)
        | (reference_type == NumAbbrNum)
        | (reference_type == EuroRefC)
        | (reference_type == EuroRefT)
    ):
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

I like that. Let’s roll it out to the rest of the types.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = ""
    elif reference_type in YearNumNum_Group:
        year = components[0]
        d1 = components[1]
        d2 = components[2]
    elif reference_type in NumNum_Group:
        year = ""
        d1 = components[0]
        d2 = components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Love it.

Remembered Python calls arrays lists, but also that it has tuples too. Tuples are immutable, so they’re a better choice for our groups.

swap tuples for lists by switching [] to ()

YearAbbrNum_Group = (
    RefYearAbbrNum,
    RefYearAbbrNumTeam,
    YearAbbrNum
)

Destructure FTW!

We can collapse the

year = ...
d1 = ...
d2 = ...

lines together into a single statement, going from three lines into a single line.

collapsing assignments

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group:
        year, d1, d2 = components[0], components[1], ""
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Much easier on the eye.

An extra level of indirection

Bringing the year, d1, d2 assignments together particular highlights the similarity across each branch of the if ladder.

Let’s pair up a type group with a little function that pulls out the components.

YearNum_Group = {
    "Types": [
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ],
    "Parts": lambda cmpts: (cmpts[0], cmpts[1], "")
}

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group["Types"]:
        year, d1, d2 = YearNum_Group["Parts"](components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Probably did a bit too much in one go here, and it’s ugly as hell. But it works, and it captures something useful.

If we introduce a little class to pair up the types and components lambda function. It more setup at the top, but it’s neater in the function body.

class TypeComponents:
    def __init__(self, types, parts):
        self.Types = types
        self.Parts = parts

YearNum_Group = TypeComponents(
    (
        RefYearAbbrNum,
        RefYearAbbrNumTeam,
        YearAbbrNum
    ),
    lambda cmpts: (cmpts[0], cmpts[1], "")
)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group:
        year, d1, d2 = components[0], components[1], components[2]
    elif reference_type in NumNum_Group:
        year, d1, d2 = "", components[0], components[1]

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Extend that across the two elif branches.

def canonicalise_reference(reference_type, reference_match, canonical_form):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

The if conditions and the bodies now all have the same shape. That’s pretty cool. They were similar before, but now they’re the same.

Yoink out the decision making

It’s not really clear in the code, but there are only two things really going on in this function. The first is pulling chunks out of reference_match, and the second is putting those parts back together into canonical_reference. Let’s make that clearer.

Yoink!

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if reference_type in YearNum_Group.Types:
        year, d1, d2 = YearNum_Group.Parts(components)
    elif reference_type in YearNumNum_Group.Types:
        year, d1, d2 = YearNumNum_Group.Parts(components)
    elif reference_type in NumNum_Group.Types:
        year, d1, d2 = NumNum_Group.Parts(components)

    return year, d1, d2

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Say What You Mean

There’s no need to assign year, d1, d2 in that new function. We can just return the values directly.

get out

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    if (reference_type in YearNum_Group.Types):
        return YearNum_Group.Parts(components)
    elif (reference_type in YearNumNum_Group.Types):
        return YearNumNum_Group.Parts(components)
    elif (reference_type in NumNum_Group.Types):
        return NumNum_Group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

Search

I mentioned the if conditions and the bodies now all have the same shape. We can exploit that now to eliminate the if/else ladder.

check each group in turn, return when there’s a match

TypeGroups = (
    YearNum_Group,
    YearNumNum_Group,
    NumNum_Group
)

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_type, reference_match, canonical_form):
    year, d1, d2 = reference_components(reference_type, reference_match)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

And Rest

I first wrote this on Mastodon because I’m that kind of bear, and this where I stopped. I felt the code was in a much better place - not perfect by any means, but better.

But then I thought of something else.

You Wouldn’t Let It Lie

Now the types are grouped together, I was inclinded to put the string literals back in.

We only use "RefYearAbbrNum", for example, as part of a TypeComponents object. It’s not needed anywhere else, but having it as a constants in its own right floating around implies that you might and suggests that you can. In fact, it’s YearNum_Group that is the constant, so lets tie things down to that.

YearNum_Group = TypeComponents(
    (
        "RefYearAbbrNum",
        "RefYearAbbrNumTeam",
        "YearAbbrNum"
    ),
    lambda cmpts: (cmpts[0], cmpts[1], ""),
)

I also felt the parameters to

canonicalise_reference(reference_type, reference_match, canonical_form):

are in the wrong order.

reference_type and canonical_form go together. They originate in the same place in the code, from the data file I mentioned earlier, and if they were in a tuple or wrapped in a little object I certainly wouldn’t argue.

The thing we’re working on, that we take apart and reassemble is reference_match. To me, that means it should be the first parameter we pass.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = (canonical_form.replace("dddd", year)
                           .replace("d1", d1)
                           .replace("d2", d2))

    return corrected_reference, year, d1, d2

And that I thought was that. And I went to bed.

It’s a new day

The following morning, I got a nudge from my internet fellow-traveller Barney Dellar who said

I tend to think of for-loops as Primitive Obsession. You aren’t looping to do something n times. You’re actually looking for the correct entry in the array to use. I would make that explicit. I’m not good at Python, but some kind of find or filter. Then invoke your method on the result of that filtering.

He was right and I knew it. Had this code been in C#, for instance, I’d probably have gone straight from the if ladder to a LINQ expression.

He set me off. I knew Python’s list comprehensions were its LINQ-a-like, and I had half an idea I could use one here.

However, I thought list comprehensions only created new lists. If I’d done that here, it would mean I’d still have to extract the first element. That felt at least as clumsy as the for loop.

Turns out I’d only ever half used them, though. A list comprehension actually returns an iterable. Combined with next(), which pulls the next element off the iterable, and well, it’s more pythonic.

def reference_components(reference_type, reference_match):
    components = re.findall(r"\d+", reference_match)

    return next(group.Parts(components)
                for group in TypeGroups
                if reference_type in group.Types)

What’s kind of fascinating about this change is that the list comprehension has the exact same elements as the for version, but the intent, as Barney suggested, is very different.

At the same time, Barney came up with almost exactly the same thing too. We’d done a weird long-distance almost-synchronous little pairing session. Magic.

Reflecting

This is contrived, obviously, because it’s a single one function I’ve pulled out of larger code base.

But, but, but, I do believe that now I’ve shoved it about that it’s better code.

If I was able to work to my way out from here, I’m confident I could make the whole thing better. It’d be smaller, it would be easier to read, easier to change.

(This isn’t hypothetical - I found this code because I was talking about working on it. It’s right up a number of my alleys.)

The Big Finish

I’m sure I have made the code better, and I’m just as sure that I’d make the people I was working with better programmers too. I’d be better from working with them - I’ve learned from everyone I’ve ever worked with - but I’m old. I’ve been a lot of places, done a lot of stuff, on a lot of different code bases, with busloads of people. I know what I’m doing, and I know I could have helped.

I’m sorry I couldn’t take the job, but it needed more time than I could give. In the future, well, who knows?

PS

I think it’s important to note I didn’t know where I was heading when I started. I just knew that if I nudged things around then a right shape would emerge. When I had that shape, I could be more directed.

Barney's little nudge was important too. He knew there was an improvement in there, even if neither of us was quite sure what it was (until we were!). That was great. A lovely cherry on the top.

PPS

I tried to do the least I could at each stage. In one place I took out two characters, in another I changed a single letter. Didn’t always succeed - some of what I did could have been split - but small is beautiful, and we should all aim for beauty.

This comes, in large part, from my man GeePaw Hill and his Many More Much Smaller Steps. He’s been a big influence on me over the past few years, and I’ve benefited greatly as a result.

PPPS (really, the last one, I promise)

I was proofing this article before pressing publish (which probably means there are only seven spelling and grammatical errors left), when I saw another change I’d make.

def reference_components(reference_match, reference_type):
    components = re.findall(r"\d+", reference_match)

    for group in TypeGroups:
        if reference_type in group.Types:
            return group.Parts(components)

def build_canonical_form(canonical_form, year, d1, d2):
    return (canonical_form.replace("dddd", year)
        .replace("d1", d1)
        .replace("d2", d2))

def canonicalise_reference(reference_match, reference_type, canonical_form):
    year, d1, d2 = reference_components(reference_match, reference_type)

    corrected_reference = build_canonical_form(canonical_form, year, d1, d2)

    return corrected_reference, year, d1, d2

Again, nothing huge but just another little clarification.

That really is it. For now!

Tagged the-trade, and python

Thursday 29 February 2024 Let’s Predict! : Yr glygfa o Hescwm Uchaf, rhif 8

For a talk I’m thinking about for later this year or early next, I’ve decided I need a text prediction engine. The basic idea is that there are lots of things in software that seem kind of amazing, even to people who work in software. Compilers & interpreters, operating systems, windowing systems are obvious choices, as is lots of games stuff^[1]. Those are, however, all pretty large topics. You can’t write an o/s in forty minutes. I can’t anyway.

What I think I can do a text prediction engine. You know the kind of thing, that feeds you words as you type on your phone or in your word processor.

Having typed the world 'Hello', my phone keyboard has switched from spelling correction to text prediction, offering me a choice of 'world', 'there', and 'how'.

If I can explain how a ✨magic✨ piece of software works in forty minutes or an hour, demystify the whole thing, I can start lift the lid for people on some of these other ✌︎hard✌︎ things.

Perhaps because my programming first language is C++, I’ve had people who work in C# or JavaScript or Python or similar say that all that low-level stuff is too complicated for them. I’ve also encountered a lot of C++ (and C and assembler) snobs who look down the noses at languages you can’t cause a segfault with and the people who use them.

The latter are probably beyond redemption, but it kind of boils my piss that there are programmers who’ve taken some of that snobbery onto themselves and accept that it’s founded in any kind of truth. C++ programmers aren’t on some rarefied mountain top. Programming’s programming. Software’s the most malleable medium we could wish to work in. Anybody can do anything in software.

Artifical Woo-woo

My phone keyboard asked me again last month if I wanted to turn on some kind of AI assistant to boost productivity and unlock creativity. I obviously said no^[2], but it’s all part of stoking up the idea that text prediction is a difficult problem, best left to the big brains at Google or Apple or Microsoft with their shiny offices and free lunches in Silicon Valley and Redmond.

A text predictor in few lines of JavaScript should pop that bubble. A bit. Hopefully.

Keeping it Real (or whatever today’s young people say)

Why use JavaScript? A couple of reasons:

Everyone can read it. It is the Lingua Franca of the current age.
People shit on it all the time^[3] even though they should know better.

If you can do something in JavaScript, language snobs of all stripes just have to shut up. Meanwhile, everyone else can understand the code you’re showing them.

And we’re off

Anyway, as I type I’ve no idea what I’m doing beyond the phrase markov chain, so here goes.

Markov chains are stochastic models of event sequences, where the probability of the next event only depends on the current event.

Helpful.

For text prediction, we take the current word - let’s say hello - and randomly select the word that comes next.

But if we randomly select a word, doesn’t that just mean it could be anything?

Well no, because I skipped over the model part.

Before we start predicting we build the model. We read a lot of text and every word we encounter, we make a note of the words that come immediately afterwards.

For hello, lets says that it’s followed by world, Dolly, and everybody.

Our prediction for the word following hello will, therefore, be one of those three words.

As we build the model, we don’t just note those following words, we also keep track of how often they occur. Imagine hello is followed by world three times, Dolly once, and everybody twice.

hello  ->  (world, 3)
           (everybody, 2)
           (Dolly, 1)

Our prediction is going to choose one of those three words, but the selection will be weighted by how often they occurred. Here we’d expect world to come up more often than everybody, for example.

The particular words and their weights will depend on your text corpus^[4]. The complete works of Jane Austen, for example, is going to end up with quite a different model to that of yesterday’s football commentary, which will be different again to a model built from what you’ve typed into your phone.

We can make things sound more artificial intelligency by saying training instead of building. Training makes things sound like Neo in the Matrix, while building sounds boring and mechanistic. Actually they mean exactly the same thing. We’re taking an input of some sort, which we use to construct (ie build or train) a representation of that input (a model).

The model we’ve made can then be used in various ways. We can categorise or classify things, for instance - yes this is a picture of cat, no that is a picture of a traffic light - by seeing how well new inputs match the model.

What we’re doing here is using input as a stimulus to generate new output seemingly out of the air. Our model, full of words and their frequencies, has captured some essence of the text it’s been build from. By giving it a word to start with, it will produce a second word for us. If we feed that word back into as the next input, it’ll generate a third word. And so on and so on.

We’ll see that in due course, it hope.

Code!

While writing this, I’ve been poking around with code at the same time. It’s took a little bit longer than 40 minutes (not an excuse, but I was watching the https://www.fih.hockey/events/fih-pro-league[FIH ProLeague hockey at the same time, title='Hockey is top sport'), but I got there more easily than I’d expected.

The implements the little data structure I sketched out above, with a small amount of extra scaffolding to let you populate it with pairs of words.

The code lives up on GitHub, and you’re welcome to poke around with it.

The Future

Here’s my silly hello world/everyone/Dolly example worked up into code.

import {make_chain} from "../src/chain.mjs";

const chain = make_chain()

chain.add('hello', 'world')
chain.add('hello', 'everyone')
chain.add('hello', 'Dolly')
chain.add('hello', 'everyone')
chain.add('hello', 'world')
chain.add('hello', 'world')

for (let i = 0; i !== 10; ++i)
  console.log('hello ' + chain.predict('hello'))

And here it is predicting

Console output from the example program. It’s ten lines of text reading hello world

The core of a text prediction engine in 66 lines of JavaScript^[5]. Not bad.

The Future Future

A markov chain text predictor that only knows how to predict from one word isn’t a great deal of use. We need more words. It was, however, bed time when I’d wrapped the code up, and so rather late to feed it the complete works of Jane Austen.

That’s where I’ll go next time.

1. And not just fancy graphics business. I have no clue how you’d put together something like the computer opponents in Dominion, for example.

2. Because honestly, how much productivity boosting and creativity unlocking do you to send WhatsApp messages? And, in any case, it can’t do that.

3. Tool Shaming, Don’t Do It

4. Text corpus being the technical term for a big ole pile of words

5. find ./src -print | grep mjs$ | xargs grep -v ^$ | wc -l

Tagged the-trade, and javascript

Monday 26 February 2024 First ice cream van of the year …

A longitudinal study

Up until the 2023 sample the lead researcher was resident in Birmingham, and so first ice cream sightings were recorded in that locality. They now live in Pembrokeshire by the seaside, a location well-known for tourists and, consequently, a rather different ice cream van ecosystem. As a result, the study proposed to recruit Birmingham-based field agents for this and future years' study.

Consequently, this year’s sighting comes from Ice Cream Agent Matthew Somerville, to whom we are indebted. He recorded this van in Swanshurst Park at 16:29 on Sunday February 24. He reports

Very reasonably priced too. Single cone with flake, single cone without flake, sprinkles and sauce added without requesting. £3.50, card payments taken.

A very clean and shiny looking ice cream van. Reasonably priced too.

Field observations

Does it have to be a sighting? I’ve been hearing them by mine since January
Franklin Road, B30 last week sometime, playing The Peanut Vendor. Pers com via Slack.

Analysis

First Ice Cream Van Of The Year 2004-2024

The full ice cream van data is available as a spreadsheet.

Forward projection

Projection for next year’s first sighting is on or around the afternoon of 4^th March 2025, with an envelope of +2/-2 days.

Previous Years

Tagged icecream

Older posts are available in the archive or through tags.

« To see a World in a Grain of Sand : Yr glygfa o Hescwm Uchaf, rhif 7

Jez Higgins

Freelance software grandad
software created
extended or repaired

Follow me on Mastodon
Applications, Libraries, Code
Talks & Presentations

Hire me
Contact

Older posts are available in the archive or through tags.

Feed

Jez Higgins

Thursday 07 March 2024 If You’re So Smart…​ : Yr glygfa o Hescwm Uchaf, rhif 9

But if I’m so clever, what would I have done?

One Step At A Time

Clearing Visual Noise

Move the action down

Looking Up And Out

Typos Must Die

Constants

Birds of Feather

Destructure FTW!

An extra level of indirection

Yoink out the decision making

Say What You Mean

Search

And Rest

You Wouldn’t Let It Lie

It’s a new day

Reflecting

The Big Finish

PS

PPS

PPPS (really, the last one, I promise)

Thursday 29 February 2024 Let’s Predict! : Yr glygfa o Hescwm Uchaf, rhif 8

Artifical Woo-woo

Keeping it Real (or whatever today’s young people say)

And we’re off

Code!

The Future

The Future Future

Monday 26 February 2024 First ice cream van of the year …​

Jez Higgins

Thursday 07 March 2024 If You’re So Smart… : Yr glygfa o Hescwm Uchaf, rhif 9

Monday 26 February 2024 First ice cream van of the year …