Implementing regular expressions is hard. Hard in interesting ways that make me want to share the lessons. Thus this series, QRS for short.
People who keep an eye on my Quamina open-source pattern-matching project will have noticed a recent absence of updates and conversation. That’s because, pursuant to Issue #66, I’m working on adding fairly-full regular-expression matching.
Personal note · This is turning out to be hard. Either it’s the hardest nut I’ve had to crack in many years, or maybe my advanced age is dulling my skills. It’s going to be weeks before I can do the first incremental release. Whatever; the learning experiences coming out of this work still feel fresh and fascinating and give me the urge to share.
I hope I can retain that urge as long as I’m still mentally present. In fact, I hope I can retain the ability to work on software. For various reasons, I’ve been under a lot of personal stress in recent years. Stealing time from my adult responsibilities to wrestle with executable abstractions has been a pillar of my sanity.
Anyhow, normally when I code I blog about it, but so far I haven’t because the work is unfinished. Then I realized that it’s too big, and addresses too many distinct problems, to be just one piece, thus this mini-series.
[Readers who don’t know what regular expressions are should probably close this tab now. Don’t feel guilty, nobody who’s not a full-time computing professional should have to know much less care.]
[Notation: I’m gonna say “Regexp” or maybe just “RE” in this series.]
There are at least three pieces I know I’m going to write:
Parsing RE syntax. (Already written!)
Representing parsed REs.
Implementing UTF-8 based automata for REs.
At the moment, I think the hardest part of the work is #1, Parsing. (Maybe that’s because I haven’t really dug very deep into #3 yet.) I’d be amazed if the final series had only three parts.
Now, introductory material.
Which regular expressions? · They come in lots of flavors. The one I’m implementing is I-Regexp, RFC 9485. The observant reader will notice that I co-edited that RFC, and I cheerfully confess to bias.
I-Regexp is basically a subset of XSD Regular Expressions (chosen as the thing to subset because they have a nice, clean, immutable spec), which are a lot like good ol’ PCRE (Perl-compatible regular expressions). Except for:
They are designed assuming they will only ever be used to match against a string and return a “yes” or “no” answer.
They are anchored, which is to say that (unlike PCREs) they’re all assumed to start with ^ and end with $.
They omit popular single-character escapes like \w and \S because those are sketchy in the Unicode context.
They don’t have capture groups or back-references.
They don’t support character class subtraction, e.g. [a-z-[m-p]].
I’m going to claim that they hit a very useful 80/20 point if what you’re interested in is asking “Did the field value match?”, which of course is all Quamina is interested in doing.
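To make the anchored, yes-or-no point concrete, here’s a minimal sketch using Go’s standard regexp package, not anything in Quamina: an I-Regexp such as [0-9]+ behaves as though it were wrapped in ^…$.

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	// Go's regexp does substring matching by default…
    	loose := regexp.MustCompile(`[0-9]+`)
    	fmt.Println(loose.MatchString("order-42")) // true: "42" appears somewhere

    	// …whereas I-Regexp semantics are anchored and whole-string,
    	// equivalent to wrapping the expression in ^…$.
    	anchored := regexp.MustCompile(`^[0-9]+$`)
    	fmt.Println(anchored.MatchString("order-42")) // false
    	fmt.Println(anchored.MatchString("42"))       // true
    }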
Project strategy · I’m totally not going to try to do all this as a big bang. I’ve got a reliable RE parser now (it was hard!) that recognizes ten different RE features, ranging from . to everything in (a+b*c?).|d[ef]{3,9}\?\P{Lu}
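For a feel of what “parsing” produces, here’s a sketch that feeds that same expression to Go’s standard regexp/syntax parser, which is not what Quamina uses, and dumps the resulting tree one node per line.

    package main

    import (
    	"fmt"
    	"regexp/syntax"
    )

    // dump prints a parsed regexp tree, indenting children by depth.
    func dump(r *syntax.Regexp, indent string) {
    	fmt.Printf("%s%v: %s\n", indent, r.Op, r.String())
    	for _, sub := range r.Sub {
    		dump(sub, indent+"  ")
    	}
    }

    func main() {
    	re, err := syntax.Parse(`(a+b*c?).|d[ef]{3,9}\?\P{Lu}`, syntax.Perl)
    	if err != nil {
    		panic(err)
    	}
    	dump(re, "")
    }

Quamina’s parser builds its own representation, which is what a later installment in this series is about.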
Unbackslashing again · Go check out Unbackslash. Tl;dr: It’s terribly painful to deal with the standard RE escaping character \ in Go software that is processing JSON, because both Go and JSON use \ for escaping and your unit tests eventually fill up with \\ and \\\\\\\\ and become brutally hard to read. So after publishing that blog piece and running polls on Mastodon, ~ is the new \. So that RE above becomes (a+b*c?).|d[ef]{3,9}~?~P{Lu}.
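Here’s a sketch of the pain, with a made-up “pattern” field standing in for whatever JSON shape the pattern actually lives in: the same single-character escape written both ways, as a Go string literal containing JSON.

    package main

    import "fmt"

    func main() {
    	// The matcher should see the RE fragment \P{Lu} (one backslash).
    	// JSON needs that backslash escaped, and the Go string literal
    	// needs each JSON backslash escaped again:
    	withBackslash := "{\"pattern\": \"\\\\P{Lu}\"}"

    	// With ~ as the escape character, nothing needs escaping anywhere:
    	withTilde := "{\"pattern\": \"~P{Lu}\"}"

    	fmt.Println(withBackslash) // {"pattern": "\\P{Lu}"}
    	fmt.Println(withTilde)     // {"pattern": "~P{Lu}"}
    }

(Raw backtick strings take away the Go layer’s doubling, but the JSON layer’s never goes away.)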
You’re allowed to not like it. But I request that you hold off pushing the big button that sends me to Hell until you’ve tried writing a few unit tests for REs that you want Quamina to process.
Back to strategy: The first feature is going to be that lovely little dot operator. And thus…
Quiz · Just for fun, here’s an intellectual challenge. Suppose you’re building a byte-at-a-time state machine to process UTF-8 text. How many states, roughly, would it take to match ., i.e. any single Unicode code point? By “match” I mean reject any byte sequence that doesn’t, and when it does match, consume just enough bytes to leave you positioned after the . and ready to start matching whatever’s next.
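Not the answer, but here’s the raw material: a sketch of RFC 3629’s rule for how many bytes follow each lead byte. The wrinkle, deliberately left out below, is that some lead bytes also constrain which continuation bytes may follow them, and that’s where the state count gets interesting.

    package main

    import "fmt"

    // continuations reports how many continuation bytes follow a UTF-8 lead
    // byte, per RFC 3629, and whether the byte can start a sequence at all.
    // It ignores the extra rules (no overlong encodings, no UTF-16 surrogates,
    // nothing above U+10FFFF) that further restrict what may follow
    // 0xE0, 0xED, 0xF0, and 0xF4.
    func continuations(lead byte) (int, bool) {
    	switch {
    	case lead <= 0x7F:
    		return 0, true // single-byte sequence (ASCII)
    	case lead >= 0xC2 && lead <= 0xDF:
    		return 1, true // two-byte sequence
    	case lead >= 0xE0 && lead <= 0xEF:
    		return 2, true // three-byte sequence
    	case lead >= 0xF0 && lead <= 0xF4:
    		return 3, true // four-byte sequence
    	default:
    		return 0, false // 0x80–0xC1 and 0xF5–0xFF never start a sequence
    	}
    }

    func main() {
    	for _, b := range []byte{'a', 0xC3, 0xE2, 0xF0, 0xC0} {
    		n, ok := continuations(b)
    		fmt.Printf("0x%02X: %d continuation byte(s), valid lead byte: %v\n", b, n, ok)
    	}
    }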
I think I’ve found the correct answer. It surprised me, so I’m still sanity-checking, but I think I’m right. I am convinced the problem isn’t as simple as it looks.