QRS: Matching “.” in UTF-8

QRS: Matching “.” in UTF-8
2024-12-18 20:0:0 Author: www.tbray.org(查看原文) 阅读量:0 收藏

Back on December 13th, I posted a challenge on Mastodon: In a simple UTF-8 byte-driven finite automaton, how many states does it take to match the regular-expression construct “.”, i.e. “any character”? Commenter Anthony Williams responded, getting it almost right I think, but I found his description a little hard to understand. In this piece I’m going to dig into what . actually means, and then how many states you need to match it.

The answer surprised me. Obviously this is of interest only to the faction of people who are interested in automaton wrangling, problematic characters, and the finer points of UTF-8. I expect close attention from all 17 of you!

The answer is… · Four. Or five, depending.

What’s a “Unicode character”? · They’re represented by “code points”, which are numbers in the range 0 … 17×2¹⁶, which is to say 1,114,112 possible values. It turns out you don’t actually want to match all of them; more on that later.

How many states? · Quamina is a “byte-level automaton” which means it’s in a state, it reads a byte, looks up the value of that byte in a map yielding either the next state, or nil, which means no match. Repeat until you match or fail.

What bytes are we talking about here? We’re talking about UTF-8 bytes. If you don’t understand UTF-8 the rest of this is going to be difficult. I wrote a short explainer called Characters vs. Bytes twenty-one years ago. I now assume you understand UTF-8 and knew that code points are encoded as sequences of from 1 to 4 bytes.

Let’s count!

When you match a code point successfully you move to the part of the automaton that’s trying to match the next one; let’s call this condition MATCHED.

(From here on, all the numbers are hex, I’ll skip the leading 0x. And all the ranges are inclusive.)
In multi-byte characters, all the UTF-8 bytes but the first have bitmasks like 10XX XXXX, so there are six significant bits, thus 2⁶ or 64 distinct possible values ranging from 80-BF.
There’s a Start state. It maps byte values 00-7F (as in ASCII) to MATCHED. That’s our first state, and we’ve handled all the one-byte code points.
In the Start state, the 32 byte values C0-DF, all of which begin 110 signaling a two-byte code point, are mapped to the Last state. In the Last state, the 64 values 80-BF are mapped to MATCHED. This takes care of all the two-byte code points and we’re up to two states.
In the Start state, the 16 byte values E0-EF, all of which begin 1110 signaling a three-byte code point, are mapped to the LastInter state. In that state, the 64 values 80-BF are mapped to the Last state. Now we’re up to three states and we’ve handled the three-byte code points.
In the Start state, the 8 byte values F0-F7, all of which begin 11110 signaling a four-byte code point, are mapped to the FirstInter state. In that state, the 64 values 80-BF are mapped to the LastInter state. Now we’ve handled all the code points with four states.

But wait! · I mentioned above about not wanting to match all the code points. “Wait,” you say, “why wouldn’t you want to be maximally inclusive?!” Once again, I’ll link to Unicode Character Repertoire Subsets, a document I co-wrote that is making its way through the IETF and may become an RFC some year. I’m not going to try to summarize a draft that bends over backwards to be short and clear; suffice it to say that there are good reasons for leaving out several different flavors of code point.

Probably the most pernicious code points are the “Surrogates”, U+D800-U+DFFF. If you want an explanation of what they are and why they’re bad, go read that Repertoire Subsets draft or just take my word for it. If you were to encode them per UTF-8 rules (which the UTF-8 spec says you’re not allowed to), the low and high bounds would be ED,A0,80 and ED,BF,BF.

Go’s UTF-8 implementation agrees that Surrogates Are Bad and The UTF-8 Spec Is Good and flatly refuses to convert those UTF-8 sequences into code points or vice versa. The resulting subset of code points even has a catchy name: Unicode Scalars. Case closed, right?

Wrong. Because JSON was designed before we’d thought through these problems, explicitly saying it’s OK to include any code point whatsoever, including surrogates. And Quamina is used for matching JSON data. So, standards fight!

I’m being a little unfair here. I’m sure that if Doug Crockford were inventing JSON now instead of in 2001, he’d exclude surrogates and probably some of the other problematic code points discussed in that Subsets doc.

Anyhow, Quamina will go with Go and exclude surrogates. Any RFC8259 purists out there, feel free accuse me of standards apostasy and I will grant your point but won’t change Quamina. Actually, not true; at some point I’ll probably add an option to be more restrictive and exclude more than just surrogates.

Which means that now we have to go back to the start of this essay and figure out how many states it takes to match “.” Let’s see…

The Start state changes a bit. See #5 in the list above. Instead of mapping all of E0-EF to the LastInter state, it maps one byte in that range, ED, to a new state we’ll call, let’s see, how about ED.
In ED, just as in LastInter, 80-9F are mapped to Last. But A0-BF aren’t mapped to anything, because on that path lie the surrogates.

So, going with the Unicode Scalar path of virtue means I need five states, not four.

文章来源: https://www.tbray.org/ongoing/When/202x/2024/12/18/Matching-Dot-in-UTF8
如有侵权请联系:admin#unsafe.sh