unicode strings using supplemental char range

flack at cs.purdue.edu
Sun Jun 20 17:13:38 PDT 2004


> Actually, I just had an idea.  First, thanks to your help, I know that 
> UTF-16 encoded in a string is unambiguously UTF-16.  Now, the only 
> question is, how do we match a 21-bit char against it?  What if we just 
> specified that the input must be UTF-16 also?  Then, ANTLR can pretend 
> everything is 16 bits, right?

The underlying idea here seems fine to me: pre-encode your user input into
UTF-16 and then do the matching in the UTF-16 domain.  It's analogous to the
'provincial-safe' programming style I described in
http://www.gjt.org/javadoc/org/gjt/cuspy/ByteString.html , only even safer,
as the careful construction of UTF-16 ensures that no matter where you jump
into a string you can never mistake either a high or a low surrogate for a
plane-0 code point.  You will probably have to do something messy for things
like . and [] that expect to match a single character, since a single
character may now be two 16-bit units; you can probably judge better than I
how much of an algorithmic complication that would be.
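
To make the . case concrete, here is a rough sketch of the stepping logic I
have in mind.  The class and method names (Utf16Step, unitsAt,
countCharacters) are made up for illustration and are not anything in ANTLR,
and it assumes a Java that has the surrogate helpers on java.lang.Character
(5.0 and later).

```java
// Hypothetical helper, not part of ANTLR: decides how many UTF-16 code
// units the "character" starting at a given index occupies.
final class Utf16Step {
    private Utf16Step() {}

    /** Returns 2 if a well-formed surrogate pair starts at i, otherwise 1. */
    static int unitsAt(char[] buf, int i) {
        if (Character.isHighSurrogate(buf[i])
                && i + 1 < buf.length
                && Character.isLowSurrogate(buf[i + 1])) {
            return 2;
        }
        return 1; // plane-0 code point, or an unpaired surrogate
    }

    /** Counts how many times a single-character wildcard would match. */
    static int countCharacters(char[] buf) {
        int count = 0;
        for (int i = 0; i < buf.length; i += unitsAt(buf, i)) {
            count++;
        }
        return count;
    }
}
```

The range tests inside isHighSurrogate and isLowSurrogate are the whole
trick: any 16-bit unit you land on tells you by its value alone whether it
is the first half of a pair, the second half, or a plane-0 code point.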

As a possible downside, the scheme might not detect mispaired surrogates in
the input as promptly as going through a UnicodeBuffer would.  But then
again, it might: if all your pre-encoded user input is guaranteed well-formed,
then perhaps you can prove that no recognizer you can build would be able
to match mispaired surrogates in the input, so they would always be errors.
That's conjecture from me, cos I'm too lazy at the moment to try to prove it.
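
For reference, checking well-formedness directly is cheap.  A minimal sketch,
with a made-up class name (SurrogateCheck) and again assuming the surrogate
helpers on java.lang.Character; it says nothing about whether the conjecture
above actually holds.

```java
// Hypothetical helper: locates the first mispaired surrogate in UTF-16 text.
final class SurrogateCheck {
    private SurrogateCheck() {}

    /** Returns the index of the first mispaired surrogate, or -1 if none. */
    static int firstMispaired(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip its low half
                } else {
                    return i; // high surrogate with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high half before it
            }
        }
        return -1;
    }
}
```

Running something like this over the pre-encoded strings would be one way to
establish the "ensured well-formed" premise before generating a recognizer.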

A superficial part of the idea I would quibble with: yes, the user's input
should be pre-encoded into UTF-16, but no, don't make the poor HUMAN do the
mechanical encoding.  When the user is talking about a single character,
s/he shouldn't have to go figure out how to encode it as a pair of
surrogates.  So let the user say '\u1039f' or 'UGARITIC WORD DIVIDER';
don't make the user pull out a calculator just to write "\ud800\udf9f",
which would be bad anyway because you probably wouldn't accept it in []
or some other context where you expect a single character.  Let the
computer do the pre-encoding of the user's strings, and keep the mechanics
of surrogate-pair encoding out of the user's view.  The pre-encoding also
has to work if the user has a Unicode editor and just types a UGARITIC
WORD DIVIDER right in the source, which you would see as \ud800\udf9f coming
from the Reader, and you still have to treat it as a single character.
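
The calculator work the computer should be doing is just this.  A sketch
only: SurrogateEncoder and encode are names I made up, and on a Java that
has it, Character.toChars does the same job.

```java
// Hypothetical helper: the surrogate-pair arithmetic, spelled out.
final class SurrogateEncoder {
    private SurrogateEncoder() {}

    /** Encodes one Unicode code point as one or two UTF-16 code units. */
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogate code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            return new char[] { (char) codePoint }; // plane 0: one unit
        }
        int v = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1039F); // UGARITIC WORD DIVIDER
        // prints "d800 df9f"
        System.out.println(String.format("%04x %04x", (int) pair[0], (int) pair[1]));
    }
}
```

For U+1039F that works out to 0x1039F - 0x10000 = 0x039F, whose top 10 bits
are 0 and bottom 10 bits are 0x39F, hence d800 followed by df9f.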

Of course that's easy if you still use a UnicodeBuffer, at least for reading
the user's grammar file: you would simply see the character as a single
character without knowing how it came in.  Then you just process the
explicit forms like 'UGARITIC WORD DIVIDER', encode everything to UTF-16,
and generate the UTF-16 recognizer.
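
Roughly, both routes end up in the same place.  In this sketch GrammarLiteral,
fromHex and codePoints are invented names, and the escape handling only
covers bare hex digits; it is not meant as ANTLR's actual grammar syntax or
as the UnicodeBuffer API.

```java
// Hypothetical helper: two ways a supplementary character can arrive in a
// grammar file, both normalized to the same UTF-16 form.
final class GrammarLiteral {
    private GrammarLiteral() {}

    /** Turns the hex digits of a code point, e.g. "1039f", into UTF-16. */
    static String fromHex(String hexDigits) {
        int codePoint = Integer.parseInt(hexDigits, 16);
        // appendCodePoint emits a surrogate pair when one is needed
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    /** Views text as whole code points, however the characters arrived. */
    static int[] codePoints(String text) {
        int[] out = new int[text.codePointCount(0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 1 or 2 UTF-16 units
        }
        return out;
    }
}
```

Whether the user wrote the hex form or typed the character directly, the
tool sees one code point going in and emits the same two 16-bit units for
the recognizer.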

Btw, did any of my suggestions about composable input filters I wrote up,
hmm, maybe a few years ago now, have any appeal at the time?  UTF-16 could
probably provide additional example use cases for them....

-Chap


