[antlr-interest] Lexing C-style strings - problems matching characters not in vocab
Martin Probst
mail at martin-probst.com
Sat Feb 25 14:57:59 PST 2006
Hi,
> I don't think my STRING rule will match characters such as £, ©, ¼
> and so on. What do I do about this? Add them explicitly to the
> expression? I can't go through the entire Unicode specs adding every
> character to my rule - it would be huge.
>
> I looked at Scanning Unicode Characters in the docs, but this only
> refers to 16-bit Unicode characters - what do I do for characters
> outside this arbitrary limit?
Well, as far as Java is concerned, Unicode is 16 bits wide: a Java char
is a 16-bit value, and characters outside the 16-bit range are
represented by surrogate pairs, as far as I know. As someone else said,
you can use the charVocabulary option to include more characters. If you
use C++, it's going to be more difficult though.
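To illustrate the surrogate-pair point, here is a minimal Java sketch (not ANTLR code; the class and method names are made up for this example). It assumes Java 5 or later, where the code point methods on Character and String are available, and shows that a character outside the 16-bit range occupies two chars:

```java
public class SurrogateDemo {
    // Number of 16-bit Java chars needed to store one Unicode code point.
    public static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // U+00A3 POUND SIGN lies inside the 16-bit range: a single char.
        String pound = new String(Character.toChars(0x00A3));
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside it: a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(pound.length());                             // 1
        System.out.println(clef.length());                              // 2
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));   // true
        // Counting code points instead of chars gives one character again.
        System.out.println(clef.codePointCount(0, clef.length()));      // 1
    }
}
```

So a lexer that works on Java chars sees such a character as two separate 16-bit values, each of which has to be in the vocabulary.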
Writing down all Unicode characters actually isn't that horrible. You
can use character ranges and end up with something sensible. I once ran
into a problem, though, when entering the character ranges from the XML
standard: the arrays ANTLR generated got too big for the Java compiler
and the class file format :-(
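As a rough idea of how compact a range-based description can be, the following Java sketch lists the ranges of the "Char" production from XML 1.0 and tests a code point against them. This is only an illustration in plain Java, not an ANTLR grammar and not necessarily the set of ranges referred to above; the class and method names are hypothetical.

```java
public class XmlCharRanges {
    // Inclusive [low, high] code point ranges of the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final int[][] RANGES = {
        {0x9, 0xA},
        {0xD, 0xD},
        {0x20, 0xD7FF},
        {0xE000, 0xFFFD},
        {0x10000, 0x10FFFF},
    };

    // Returns true if the code point falls inside one of the ranges.
    public static boolean isXmlChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x00A3));   // pound sign
        System.out.println(isXmlChar(0xD800));   // lone surrogate value
        System.out.println(isXmlChar(0x1D11E));  // outside the 16-bit range
    }
}
```

Five ranges cover the whole production, which is why writing ranges by hand stays manageable even though the number of individual characters is huge.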
Martin