[antlr-interest] Lexing C-style strings - problems matching characters not in vocab

Sat Feb 25 14:57:59 PST 2006

Hi,

> I don't think my STRING rule will match characters such as £, ©, ¼
> and so on. What do I do about this? Add them explicitly to the  
> expression? I can't go through the entire Unicode specs adding every  
> character to my rule - it would be huge.
> 
> I looked at Scanning Unicode Characters in the docs, but this only  
> refers to 16-bit Unicode characters - what do I do for characters
> outside this arbitrary limit?

Well, to Java, Unicode is 16-bit. Characters outside the 16-bit range
are represented in Java by surrogate pairs (two 16-bit char values
each), as far as I know. As someone else said, you can use the
charVocabulary option to include more characters. If you use C++, it's
going to be more difficult, though.
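To illustrate the surrogate pair point, here is a minimal Java sketch
(Java 5 or later). The class and method names are made up for this
example and have nothing to do with ANTLR itself. It only shows that a
character such as £ fits in one 16-bit char, while a character above
U+FFFF takes two, so anything that reads Java chars one at a time would
presumably see two values for it.

```java
public class SurrogateDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (actual characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00A3 POUND SIGN lies inside the 16-bit range.
        String pound = "\u00A3";
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(pound) + " char(s), "
                + codePoints(pound) + " code point(s)");
        System.out.println(codeUnits(clef) + " char(s), "
                + codePoints(clef) + " code point(s)");

        // The two chars of the clef are a high and a low surrogate.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));
    }
}
```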

Writing down all Unicode characters actually isn't that horrible. You
can use character ranges and end up with something sensible. I once ran
into a problem, though, when entering the character ranges from the XML
standard: the arrays ANTLR generated got too big for the Java compiler
and the class file format :-(
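To show what I mean by ranges instead of single characters, here is a
rough hand-written Java sketch. It is not what ANTLR generates, the
class and method names are hypothetical, and the table holds only a
handful of illustrative ranges, not the full list from the XML
standard. Each entry is an inclusive {low, high} pair, and the lookup
is a plain binary search over the sorted pairs.

```java
public class CharRanges {
    // Illustrative ranges only, sorted and non-overlapping.
    // Each row is an inclusive {low, high} pair of character values.
    static final int[][] RANGES = {
        {0x0041, 0x005A}, // A-Z
        {0x0061, 0x007A}, // a-z
        {0x00C0, 0x00D6}, // Latin-1 letters before the multiplication sign
        {0x00D8, 0x00F6}, // Latin-1 letters between the two math signs
        {0x00F8, 0x00FF}, // Latin-1 letters after the division sign
    };

    // Returns true if c falls inside one of the ranges above.
    static boolean inRanges(int c) {
        int lo = 0;
        int hi = RANGES.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < RANGES[mid][0]) {
                hi = mid - 1;       // c is below this range
            } else if (c > RANGES[mid][1]) {
                lo = mid + 1;       // c is above this range
            } else {
                return true;        // low <= c <= high
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRanges('A'));   // inside the first range
        System.out.println(inRanges(0x00E9)); // e with acute accent
        System.out.println(inRanges(0x00D7)); // multiplication sign, in a gap
    }
}
```

Five pairs cover well over a hundred characters here, which is why
ranges keep the list manageable.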

Martin