[antlr-interest] Problems with Unicode support in ANTLR

Thu May 16 10:31:14 PDT 2002

Are the predefined Unicode blocks that are handled by 
java.lang.Character.UnicodeBlock sufficient for what you need? Or, do 
you need a different classification?
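
For reference, blocks and general categories are different classifications. A minimal sketch, assuming only the standard java.lang.Character API (the class name BlockVersusCategory and its helper methods are made up for illustration): two characters from the same block can fall into different categories, so a UnicodeBlock test alone cannot stand in for a category such as Nl.

```java
class BlockVersusCategory {
    /** True if c is in general category Nl (Letter, Number). */
    public static boolean isLetterNumber(char c) {
        return Character.getType(c) == Character.LETTER_NUMBER;
    }

    /** The predefined Unicode block that contains c. */
    public static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        char romanOne = (char) 0x2160;  // U+2160 ROMAN NUMERAL ONE
        char oneThird = (char) 0x2153;  // U+2153 VULGAR FRACTION ONE THIRD

        // Both characters live in the same block...
        System.out.println(blockOf(romanOne) + " / " + blockOf(oneThird));
        // ...but only the Roman numeral is in category Nl.
        System.out.println(isLetterNumber(romanOne) + " / " + isLetterNumber(oneThird));
    }
}
```

Character.getType reports whatever Unicode version the running JDK implements, so results for recently assigned characters may differ between JDK releases.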

I was thinking of patching ANTLR's Java generator to be able to use 
named Unicode character categories as "pre-defined" "protected" lexer 
rules, but supporting anything more than the Character class handles is 
over my head.
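
To give a rough idea of what the Character class could supply to such a patch, here is a sketch, not anything ANTLR does today. The class CategoryRanges and both method names are hypothetical. It scans the 16-bit range, asks Character.getType for each value, and collapses the matches into ranges written in the same '\uXXXX'..'\uXXXX' notation used for lexer rules.

```java
import java.util.ArrayList;
import java.util.List;

class CategoryRanges {
    /** Inclusive [first, last] ranges of 16-bit chars whose general category equals type. */
    public static List<int[]> rangesFor(int type) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;  // first char of the range being built, or -1 if none
        for (int c = 0; c <= 0xFFFF; c++) {
            boolean member = Character.getType((char) c) == type;
            if (member && start < 0) {
                start = c;                           // a new range opens here
            } else if (!member && start >= 0) {
                ranges.add(new int[] {start, c - 1}); // the range closed at c - 1
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, 0xFFFF});    // range runs to the end
        }
        return ranges;
    }

    /** Formats ranges as alternatives separated by " | ". */
    public static String toAlternatives(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(String.format("'\\u%04X'..'\\u%04X'", r[0], r[1]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the ranges for category Nl as known to the running JDK.
        System.out.println(toAlternatives(rangesFor(Character.LETTER_NUMBER)));
    }
}
```

The ranges printed depend on the Unicode version built into the JDK, and characters outside the 16-bit range are ignored, so treat the output as a starting point to check against the Unicode data files rather than as a definitive table.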

Also, the ANTLR documentation isn't really very clear about what kind of 
Unicode support ANTLR already has. What are the limitations?

- Brian

micheal_jor wrote:

>Hi All,
>
>I am currently trying to develop a Lexer (and later a Parser) using 
>ANTLR for a language that must be able to deal with Unicode 
>extensively.
>
>The basic issue is that the definition of the language (like Java's, 
>in fact) refers to Unicode categories or classes, so I need a way 
>to direct ANTLR to accept or reject all the characters defined to be 
>in such Unicode classes. I can see three general solutions to this:
>
>a) Use ANTLR's built-in support for Unicode that includes 
>categories and classes
>
>   This would be ideal but ANTLR hasn't evolved to this state yet.  :-(
>
>
>b) Use a rule that matches any character but then applies a predicate 
>to validate the character. For instance:
>
>        protected UNICODE_CLASS_Nl
>          :  (   { IsUnicodeClass_Nl(LA(1)) }? . 
>             |   { IsUnicodeClass_Nl(esc_char.getText()) }? esc_char:UNICODE_ESCAPE_SEQUENCE 
>             )
>        ;
>
>   This was my first course of action but it led to a LOT of 
>ambiguity warnings that I don't know how to turn off ;-(
>   Any ideas how to turn these warnings off selectively please?
>
>
>c) Define all the Unicode categories directly within the ANTLR 
>definition file
>   (Can one ANTLR definition file #include another ANTLR definition 
>file with all such Unicode classes?).
>
>   For instance:
>
>        protected UNICODE_CLASS_Nl           // Unicode Category or Class: Nl
>          :  ( '\u16EE'..'\u16F0' 
>             | '\u2160'..'\u2183'
>             | '\u3007'..'\u3007'
>             | '\u3021'..'\u3029'
>             | '\u3038'..'\u303A'
>             )
>           ;
>
>   This option had the effect of generating HUGE lexer files - 
>currently over 100kB with four categories partially defined. There 
>are 32 such categories although I only need about half. It also 
>produced lots of errors because of the numeric size of the parameters 
>to the calls to 'matchRange'. The first range above - 
>'\u16EE'..'\u16F0' - generates the following call:
>
>          matchRange('\x4543d','\x45430');
>
>I suspect this is due to a bug in the C# code generator (IOW it's 
>probably my bug since I am part of the team that wrote that) because 
>all the character values in the definition are valid. I have used the 
>following option:
>
>           charVocabulary		= '\u0003'..'\uFFFE';
>
>
>CONCLUSION:
>
>I would have loved to be able to use option (a). Since I don't have 
>that option I thought option (b) would be clearer and more succinct 
>than (c) and would perform better given its vastly reduced code size. 
>
>I will track down the origins of the errors on option (c) but I 
>dislike it because it results in a huge ANTLR definition file and a 
>huge generated Lexer/Parser source file. Am I right in thinking it 
>would also result in the slowest parsers?
>
>What do you fine people suggest?
>
>
>Cheers,
>
>Micheal
>
>Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 
>