[antlr-interest] Problems with Unicode support in ANTLR

micheal_jor open.zone at virgin.net
Thu May 16 08:31:01 PDT 2002


Hi All,

I am currently trying to develop a Lexer (and later a Parser) using 
ANTLR for a language that must be able to deal with Unicode 
extensively.

The basic issue is that the definition of the language (like that of 
Java, in fact) refers to Unicode categories or classes, so I need a way 
to direct ANTLR to accept or reject all the characters defined to be 
in such Unicode classes. I can see three general solutions to this:

a) Use ANTLR's built-in support for Unicode, including 
categories and classes

   This would be ideal but ANTLR hasn't evolved to this state yet.  :-(


b) Use a rule that matches any character but then applies a predicate 
to validate the character. For instance:

        protected UNICODE_CLASS_Nl
          :  (   { IsUnicodeClass_Nl(LA(1)) }? . 
             |   { IsUnicodeClass_Nl(esc_char.getText()) }? esc_char:UNICODE_ESCAPE_SEQUENCE 
             )
          ;

   This was my first course of action, but it led to a LOT of 
ambiguity warnings that I don't know how to turn off ;-(
   Any ideas on how to turn these warnings off selectively, please?
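
   To make the idea behind the predicate in (b) concrete, here is a 
minimal sketch of what a helper like IsUnicodeClass_Nl could look like. 
It is written in Java rather than C#, purely as an illustration. The 
class name, the method names and the assumed escape format (a 
backslash, the letter u, and exactly four hex digits) are all 
hypothetical and do not come from ANTLR or from my grammar.

```java
public final class UnicodeClassNl {
    private UnicodeClassNl() {}

    /** True if the code point is in general category Nl (Letter, Number). */
    public static boolean isNl(int codePoint) {
        return Character.getType(codePoint) == Character.LETTER_NUMBER;
    }

    /**
     * Variant for the text of an escape sequence. ASSUMPTION: the text is
     * a backslash, the letter 'u', and exactly four hex digits. Anything
     * else is rejected rather than guessed at.
     */
    public static boolean isNl(String escape) {
        if (escape == null || escape.length() != 6
                || escape.charAt(0) != '\\' || escape.charAt(1) != 'u') {
            return false;
        }
        try {
            return isNl(Integer.parseInt(escape.substring(2), 16));
        } catch (NumberFormatException e) {
            return false; // the four characters were not hex digits
        }
    }
}
```

   In a C# lexer the same test could presumably be built on 
Char.GetUnicodeCategory and UnicodeCategory.LetterNumber. The table 
lookup then lives in the runtime library instead of in the generated 
lexer, which is why I expect (b) to stay small.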


c) Define all the UNICODE categories directly within the ANTLR 
definition file
   (Can one ANTLR definition file #include another ANTLR definition 
file with all such UNICODE classes?).

   For instance:

        protected UNICODE_CLASS_Nl           // Unicode Category or Class: Nl
          :  ( '\u16EE'..'\u16F0' 
             | '\u2160'..'\u2183'
             | '\u3007'..'\u3007'
             | '\u3021'..'\u3029'
             | '\u3038'..'\u303A'
             )
           ;

   This option had the effect of generating HUGE lexer files - 
currently over 100kB with four categories partially defined. There 
are 32 such categories, although I only need about half. It also 
produced lots of errors because of the numeric size of the parameters 
to the calls to 'matchRange'. The first range above - 
'\u16EE'..'\u16F0' - generates the following call:

          matchRange('\x4543d','\x45430');

I suspect this is due to a bug in the C# code generator (IOW it's 
probably my bug, since I am part of the team that wrote it), because 
all the character values in the definition are valid. I have used the 
following option:

           charVocabulary = '\u0003'..'\uFFFE';
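
The range lists for option (c) would not have to be typed by hand. As 
a rough sketch, a small Java program could derive them from the 
runtime's own character tables and print them as ANTLR alternatives. 
The class and method names here are made up for illustration, only the 
BMP (U+0000 to U+FFFF) is scanned to match the charVocabulary above, 
and the ranges printed depend on the Unicode version of the Java 
runtime, so they may not match the list shown earlier exactly.

```java
import java.util.ArrayList;
import java.util.List;

public final class CategoryRanges {
    private CategoryRanges() {}

    /** Collects maximal runs of BMP code points in a category as {first, last} pairs. */
    public static List<int[]> ranges(int category) {
        List<int[]> out = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            boolean in = Character.getType(cp) == category;
            if (in && start < 0) {
                start = cp;
            } else if (!in && start >= 0) {
                out.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(new int[] {start, 0xFFFF}); // run reaches the end of the BMP
        }
        return out;
    }

    /** Formats the pairs as a parenthesised list of ANTLR range alternatives. */
    public static String toAntlrAlternatives(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ranges.size(); i++) {
            int[] r = ranges.get(i);
            sb.append(i == 0 ? "( " : "| ");
            // Always four upper-case hex digits per character value.
            sb.append(String.format("'\\u%04X'..'\\u%04X'%n", r[0], r[1]));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(toAntlrAlternatives(ranges(Character.LETTER_NUMBER)));
    }
}
```

This only helps with maintaining the definition file. It does nothing 
about the size of the generated lexer, which is my main worry with (c).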


CONCLUSION:

I would have loved to be able to use option (a). Since I don't have 
that option, I thought option (b) would be clearer and more succinct 
than (c) and would perform better given its vastly reduced code size. 

I will track down the origins of the errors on option (c) but I 
dislike it because it results in a huge ANTLR definition file and a 
huge generated Lexer/Parser source file. Am I right in thinking it 
would result in perhaps the least performant parsers?

What do you fine people suggest?


Cheers,

Micheal




 
