[antlr-interest] Problems with Unicode support in ANTLR
micheal_jor
open.zone at virgin.net
Thu May 16 08:31:01 PDT 2002
Hi All,
I am currently trying to develop a Lexer (and later a Parser) using
ANTLR for a language that must be able to deal with Unicode
extensively.
The basic issue is that the definition of the language (like that of
Java, in fact) refers to Unicode Categories or Classes, so I need a way
to direct ANTLR to accept or reject all the characters defined to be
in such Unicode classes. I can see three general solutions to this:
a) Use ANTLR's built-in support for Unicode that includes
categories and classes
This would be ideal but ANTLR hasn't evolved to this state yet. :-(
b) Use a rule that matches any character but then applies a predicate
to validate the character. For instance:
protected UNICODE_CLASS_Nl
    : ( { IsUnicodeClass_Nl(LA(1)) }? .
      | { IsUnicodeClass_Nl(esc_char.getText()) }?
        esc_char:UNICODE_ESCAPE_SEQUENCE
      )
    ;
This was my first course of action but it led to a LOT of
ambiguity warnings that I don't know how to turn off ;-(
Any ideas how to turn these warnings off selectively, please?
c) Define all the UNICODE categories directly within the ANTLR
definition file
(Can one ANTLR definition file #include another ANTLR definition
file with all such UNICODE classes?).
For instance:
protected UNICODE_CLASS_Nl   // Unicode Category or Class: Nl
    : ( '\u16EE'..'\u16F0'
      | '\u2160'..'\u2183'
      | '\u3007'..'\u3007'
      | '\u3021'..'\u3029'
      | '\u3038'..'\u303A'
      )
    ;
This option had the effect of generating HUGE lexer files -
currently over 100kB with four categories partially defined. There
are 32 such categories, although I only need about half. It also
produces lots of errors because of the numeric size of the parameters
in the calls to 'matchRange'. The first range above -
'\u16EE'..'\u16F0' - generates the following call:
matchRange('\x4543d','\x45430');
I suspect this is due to a bug in the C# code generator (IOW it's
probably my bug since I am part of the team that wrote that) because
all the character values in the definition are valid. I have used
the following option:
charVocabulary = '\u0003'..'\uFFFE';
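To make the 'matchRange' problem concrete: a C# '\x' escape takes at
most four hex digits, so a literal such as '\x4543d' cannot be a valid
character literal. Below is a minimal sketch, in Java, of how a code
generator could emit fixed-width, four-digit escapes instead. The class
and method names (CharLiteralSketch, charLiteral, matchRangeCall) are
made up for illustration and are not taken from the ANTLR sources; I
have not yet checked what the real generator does.

```java
// Hypothetical sketch: not the actual ANTLR C# code generator.
final class CharLiteralSketch {

    // Render a 16-bit character as a quoted literal using a fixed-width
    // escape (backslash, 'u', exactly four upper-case hex digits).
    static String charLiteral(char c) {
        return String.format("'\\u%04X'", (int) c);
    }

    // Build the text of a matchRange call for the range lo..hi.
    static String matchRangeCall(char lo, char hi) {
        return "matchRange(" + charLiteral(lo) + "," + charLiteral(hi) + ");";
    }

    public static void main(String[] args) {
        // First range of the Nl rule above.
        System.out.println(matchRangeCall('\u16EE', '\u16F0'));
    }
}
```

A fixed-width escape avoids any dependence on how many hex digits the
target language's variable-length escape will consume.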
CONCLUSION:
I would have loved to be able to use option (a). Since I don't have
that option I thought option (b) would be clearer and more succinct
than (c) and would perform better given its vastly reduced code size.
I will track down the origins of the errors on option (c) but I
dislike it because it results in a huge ANTLR definition file and a
huge generated Lexer/Parser source file. Am I right in thinking it
would result in perhaps the least performant parsers?
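For reference, the helper that the predicates in option (b) rely on
could be sketched roughly as follows. This is Java rather than C#, and
the names (NlPredicateSketch, isUnicodeClassNl) are placeholders. It
assumes the escape sequence text has the form backslash, 'u', four hex
digits, and it leans on the runtime's own Unicode tables, so the exact
set of characters accepted depends on the Unicode version the runtime
ships with. The C# counterpart would presumably go through
Char.GetUnicodeCategory and UnicodeCategory.LetterNumber.

```java
// Hypothetical sketch of the predicate helper used by option (b).
final class NlPredicateSketch {

    // True if the character is in Unicode category Nl (Letter, Number),
    // according to the Unicode tables built into the Java runtime.
    static boolean isUnicodeClassNl(char c) {
        return Character.getType(c) == Character.LETTER_NUMBER;
    }

    // Overload for the text of an escape sequence. ASSUMPTION: the text is
    // a backslash, the letter u, then exactly four hex digits.
    static boolean isUnicodeClassNl(String escape) {
        if (escape == null || escape.length() != 6 || !escape.startsWith("\\u")) {
            return false;
        }
        int value = 0;
        for (int i = 2; i < 6; i++) {
            int digit = Character.digit(escape.charAt(i), 16);
            if (digit < 0) {
                return false; // not a hex digit
            }
            value = value * 16 + digit;
        }
        return isUnicodeClassNl((char) value);
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeClassNl('\u2160'));   // ROMAN NUMERAL ONE
        System.out.println(isUnicodeClassNl("\\u16EE"));  // escape-sequence form
        System.out.println(isUnicodeClassNl('A'));        // an ordinary letter
    }
}
```

With a helper like this the grammar stays small, at the cost of one
method call per candidate character.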
What do you fine people suggest?
Cheers,
Micheal