[antlr-interest] Fun with ANTLR3: mystery of the huge lexer

David Piepgrass qwertie256 at gmail.com
Sat Jun 30 16:12:17 PDT 2007


> Your ML_COMMENT needs to be a fragment rule and you need a predicate to
> stop '.' interfering with ML_COMMENT. I just produce this rule for my
> T-SQL lexer in fact (C here but the predicate is just input.LA(n) for
> Java):

Thanks, but is it really necessary to use a fragment? At the end of my
message I noted that this rule seems to work okay:

ML_COMMENT:
    ('/*')=> '/*'
    (options{greedy=false;} : ML_COMMENT | .)*
    '*/'
    { $channel = HIDDEN; };

ANTLR's architecture has changed, and lexer rules do not actually
create tokens themselves (did they in v2?). All of the generated token
functions return void.

> fragment        ML_COMFRAG
>             :
>                     '/*' ( options { greedy=false;}
>                                 : {(LA(1)== '/' && LA(2) == '*')}? ML_COMFRAG
>                                 |  .
>                                 )* '*/'
>             ;
>
> That should help with that part. Then is your PUNC rule something that
> returns a token, or are you using that somewhere else too?

PUNC returns a token and is not used anywhere else. Its job is to
gather any sequence of adjacent punctuation into one token, which is a
problem because a string like /*!*/ matches all three rules:
ML_COMMENT, PUNC and RE_STRING.

It's too bad I can't assign "priorities" to each rule. I would like to
match /* as a comment whenever possible, with /regular-expressions/
having the next-highest priority and PUNC having the lowest.
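
To make the priority idea concrete, here is a minimal hand-written
sketch in Python, not ANTLR. It tries the rules in a fixed order and
takes the first one that matches. The character set in PUNCT_CHARS and
the simplified regular-expression pattern are assumptions made for
illustration only; they are not taken from the actual grammar.

```python
import re

# Hypothetical punctuation set; the real grammar's set is not shown in the post.
PUNCT_CHARS = set("!$%&*+-/<=>?@^|~.:")

# Simplified stand-in for RE_STRING: '/', a non-empty body without '/', then '/'.
REGEX = re.compile(r"/[^/\n]+/")


def match_comment(text, pos):
    """Return the end index of a (possibly nested) /* ... */ comment, or None."""
    if not text.startswith("/*", pos):
        return None
    depth = 0
    i = pos
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1
            i += 2
        elif text.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unterminated comment


def match_regex(text, pos):
    """Return the end index of a /regular-expression/ literal, or None."""
    m = REGEX.match(text, pos)
    return m.end() if m else None


def match_punc(text, pos):
    """Return the end index of a run of adjacent punctuation, or None."""
    i = pos
    while i < len(text) and text[i] in PUNCT_CHARS:
        i += 1
    return i if i > pos else None


# Order encodes priority: comment first, then regex, then punctuation.
RULES = [
    ("ML_COMMENT", match_comment),
    ("RE_STRING", match_regex),
    ("PUNC", match_punc),
]


def candidates(text, pos):
    """List every rule that could match at pos, ignoring priority."""
    return [name for name, matcher in RULES if matcher(text, pos) is not None]


def next_token(text, pos):
    """Return (rule name, end index) for the highest-priority match, or None."""
    for name, matcher in RULES:
        end = matcher(text, pos)
        if end is not None:
            return name, end
    return None
```

With this ordering, candidates("/*!*/", 0) reports all three rules,
while next_token("/*!*/", 0) picks ML_COMMENT. Note that this is
first-match priority, which is not the same as choosing the longest
match.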

The reason I treat punctuation this way, by the way, is that the set
of available operators can be user-defined and it can vary by scope.
Therefore it is not possible to identify operators within the lexer.
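
As a rough illustration of why operators have to be resolved after
lexing, the following Python sketch splits one punctuation run into
operators using whatever operator set is in scope. The function name
split_operators and the longest-match strategy are assumptions for
the sake of the example, not a description of the actual parser.

```python
def split_operators(punc, operators):
    """Split a punctuation run into operators, preferring the longest match.

    `operators` is the (hypothetical) set of operator strings defined in
    the current scope. Raises ValueError if some part of the run cannot
    be matched by any operator in that set.
    """
    result = []
    longest = max(map(len, operators), default=0)
    i = 0
    while i < len(punc):
        # Try the longest possible operator first, then shorter ones.
        for size in range(min(longest, len(punc) - i), 0, -1):
            piece = punc[i:i + size]
            if piece in operators:
                result.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no operator in this scope matches {punc[i:]!r}")
    return result
```

The same text can then split differently depending on the scope: with
the operators {"+", "-", "=", "+="} the run "+=-" becomes "+=" and
"-", while with {"+", "-", "=", "=-"} it becomes "+" and "=-".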

-- 
- David
http://qism.blogspot.com

