[antlr-interest] Tokens and literals: how to avoid conflicts?

Fri Jul 11 11:54:15 PDT 2008

On Fri, 2008-07-11 at 19:57 +0200, Gioele Barabucci wrote:

> This simple grammar
> 
> > grammar Tok1;
> > 
> > stmt:     'id' S id_name S ('#IMP'|'#FIX') EOF; 
> > id_name:  '#' NAME;
> > 
> > NAME: ('A'..'Z')+;
> > S: (' '|'\n')+;
> 
> parses "id #AAA #FIX" correctly but fails on "id #III #FIX" with a
> MismatchedTokenException(5!=9).
> 
> I think the exception is raised because as soon as the lexer sees "#I"
> (in "#III") it expects the token to be equal to the literal "#IMP".
> 
> Is there a way to work around this problem?

Once again, I really cannot stress enough that new users should not use
inline 'quoted' literals in the parser. They send you down the wrong
street until you are completely familiar with how the parser and lexer
work together. I look at your grammar and see the obvious problems, but
I just don't see how new users would.

So (and this goes for all new users):

1) Don't use quoted strings in the parser;
2) Write your token rules for the lexer. You should immediately see that
your token specs are not consistent.
3) Think of the decisions that the lexer might have to make, and write
your rules that way. ANTLR is pretty good at working things out, but
when you come back to tweak something in 12 months' time, you will have
forgotten why it works, so being explicit keeps your code maintainable.
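
To make point 2 concrete, here is roughly what the lexer has to work
with once the quoted strings in the original grammar are written out as
token rules. This is only an illustration: the rule names KW_ID,
HASH_IMP, HASH_FIX and HASH are made up here (ANTLR generates its own
names for implicit tokens), and the order shown is not meant to reflect
what ANTLR does internally.

```antlr
// Implicit tokens created by the quoted strings in the parser rules
// (hypothetical names, for illustration only)
KW_ID    : 'id' ;
HASH_IMP : '#IMP' ;
HASH_FIX : '#FIX' ;
HASH     : '#' ;

// Token rules the original grammar declared explicitly
NAME : ('A'..'Z')+ ;
S    : (' '|'\n')+ ;
```

Laid out like this, the overlap is easier to spot: '#III' starts with
'#I', just as '#IMP' does, while the intended reading is HASH followed
by NAME. That is consistent with the diagnosis in the original question.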

In this case, then, you have some keywords that begin with '#' and you
also want identifiers that start with '#'. There are quite a few ways to
write this, but the following will keep it clean. First, some fragment
rules to lay out the tokens and document them:

fragment IMP : 'IMP';
fragment FIX : 'FIX' ;

Now a token rule or two that will actually return something to the
parser:

ID    : 'id' ;
IDENT : ('a'..'z' | 'A'..'Z')+ ;
HASH  : '#'   // Many things prefix with HASH, differentiate them here
               (  (FIX)=>FIX  { $type = FIX; }
                  | (IMP)=>IMP {$type = IMP; }
                  | // Neither keyword, sometimes HASH is just HASH and not pounds
               )
            ;

Now, in the parser, use the token names:

stmt: ID S idName S (IMP|FIX) EOF ;
idName : HASH IDENT;

Also, if you use:
S : (' '| '\t')+ { $channel = HIDDEN; } ;

Then you can leave the S tokens out of the parser.

Of course, you only need to do that if you specifically want to eat the
'#' with FIX and IMP. It would be more obvious still if you did:

ID : 'id' ;
FIX : 'FIX' ;
IMP : 'IMP' ;
HASH : '#' ;
IDENT : ('a'..'z'|'A'..'Z')+;

... (impOrFix)
impOrFix : HASH (IMP|FIX) ;
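
Put together, the simpler version might look something like the
complete grammar below. Treat it as an untested sketch: the grammar name
Tok2, the shape of stmt (reconstructed from the original question), and
the choice to put whitespace on the hidden channel are all assumptions.

```antlr
grammar Tok2;   // hypothetical name

// Parser rules refer to tokens by name only; no quoted strings here.
// stmt is an assumed reconstruction based on the original question.
stmt     : ID idName impOrFix EOF ;
idName   : HASH IDENT ;
impOrFix : HASH (IMP|FIX) ;

// Lexer rules. The keywords are listed before IDENT, which would
// otherwise match the same text.
ID    : 'id' ;
FIX   : 'FIX' ;
IMP   : 'IMP' ;
HASH  : '#' ;
IDENT : ('a'..'z'|'A'..'Z')+ ;

// Whitespace goes to the hidden channel, so the parser never sees it.
S     : (' '|'\t'|'\n')+ { $channel = HIDDEN; } ;
```

With this layout "id #III #FIX" should come out of the lexer as ID,
HASH, IDENT, HASH, FIX. One consequence to be aware of: FIX and IMP are
now keywords, so "#FIX" and "#IMP" can no longer be used as names,
because the lexer hands back FIX or IMP for them rather than IDENT.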

Jim
