[antlr-interest] Mismatched token problem

Tue Jan 13 19:59:10 PST 2009

Richard Wallace wrote:
> Hello,
> 
> I am trying to write a rule to match expressions in the following algebraic form
> 
> an+b
> 
> But, when the b term is negative it is only allowed to be written as
> 
> an-b
> 
> It seems easy enough, the problem is that identifiers can have the '-'
> character in them.  So I have the following in my grammar
> 
> expr
>        :       DASH? NUMBER? 'n' S* ( PLUS | DASH ) S* NUMBER
>        ;
> 
> DASH
>        :        '-'
>        ;
> 
> PLUS
>        :       '+'
>        ;
> 
> IDENT
>        :       ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>                ('_' | DASH | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' | '0'..'9')*
>        |       DASH ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>                ('_' | DASH | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' | '0'..'9')*
>        ;
> 
> NUMBER
>        :       '-' (('0'..'9')* '.')? ('0'..'9')+
>        |       (('0'..'9')* '.')? ('0'..'9')+
>        ;
> S
>        :       ( ' ' | '\t' | '\r' | '\n' | '\f' )
>        ;
> 
> So, when I try this grammar against 4n+3 it works great.  But, if I
> try it against 4n-1 it fails with a MismatchedTokenException.  This
> seems to be because when evaluating 4n-1 antlr matches the expression
> as NUMBER IDENT instead of NUMBER 'n' DASH NUMBER.  I've tried
> changing the lookahead and using backtracking all to no avail.  I'm
> out of ideas on how to make antlr stop seeing the n-1 as an IDENT and
> instead see it as 'n' DASH NUMBER.  Any suggestions?

Take the '-' out of the NUMBER production (i.e. remove the first
alternative):

NUMBER : (('0'..'9')* '.')? ('0'..'9')+
        ;
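
To make the effect of that change concrete, here is a small Python sketch
(plain regular expressions, not ANTLR) that translates the two NUMBER
definitions by hand. The names SIGNED_NUMBER, UNSIGNED_NUMBER and number_at
are made up for illustration, and the regex translation is an approximation
of the lexer rules, not something ANTLR generates.

```python
import re

# Hand-written regex translations of the two NUMBER definitions
# (an approximation for illustration, not ANTLR output).
SIGNED_NUMBER = re.compile(r"-?(?:[0-9]*\.)?[0-9]+")   # original: allows a leading '-'
UNSIGNED_NUMBER = re.compile(r"(?:[0-9]*\.)?[0-9]+")   # suggested: no sign


def number_at(pattern, text, pos):
    """Return the text the pattern matches starting exactly at pos, or None."""
    m = pattern.match(text, pos)
    return m.group(0) if m else None


text = "-1"
signed = number_at(SIGNED_NUMBER, text, 0)        # the sign is swallowed by NUMBER
unsigned = number_at(UNSIGNED_NUMBER, text, 0)    # no NUMBER can start at '-'
after_dash = number_at(UNSIGNED_NUMBER, text, 1)  # NUMBER starts after the '-'
print(signed, unsigned, after_dash)
```

With the unsigned form, the '-' has to come out as its own DASH token, which
is what the ( PLUS | DASH ) part of expr expects.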

Why is '-' a valid IDENT character?  And are you using IDENT anywhere
else in your grammar?  I don't see it referenced in the snippet above.
If you need to allow '-' in IDENT names, you may need to use a predicate
so that it doesn't get confused with its use in expr.  Where can IDENTs
be used?  By default the ANTLR lexer will try to match as much of the
input as it can for each token, so "n-1" is consumed as a single IDENT.
This happens before the parser sees anything: IDENT is a lexer rule
(i.e. made up of characters) whereas expr is a parser rule (made up of
tokens).
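
The lexer behaviour described above can be imitated with a short Python
sketch. It is a simplified longest-match model, not ANTLR's actual
prediction machinery, and it assumes the unsigned NUMBER rule and that the
implicit 'n' literal wins when it ties with IDENT. RULES and tokenize are
hypothetical names used only for this illustration.

```python
import re

# Regex approximations of the lexer rules from the grammar in this thread.
# The 'n' literal from expr is listed first so that it wins ties with IDENT;
# that priority is an assumption of this sketch.
IDENT_START = r"[_a-zA-Z\u0100-\ufffe]"
IDENT_PART = r"[_\-a-zA-Z0-9\u0100-\ufffe]"
RULES = [
    ("'n'", re.compile(r"n")),
    ("DASH", re.compile(r"-")),
    ("PLUS", re.compile(r"\+")),
    ("IDENT", re.compile(rf"-?{IDENT_START}{IDENT_PART}*")),
    ("NUMBER", re.compile(r"(?:[0-9]*\.)?[0-9]+")),
    ("S", re.compile(r"[ \t\r\n\f]")),
]


def tokenize(text):
    """Split text into (rule, lexeme) pairs, taking the longest match at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], best[1].group(0)))
        pos = best[1].end()
    return tokens


print(tokenize("4n+3"))
print(tokenize("4n-1"))
```

In this model "4n+3" splits into NUMBER, 'n', PLUS, NUMBER, while "4n-1"
comes out as NUMBER followed by the single IDENT "n-1", which matches the
NUMBER IDENT sequence reported in the original question. The parser never
gets a chance to see a separate 'n' or DASH.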

> Thanks,
> Rich

-- 
Kevin J. Cummings