[antlr-interest] Mismatched token problem

Wed Jan 14 08:00:17 PST 2009

On Tue, Jan 13, 2009 at 8:59 PM, Kevin J. Cummings
<cummings at kjchome.homeip.net> wrote:
> Richard Wallace wrote:
>>
>> Hello,
>>
>> I am trying to write a rule to match expressions in the following
>> algebraic form
>>
>> an+b
>>
>> But, when the b term is negative it is only allowed to be written as
>>
>> an-b
>>
>> It seems easy enough, the problem is that identifiers can have the '-'
>> character in them.  So I have the following in my grammar
>>
>> expr
>>       :       DASH? NUMBER? 'n' S* ( PLUS | DASH ) S* NUMBER
>>       ;
>>
>> DASH
>>       :        '-'
>>       ;
>>
>> PLUS
>>       :       '+'
>>       ;
>>
>> IDENT
>>       :       ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>>               ('_' | DASH | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' | '0'..'9')*
>>       |       DASH ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>>               ('_' | DASH | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' | '0'..'9')*
>>       ;
>>
>> NUMBER
>>       :       '-' (('0'..'9')* '.')? ('0'..'9')+
>>       |       (('0'..'9')* '.')? ('0'..'9')+
>>       ;
>> S
>>       :       ( ' ' | '\t' | '\r' | '\n' | '\f' )
>>       ;
>>
>> So, when I try this grammar against 4n+3 it works great.  But, if I
>> try it against 4n-1 it fails with a MismatchedTokenException.  This
>> seems to be because when evaluating 4n-1 antlr matches the expression
>> as NUMBER IDENT instead of NUMBER 'n' DASH NUMBER.  I've tried
>> changing the lookahead and using backtracking all to no avail.  I'm
>> out of ideas on how to make antlr stop seeing the n-1 as an IDENT and
>> instead see it as 'n' DASH NUMBER.  Any suggestions?
>
> Take the '-' out of the NUMBER production (i.e., remove the first alternative):
>
> NUMBER : (('0'..'9')* '.')? ('0'..'9')+
>       ;
>

Ah good point.  I had forgotten that was there.  Thanks.

> Why is '-' a valid IDENT character?  And are you using IDENT anywhere else
> in your grammar?  I don't see it referenced in the snippet above.
> If you need to use '-' in IDENT names, you may need to use a predicate so it
> doesn't get confused with the usage in the expr.  Where can IDENTs be used?
> By default the ANTLR lexer will try to match as many characters as it can
> for each token.  This happens before the parser ever sees the input.  IDENT
> is a lexer rule (i.e., made up of characters), whereas expr is a parser
> rule (made up of tokens).
>
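To make the lexer behaviour described above concrete, here is a small Python model of it. This is only a sketch: it uses a plain "longest match wins, earlier rule wins ties" loop, which is an assumption that approximates the result reported in this thread and is not how ANTLR's generated lexer is actually implemented. The rule table mirrors the grammar from the original post, with the leading '-' already removed from NUMBER; the token name N for the literal 'n' is made up for the illustration.

```python
import re

# Token rules modelled on the grammar in this thread (NUMBER without the
# leading '-').  "N" is a hypothetical name for the literal 'n' in expr.
RULES = [
    ("N", re.compile(r"n")),
    ("IDENT", re.compile(r"-?[_a-zA-Z\u0100-\ufffe][-_a-zA-Z0-9\u0100-\ufffe]*")),
    ("NUMBER", re.compile(r"(?:[0-9]*\.)?[0-9]+")),
    ("PLUS", re.compile(r"\+")),
    ("DASH", re.compile(r"-")),
    ("S", re.compile(r"[ \t\r\n\f]")),
]

def tokenize(text):
    """Split text into (rule, lexeme) pairs; the longest match wins and
    earlier rules win ties."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out

print(tokenize("4n+3"))  # NUMBER, N, PLUS, NUMBER
print(tokenize("4n-1"))  # NUMBER, then a single IDENT 'n-1'
```

In this model "4n+3" comes out as four tokens, while "4n-1" comes out as NUMBER followed by one IDENT with the text n-1, which is the NUMBER IDENT sequence the parser rule then rejects.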

I can't really say why '-' is a valid IDENT character.  I wish it
weren't but it is and I am powerless to change it.  IDENT is used in
quite a few places; I just sent in a shorter, more distilled version of
the grammar as an example of the problem.  A few of the rules where IDENT
is used are:

type : IDENT ;
id : '#' IDENT ;
class : '.' IDENT ;

I've been reading up on predicates, but I don't fully grasp how to apply
one in this case.  I thought that maybe doing something like the Lexer
Lookahead example on the page
<http://www.antlr.org/wiki/display/~gbrose85/7.++Common+Rules+and+Examples>
might do it, but that would also mean that if 'n' were used as an
identifier elsewhere, it wouldn't get recognized as an IDENT as it should.

I don't normally ask for this much hand-holding but I'm drawing a
blank here.  Think you could walk me through what you mean by using a
predicate?
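
One possible reading of the predicate idea, sketched outside ANTLR: leave the lexer alone so that n-1 still comes out as an IDENT everywhere, and only in the expr context inspect the token's text before accepting it. In a grammar that check would presumably live in a semantic predicate on the parser rule; the Python below models only the check itself. The names split_nth_ident and NTH_IDENT are hypothetical, and the assumption that only n, -n, n-<digits> and -n-<digits> need to be recognised is taken from the examples in this thread, not from any specification.

```python
import re

# Hypothetical pattern: the shapes an IDENT's text may take when it is
# really part of an an+b expression.  Not part of ANTLR.
NTH_IDENT = re.compile(r"(-?)n(?:-([0-9]+))?\Z")

def split_nth_ident(text):
    """Return (sign, b) if the IDENT text can be read as part of an an+b
    expression, or None if it is an ordinary identifier.

    sign is -1 when the text starts with '-', else 1.
    b is the negative offset glued onto the 'n' (e.g. -1 for 'n-1'),
    or None when the text carries no offset.
    """
    m = NTH_IDENT.match(text)
    if m is None:
        return None
    sign = -1 if m.group(1) else 1
    b = -int(m.group(2)) if m.group(2) is not None else None
    return sign, b

print(split_nth_ident("n-1"))   # the offset is recovered from the IDENT text
print(split_nth_ident("name"))  # an ordinary identifier is left alone
```

Because the check runs only where an expr is expected, an IDENT such as n-1 used as a type, id, or class name elsewhere is unaffected. Inputs with whitespace around the dash (for example "n - 1") are not covered by this helper.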

Thanks again,
Rich
