[antlr-interest] Non-deterministic behaviour in matching lexer tokens

Fri May 27 15:43:41 PDT 2011

What you'd likely do is one of the following:

1. Merge all the token types into a single rule that recognizes all of
them, and after that rule finishes, figure out the "right" answer and
set the token type.  (In this case, everything is handled in the
lexer.)  At the end of the rule you'd have an action along the lines
of { $type = computeTokenType(...); }
Or you can use the technique from the float vs. range FAQ entry to get
the lexer to do all of that for you, but the resulting lexer will be a
serious hassle to read and modify.

I don't know offhand what goes in the arguments.  Write a version
that takes no arguments, then look at the generated code, see what
you think you'll need, and pass that in to the function.
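
As a rough, untested sketch of option 1 (the WORD rule name and the
body of computeTokenType are made up for illustration, and passing
$text is only my guess at a useful argument):

```antlr
tokens { IDENT; VALUE; }   // token types only; WORD below assigns them

@lexer::members {
    // Hypothetical helper, not part of ANTLR.
    // Digit-initial text can only be a VALUE.  Everything else is
    // returned as IDENT, including text such as "myval" that fits
    // both definitions (an arbitrary default).
    private int computeTokenType(String text) {
        return Character.isDigit(text.charAt(0)) ? VALUE : IDENT;
    }
}

// Replaces the separate IDENT and VALUE lexer rules.
WORD
    : ( LETTER (LETTER | DIGIT | '_')*
      | DIGIT (LETTER | DIGIT)*
      )
      { $type = computeTokenType($text); }
    ;
```

Note that text matching both definitions still has to fall back to a
default here, because the lexer alone has no idea what the parser is
expecting at that point.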

2. Treat it like the Keyword vs. Identifier problem from the FAQ:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741

In this case, you're doing it later, in the parser, which likely gives
you more flexibility.  You do effectively the same thing, but you
modify the token type (or generate an imaginary parent token) in the
parser once you know "I need a VALUE" at this point in the parse.
You'll have a rule that effectively asks "Is this identifier I have
really a VALUE?", and you use it anywhere you require a VALUE and not
an IDENT.
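
A rough, untested sketch of option 2, keeping your lexer rules as they
are and assuming the lexer keeps handing back IDENT for text that fits
both rules (which is what your error message suggests).  The rule name
value is made up:

```antlr
action
    : 'action' value
    ;

// Accept a real VALUE token, or an IDENT whose text would also be a
// legal VALUE, i.e. one that contains no '_'.
value
    : VALUE
    | {input.LT(1).getText().indexOf('_') < 0}? IDENT
    ;
```

The call rule still uses IDENT directly, so it is unaffected.  If the
predicate fails (say, on "action my_val"), I'd expect a failed
predicate error rather than a mismatched token.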

Honestly, I am guessing that the whole problem is a sign that you're
doing it the hard way.  It'd be much easier if you designed the
language such that it was unambiguous, but I don't know much about the
problem at hand.

Kirby

On Fri, May 27, 2011 at 5:30 PM, Anthony Bargnesi <abargnesi at gmail.com> wrote:
> Thanks for the quick reply!
>
> My second grammar was a mistake, sorry.  I realize that '!'+ does a good job
> of disambiguating
> VALUE from IDENT.
>
> But if I change that second grammar to:
>
> call:
>     'call' id=IDENT
>     ;
>
> action:
>     'action' VALUE
>     ;
>
> IDENT:
>     LETTER (LETTER | DIGIT | '_')*
>     ;
>
> VALUE:
>     (LETTER | DIGIT)+
>     ;
>
> fragment LETTER:
>     ('a'..'z' | 'A'..'Z')
>     ;
>
> fragment DIGIT:
>     '0'..'9'
>     ;
>
> WS:
>     (' ' | '\t' | '\n' | '\r'| '\f')+
>     {$channel = HIDDEN;}
>     ;
>
> Then I parse "action myval" and receive this error:
>
> line 1:7 mismatched input 'myval' expecting VALUE
>
> Because the lexer cannot determine whether the token is IDENT or VALUE
> my action rule will fail.
>
> What are my options for disambiguation at this point?
>
> -tony
>
>
> On Fri, May 27, 2011 at 6:23 PM, Kirby Bohling <kirby.bohling at gmail.com>
> wrote:
>>
>> First grammar:
>> > VALUE:
>> >    (LETTER | DIGIT)+
>> >    ;
>>
>> Second Grammar:
>> > VALUE:
>> >    (LETTER | DIGIT) '!'+
>> >    ;
>> > action MYVAL!   (MismatchedTokenException: line 3:7 mismatched input
>> > 'MYVAL'
>>
>> You've got the + in the wrong place in the rule.  I'm pretty sure you meant:
>>
>> VALUE:
>>   (LETTER | DIGIT)+ '!'
>> ;
>>
>> It is blowing up at the 'Y', because the rule allows exactly one letter
>> or digit, followed by at least one '!'.  You've given it 5 letters then
>> one '!'.
>>
>> While you can make this work, it would likely be easier to change the
>> language so that those two tokens are easier to tell apart.  However,
>> if you think this is the correct approach, read the FAQ about floats
>> vs. ranges:
>>
>> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
>>
>> That's got examples of all of the power tools for manhandling
>> ambiguous token types.
>>
>> Kirby
>
>