[antlr-interest] Non-deterministic behaviour in matching lexer tokens

Fri May 27 16:12:17 PDT 2011

Thanks again Kirby.

The first examples were meant for discussion but the real grammar has
ambiguous
language with identifiers and values.

The identifier is defined as a regular expression \w+ which translates to
LETTER: ('a'..'z' | 'A'..'Z');
DIGIT: '0'..'9';
IDENT: LETTER (LETTER | DIGIT | '_')*

and VALUE would be a looser definition than IDENT
VALUE: (LETTER | DIGIT)+;

So because not all IDENT will have an underscore the lexer can not
distinguish
between both tokens.

Assuming I can't differentiate VALUE from IDENT your options might help, but

(option 1 - merge with single rule)
If I merged both tokens into a rule defined as:
identifier: VALUE | IDENT { $type = determineType(...); }

I still won't know how to determine the type because I'm not sure what
parent
rule context I'm processing under.  The characters of the token alone cannot
tell me that.

(option 2 - emit different token at parse time)
Same issue as option 1.  I will not be able to differentiate based on the
token characters alone.

-tony

On Fri, May 27, 2011 at 6:43 PM, Kirby Bohling <kirby.bohling at gmail.com>wrote:

> What you'd likely do is one of the following:
>
> 1. Merge all the token types into a single rule that recognizes all of
> them, and after that rule finishes, figure out the "right" answer and
> set the token type.
> (In this case, everything is handled in the lexer).  At the end of the
> rule you'd have a section that did { $type = computeTokenType(...); }
> Or you can use the float vs. range FAQ entry to get the lexer to do
> all of that for you, just the lexer will be a serious hassle to read
> and modify.
>
> I don't know off hand what goes in the arguments.  Write a version
> that takes no arguments, then look at the generated code and see what
> you think you'll need, and pass it in to that function.
>
> 2. Treat it like Keyword vs. Identifier problem from the FAQ.
> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>
> In this case, you're doing it later in the parser, which likely gives
> you more flexibility.  You will do effectively the same thing in this
> case, but you'll modify the token type (or generate an imaginary
> parent token) in the parser after looking to see "I need an VALUE" at
> this point in the parser.  You'll have a rule that effectively asks
> "Is this identifier I have really a VALUE?", and use that anywhere you
> require a VALUE and not an IDENT.
>
> Honestly, I am guessing that the whole problem is a sign that you're
> doing it the hard way.  It'd be much easier if you designed the
> language such that it was unambiguous, but I don't know much about the
> problem at hand.
>
> Kirby
>
>
>
> On Fri, May 27, 2011 at 5:30 PM, Anthony Bargnesi <abargnesi at gmail.com>
> wrote:
> > Thanks for the quick reply!
> >
> > My second grammar was a mistake, sorry.  I realize that '!'+ does a good
> job
> > of disambiguating
> > VALUE from IDENT.
> >
> > But if I change that second grammar too:
> >
> > call:
> >     'call' id=IDENT
> >     ;
> >
> > action:
> >     'action' VALUE
> >     ;
> >
> > IDENT:
> >     LETTER (LETTER | DIGIT | '_')*
> >     ;
> >
> > VALUE:
> >     (LETTER | DIGIT)+
> >     ;
> >
> > fragment LETTER:
> >     ('a'..'z' | 'A'..'Z')
> >     ;
> >
> > fragment DIGIT:
> >     '0'..'9'
> >     ;
> >
> > WS:
> >     (' ' | '\t' | '\n' | '\r'| '\f')+
> >     {$channel = HIDDEN;}
> >     ;
> >
> > Then I parse "action myval" and receive this error:
> >
> > line 1:7 mismatched input 'myval' expecting VALUE
> >
> > Because the lexer cannot determine whether the token is IDENT or VALUE
> > my action rule will fail.
> >
> > What are my options for disambiguation at this point?
> >
> > -tony
> >
> >
> > On Fri, May 27, 2011 at 6:23 PM, Kirby Bohling <kirby.bohling at gmail.com>
> > wrote:
> >>
> >> First grammar:
> >> > VALUE:
> >> >    (LETTER | DIGIT)+
> >> >    ;
> >>
> >> Second Grammar:
> >> > VALUE:
> >> >    (LETTER | DIGIT) '!'+
> >> >    ;
> >> > action MYVAL!   (MismatchedTokenException: line 3:7 mismatched input
> >> > 'MYVAL'
> >>
> >> You've got the rule in + in the wrong place.  I'm pretty sure you meant:
> >>
> >> VALUE:
> >>   (LETTER | DIGIT)+ '!'
> >> ;
> >>
> >> It is blowing up at the 'Y', because it can have one letter or one
> >> digit, and at least '!'.  You've given it 5 letters then one '!'.
> >>
> >> While you can make this work, it would likely be easier to make the
> >> difference between those to easier to disambiguate.  However, if you
> >> think this is the correct approach read the FAQ about floats vs.
> >> ranges:
> >>
> >>
> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
> >>
> >> That's got the example of all of the power tools for how to man handle
> >> ambiguous tokens types.
> >>
> >> Kirby
> >
> >
>