[antlr-interest] Non-deterministic behaviour in matching lexer tokens

Kirby Bohling kirby.bohling at gmail.com
Fri May 27 16:43:27 PDT 2011


I'm going to repeat:  The fact that you're having this problem is a
warning sign that your language is ill-defined, or at least ANTLR
isn't the best tool to parse it.

Well, when choose to resolve the ambiguity in the parser or tree
grammar, you'll have plenty of context in order to figure it out.  I
have no idea what an ident or value is in your language, or how I'd
use them differently.  I've got a guess based on the common usage of
them.

Any chance you'll send examples and how you'd figure it out by hand?

But there's a reason in C, that variables, can't start with a digit,
and that literal strings are enclosed with quotes, precisely to avoid
this type of problem.  It's just so much easier, that is worth forcing
the programmer add something near the beginning to remove the
ambiguity of what they mean.

Without that it is really hard to tell if:

action foo;

foo means the "identifier" foo, or if it means the "value" foo.  How
would you figure it out?  Is it like C++ constructor vs. function call
where, if it can be treated like an identifier it is?  In that case,
you're going to need to write and build an symbol table as you go, and
when you come across an identifier, look it up in the symbol table, if
that works, then it is an identifier, if it doesn't, it is must be a
literal.

Kirby


On Fri, May 27, 2011 at 6:12 PM, Anthony Bargnesi <abargnesi at gmail.com> wrote:
> Thanks again Kirby.
>
> The first examples were meant for discussion but the real grammar has
> ambiguous
> language with identifiers and values.
>
> The identifier is defined as a regular expression \w+ which translates to
> LETTER: ('a'..'z' | 'A'..'Z');
> DIGIT: '0'..'9';
> IDENT: LETTER (LETTER | DIGIT | '_')*
>
> and VALUE would be a looser definition than IDENT
> VALUE: (LETTER | DIGIT)+;
>
> So because not all IDENT will have an underscore the lexer can not
> distinguish
> between both tokens.
>
> Assuming I can't differentiate VALUE from IDENT your options might help, but
>
> (option 1 - merge with single rule)
> If I merged both tokens into a rule defined as:
> identifier: VALUE | IDENT { $type = determineType(...); }
>
> I still won't know how to determine the type because I'm not sure what
> parent
> rule context I'm processing under.  The characters of the token alone cannot
> tell me that.
>
> (option 2 - emit different token at parse time)
> Same issue as option 1.  I will not be able to differentiate based on the
> token characters alone.
>
> -tony
>
> On Fri, May 27, 2011 at 6:43 PM, Kirby Bohling <kirby.bohling at gmail.com>
> wrote:
>>
>> What you'd likely do is one of the following:
>>
>> 1. Merge all the token types into a single rule that recognizes all of
>> them, and after that rule finishes, figure out the "right" answer and
>> set the token type.
>> (In this case, everything is handled in the lexer).  At the end of the
>> rule you'd have a section that did { $type = computeTokenType(...); }
>> Or you can use the float vs. range FAQ entry to get the lexer to do
>> all of that for you, just the lexer will be a serious hassle to read
>> and modify.
>>
>> I don't know off hand what goes in the arguments.  Write a version
>> that takes no arguments, then look at the generated code and see what
>> you think you'll need, and pass it in to that function.
>>
>> 2. Treat it like Keyword vs. Identifier problem from the FAQ.
>> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>>
>> In this case, you're doing it later in the parser, which likely gives
>> you more flexibility.  You will do effectively the same thing in this
>> case, but you'll modify the token type (or generate an imaginary
>> parent token) in the parser after looking to see "I need an VALUE" at
>> this point in the parser.  You'll have a rule that effectively asks
>> "Is this identifier I have really a VALUE?", and use that anywhere you
>> require a VALUE and not an IDENT.
>>
>> Honestly, I am guessing that the whole problem is a sign that you're
>> doing it the hard way.  It'd be much easier if you designed the
>> language such that it was unambiguous, but I don't know much about the
>> problem at hand.
>>
>> Kirby
>>
>>
>>
>> On Fri, May 27, 2011 at 5:30 PM, Anthony Bargnesi <abargnesi at gmail.com>
>> wrote:
>> > Thanks for the quick reply!
>> >
>> > My second grammar was a mistake, sorry.  I realize that '!'+ does a good
>> > job
>> > of disambiguating
>> > VALUE from IDENT.
>> >
>> > But if I change that second grammar too:
>> >
>> > call:
>> >     'call' id=IDENT
>> >     ;
>> >
>> > action:
>> >     'action' VALUE
>> >     ;
>> >
>> > IDENT:
>> >     LETTER (LETTER | DIGIT | '_')*
>> >     ;
>> >
>> > VALUE:
>> >     (LETTER | DIGIT)+
>> >     ;
>> >
>> > fragment LETTER:
>> >     ('a'..'z' | 'A'..'Z')
>> >     ;
>> >
>> > fragment DIGIT:
>> >     '0'..'9'
>> >     ;
>> >
>> > WS:
>> >     (' ' | '\t' | '\n' | '\r'| '\f')+
>> >     {$channel = HIDDEN;}
>> >     ;
>> >
>> > Then I parse "action myval" and receive this error:
>> >
>> > line 1:7 mismatched input 'myval' expecting VALUE
>> >
>> > Because the lexer cannot determine whether the token is IDENT or VALUE
>> > my action rule will fail.
>> >
>> > What are my options for disambiguation at this point?
>> >
>> > -tony
>> >
>> >
>> > On Fri, May 27, 2011 at 6:23 PM, Kirby Bohling <kirby.bohling at gmail.com>
>> > wrote:
>> >>
>> >> First grammar:
>> >> > VALUE:
>> >> >    (LETTER | DIGIT)+
>> >> >    ;
>> >>
>> >> Second Grammar:
>> >> > VALUE:
>> >> >    (LETTER | DIGIT) '!'+
>> >> >    ;
>> >> > action MYVAL!   (MismatchedTokenException: line 3:7 mismatched input
>> >> > 'MYVAL'
>> >>
>> >> You've got the rule in + in the wrong place.  I'm pretty sure you
>> >> meant:
>> >>
>> >> VALUE:
>> >>   (LETTER | DIGIT)+ '!'
>> >> ;
>> >>
>> >> It is blowing up at the 'Y', because it can have one letter or one
>> >> digit, and at least '!'.  You've given it 5 letters then one '!'.
>> >>
>> >> While you can make this work, it would likely be easier to make the
>> >> difference between those to easier to disambiguate.  However, if you
>> >> think this is the correct approach read the FAQ about floats vs.
>> >> ranges:
>> >>
>> >>
>> >> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
>> >>
>> >> That's got the example of all of the power tools for how to man handle
>> >> ambiguous tokens types.
>> >>
>> >> Kirby
>> >
>> >
>
>


More information about the antlr-interest mailing list