[antlr-interest] Non-deterministic behaviour in matching lexer tokens

Fri May 27 16:57:09 PDT 2011

I am now seeing your point.  Providing the prefix (and suffix) allows for
the disambiguation
but will also feel consistent for the user.  The user will know that for
values (supporting a wider range
of utf-8 character) they will need to quote their strings, but if they are
declaring a variable they should
follow the identifier format (defined in my grammar as \w+).

The language I am capturing is around biological expressions.  I can express
biological relationships
in arbitrary strings such as

(6E,8Z,11Z,14Z)-(5S)-5-Hydroperoxyeicosa-6,8,11,14-tetraenoate

and I can declare language constructs and assign them to variables

DEFINE my_group AS (...)

Here my_group would follow the IDENT approach.

I initially wanted to provide a quoting convenience and say:
- if you have letters and numbers in sequence then you don't need to double
quote
- if you have other characters (such as '-' or ' ') then you need to double
quote

Consider a biological expression:

    protein(NS:ABC1) increases protein(NS:"AB Acetase 7-G")

In order to know where the value ends it would be straight-forward to
surround with quotes.

I think I will just change the specific values to be QUOTED_VALUE which is
already an option
in the grammar.

Thanks for the help and guidance.

-tony

On Fri, May 27, 2011 at 7:43 PM, Kirby Bohling <kirby.bohling at gmail.com>wrote:

> I'm going to repeat:  The fact that you're having this problem is a
> warning sign that your language is ill-defined, or at least ANTLR
> isn't the best tool to parse it.
>
> Well, when choose to resolve the ambiguity in the parser or tree
> grammar, you'll have plenty of context in order to figure it out.  I
> have no idea what an ident or value is in your language, or how I'd
> use them differently.  I've got a guess based on the common usage of
> them.
>
> Any chance you'll send examples and how you'd figure it out by hand?
>
> But there's a reason in C, that variables, can't start with a digit,
> and that literal strings are enclosed with quotes, precisely to avoid
> this type of problem.  It's just so much easier, that is worth forcing
> the programmer add something near the beginning to remove the
> ambiguity of what they mean.
>
> Without that it is really hard to tell if:
>
> action foo;
>
> foo means the "identifier" foo, or if it means the "value" foo.  How
> would you figure it out?  Is it like C++ constructor vs. function call
> where, if it can be treated like an identifier it is?  In that case,
> you're going to need to write and build an symbol table as you go, and
> when you come across an identifier, look it up in the symbol table, if
> that works, then it is an identifier, if it doesn't, it is must be a
> literal.
>
> Kirby
>
>
> On Fri, May 27, 2011 at 6:12 PM, Anthony Bargnesi <abargnesi at gmail.com>
> wrote:
> > Thanks again Kirby.
> >
> > The first examples were meant for discussion but the real grammar has
> > ambiguous
> > language with identifiers and values.
> >
> > The identifier is defined as a regular expression \w+ which translates to
> > LETTER: ('a'..'z' | 'A'..'Z');
> > DIGIT: '0'..'9';
> > IDENT: LETTER (LETTER | DIGIT | '_')*
> >
> > and VALUE would be a looser definition than IDENT
> > VALUE: (LETTER | DIGIT)+;
> >
> > So because not all IDENT will have an underscore the lexer can not
> > distinguish
> > between both tokens.
> >
> > Assuming I can't differentiate VALUE from IDENT your options might help,
> but
> >
> > (option 1 - merge with single rule)
> > If I merged both tokens into a rule defined as:
> > identifier: VALUE | IDENT { $type = determineType(...); }
> >
> > I still won't know how to determine the type because I'm not sure what
> > parent
> > rule context I'm processing under.  The characters of the token alone
> cannot
> > tell me that.
> >
> > (option 2 - emit different token at parse time)
> > Same issue as option 1.  I will not be able to differentiate based on the
> > token characters alone.
> >
> > -tony
> >
> > On Fri, May 27, 2011 at 6:43 PM, Kirby Bohling <kirby.bohling at gmail.com>
> > wrote:
> >>
> >> What you'd likely do is one of the following:
> >>
> >> 1. Merge all the token types into a single rule that recognizes all of
> >> them, and after that rule finishes, figure out the "right" answer and
> >> set the token type.
> >> (In this case, everything is handled in the lexer).  At the end of the
> >> rule you'd have a section that did { $type = computeTokenType(...); }
> >> Or you can use the float vs. range FAQ entry to get the lexer to do
> >> all of that for you, just the lexer will be a serious hassle to read
> >> and modify.
> >>
> >> I don't know off hand what goes in the arguments.  Write a version
> >> that takes no arguments, then look at the generated code and see what
> >> you think you'll need, and pass it in to that function.
> >>
> >> 2. Treat it like Keyword vs. Identifier problem from the FAQ.
> >> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
> >>
> >> In this case, you're doing it later in the parser, which likely gives
> >> you more flexibility.  You will do effectively the same thing in this
> >> case, but you'll modify the token type (or generate an imaginary
> >> parent token) in the parser after looking to see "I need an VALUE" at
> >> this point in the parser.  You'll have a rule that effectively asks
> >> "Is this identifier I have really a VALUE?", and use that anywhere you
> >> require a VALUE and not an IDENT.
> >>
> >> Honestly, I am guessing that the whole problem is a sign that you're
> >> doing it the hard way.  It'd be much easier if you designed the
> >> language such that it was unambiguous, but I don't know much about the
> >> problem at hand.
> >>
> >> Kirby
> >>
> >>
> >>
> >> On Fri, May 27, 2011 at 5:30 PM, Anthony Bargnesi <abargnesi at gmail.com>
> >> wrote:
> >> > Thanks for the quick reply!
> >> >
> >> > My second grammar was a mistake, sorry.  I realize that '!'+ does a
> good
> >> > job
> >> > of disambiguating
> >> > VALUE from IDENT.
> >> >
> >> > But if I change that second grammar too:
> >> >
> >> > call:
> >> >     'call' id=IDENT
> >> >     ;
> >> >
> >> > action:
> >> >     'action' VALUE
> >> >     ;
> >> >
> >> > IDENT:
> >> >     LETTER (LETTER | DIGIT | '_')*
> >> >     ;
> >> >
> >> > VALUE:
> >> >     (LETTER | DIGIT)+
> >> >     ;
> >> >
> >> > fragment LETTER:
> >> >     ('a'..'z' | 'A'..'Z')
> >> >     ;
> >> >
> >> > fragment DIGIT:
> >> >     '0'..'9'
> >> >     ;
> >> >
> >> > WS:
> >> >     (' ' | '\t' | '\n' | '\r'| '\f')+
> >> >     {$channel = HIDDEN;}
> >> >     ;
> >> >
> >> > Then I parse "action myval" and receive this error:
> >> >
> >> > line 1:7 mismatched input 'myval' expecting VALUE
> >> >
> >> > Because the lexer cannot determine whether the token is IDENT or VALUE
> >> > my action rule will fail.
> >> >
> >> > What are my options for disambiguation at this point?
> >> >
> >> > -tony
> >> >
> >> >
> >> > On Fri, May 27, 2011 at 6:23 PM, Kirby Bohling <
> kirby.bohling at gmail.com>
> >> > wrote:
> >> >>
> >> >> First grammar:
> >> >> > VALUE:
> >> >> >    (LETTER | DIGIT)+
> >> >> >    ;
> >> >>
> >> >> Second Grammar:
> >> >> > VALUE:
> >> >> >    (LETTER | DIGIT) '!'+
> >> >> >    ;
> >> >> > action MYVAL!   (MismatchedTokenException: line 3:7 mismatched
> input
> >> >> > 'MYVAL'
> >> >>
> >> >> You've got the rule in + in the wrong place.  I'm pretty sure you
> >> >> meant:
> >> >>
> >> >> VALUE:
> >> >>   (LETTER | DIGIT)+ '!'
> >> >> ;
> >> >>
> >> >> It is blowing up at the 'Y', because it can have one letter or one
> >> >> digit, and at least '!'.  You've given it 5 letters then one '!'.
> >> >>
> >> >> While you can make this work, it would likely be easier to make the
> >> >> difference between those to easier to disambiguate.  However, if you
> >> >> think this is the correct approach read the FAQ about floats vs.
> >> >> ranges:
> >> >>
> >> >>
> >> >>
> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
> >> >>
> >> >> That's got the example of all of the power tools for how to man
> handle
> >> >> ambiguous tokens types.
> >> >>
> >> >> Kirby
> >> >
> >> >
> >
> >
>