[antlr-interest] Ambiguity between identifier and operator

Fri Jun 12 08:09:47 PDT 2009

On Thu, Jun 11, 2009 at 5:59 PM, Jim Idle <jimi at temporal-wave.com> wrote:

> David Chipping wrote:
> > I'm having some trouble trying to work out the best approach for some
> > ambiguity.
> >
> > If I have an identifier token defined as:
> >
> > IDENTIFIER: ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' |
> > '0'..'9')* ('$' | '!' | '&')?
> >
> > Where the last part of the rule indicates an optional implicit type
> > character.
> >
> > Unfortunately, "!" is also a binary operator that works with
> > identifiers. For example, the following is valid:
> >
> > foo!bar
> >
> > and indicates a ! operator with a left side of foo and right of bar.
> > This is only possible when the left side identifier doesn't end with an
> > implicit type character.
> >
> > I was initially thinking of doing some token re-writing to determine if
> > an identifier (without an implicit type char)  followed by a "!" is
> > followed by another identifier and then emit a separate "!" so that can
> > be picked up by the parser. But I'm not keen on doing this, as (as far
> > as I can see, please correct me if I'm wrong) this takes some of my
> > lexing rules out of my grammar and into another place, complicating any
> > maintenance on the grammar itself.
> >
> > Is there a cleaner/different way to achieve this?
> >
> > Cheers
> >
> >
> David,
>
> Are you sure that the language you are parsing (what is it?) does not
> specify some disambiguation rules, for instance in the way that VB.Net
> does?
>
> What you would normally do is take the '$' '!' '&' out of the lexer
> rule for ID, then apply the disambiguation in the lexer and parser. For
> instance, in VB, '!' followed by an IDCHAR is a separator, otherwise it
> is a type specifier. So you can do this:
>
> BANG        :
>                ('!' IDSTART)=>    '!'      // Must be a separator as per Lang spec 9.0 $2.2.1
>            |    '!'                        // Type specifier
>                { $type = T_SINGLE; }
>            ;
>
>
> Then the other part in the parser says:
>
> ident: (identifier) (DOT^ (identifier|keyword))*
>
>        (
>            {( (CommonTokenStream)input ).get( input.index()-1 ).getType() != WS }?=>
>
>            variableType
>
>         )?
>
>
> Where variableType is T_SINGLE, T_DOUBLE and so on.
>
> Whatever you are parsing may have similar rules. For instance, as it is
> a trailing element, it cannot start another identifier, so you could do
> this:
>
> IDENTIFIER: ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' |
> '0'..'9')*
>           (    {checkIsTypeChar()}?=>
>
>                ('$' | '!' | '&')
>
>              )?
>
> Where that method checks the following character, then checks that the
> character following that is not the start of an ID.
>
> However, I suspect that you need to make them separate tokens, and change
> the type if possible, or use syntactic predicates in the parser when you
> cannot change the type in the lexer.
>
> Jim

Thanks for the response Jim,

I think your suggestion of splitting the IDENT in the lexer and using them
in the parser was probably too much of a simple solution at the time I was
trying to solve the problem, but that's probably going to work for me!
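
For the archive, here is a minimal sketch of the lookahead check described
above, written as standalone Java rather than as an ANTLR action so it can be
run on its own. The class name, the helper names and the small tokenize()
driver are hypothetical, and checkIsTypeChar() here takes the input and a
position explicitly, which is an assumption; inside a real lexer it would
look at the character stream instead. Note that this sketch applies the same
lookahead to '$' and '&' as to '!', which may or may not match the language
being parsed.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeCharLookahead {

    // Matches the first character class of the IDENTIFIER rule.
    static boolean isIdStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    // Matches the repeated character class of the IDENTIFIER rule.
    static boolean isIdPart(char c) {
        return isIdStart(c) || (c >= '0' && c <= '9');
    }

    static boolean isTypeChar(char c) {
        return c == '$' || c == '!' || c == '&';
    }

    // Hypothetical stand-in for checkIsTypeChar(): pos is the index of the
    // character immediately after the identifier body. Returns true only if
    // that character is a type character AND the character after it does
    // not start another identifier.
    static boolean checkIsTypeChar(String input, int pos) {
        if (pos >= input.length() || !isTypeChar(input.charAt(pos))) {
            return false;
        }
        int next = pos + 1;
        return next >= input.length() || !isIdStart(input.charAt(next));
    }

    // Toy driver (an assumption, not part of any grammar): splits the input
    // into identifiers and single-character tokens, skipping whitespace.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (isIdStart(c)) {
                int start = i;
                while (i < input.length() && isIdPart(input.charAt(i))) {
                    i++;
                }
                if (checkIsTypeChar(input, i)) {
                    i++; // the type character belongs to this identifier
                }
                tokens.add(input.substring(start, i));
            } else {
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo!bar"));     // [foo, !, bar]
        System.out.println(tokenize("foo! + bar$")); // [foo!, +, bar$]
    }
}
```

With this check, "foo!bar" leaves the "!" as its own token for the parser,
while a trailing "!" that is not followed by an identifier start stays
attached to the identifier.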

-David