[antlr-interest] Ambiguity between identifier and operator

Jim Idle jimi at temporal-wave.com
Thu Jun 11 08:59:22 PDT 2009


David Chipping wrote:
> I'm having some trouble trying to work out the best approach for some 
> ambiguity.
>
> If I have an identifier token defined as:
>
> IDENTIFIER: ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' | 
> '0'..'9')* ('$' | '!' | '&')?
>
> Where the last part of the rule indicates an optional implicit type 
> character.
>
> Unfortunately, "!" is also a binary operator that works with 
> identifiers. For example, the following is valid: 
>
> foo!bar
>
> and indicates a ! operator with a left side of foo and right of bar. 
> This is only possible when the left side identifier doesn't end with an 
> implicit type character.
>
> I was initially thinking of doing some token re-writing to determine if 
> an identifier (without an implicit type char)  followed by a "!" is 
> followed by another identifier and then emit a separate "!" so that can 
> be picked up by the parser. But I'm not keen on doing this, as (as far 
> as I can see, please correct me if I'm wrong) this takes some of my 
> lexing rules out of my grammar and into another place, complicating any 
> maintenance on the grammar itself.
>
> Is there a cleaner/different way to achieve this?
>
> Cheers
>
>   
David,

Are you sure that the language you are parsing (what is it?) does not 
specify some disambiguation rules, for instance in the way that VB.Net does?

What you would normally do is take the '$' '!' '&' out of the lexer 
rule for ID, then apply the disambiguation in the lexer and parser. For 
instance, in VB, '!' followed by an IDCHAR is a separator, otherwise it 
is a type specifier. So you can do this:

BANG        :
                ('!' IDSTART)=>    '!'      // Must be a separator as per Lang spec 9.0 $2.2.1
            |    '!'                        // Type specifier
                { $type = T_SINGLE; }
            ;


Then the other part in the parser says:

ident: (identifier) (DOT^ (identifier|keyword))*

        (
            // Only accept a type specifier if the previous token was not whitespace
            {( (CommonTokenStream)input ).get( input.index()-1 ).getType() != WS }?=>

            variableType

         )?
     ;


Where variableType is T_SINGLE, T_DOUBLE and so on.

Whatever you are parsing may have similar rules. For instance, as it is 
a trailing element, it cannot start another identifier, so you could do 
this:

IDENTIFIER: ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' | '0'..'9')*
           (    {checkIsTypeChar()}?=>

                ('$' | '!' | '&')

           )?
          ;

Where checkIsTypeChar() checks that the next character is a type character, then checks that the character after that is not the start of an ID. 
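
As a rough illustration only, the check described above could be sketched in Java along these lines. The class name TypeCharCheck, the helpers isIdStart and isTypeChar, and the two-argument signature are all assumptions made for this example; the rule above calls a no-argument checkIsTypeChar(), which inside an ANTLR 3 lexer could delegate by passing input.LA(1) and input.LA(2).

```java
// Hypothetical helper, not part of ANTLR: decides whether a trailing
// '$', '!' or '&' should be consumed as an implicit type character.
final class TypeCharCheck {

    // Same character set as the first part of the IDENTIFIER rule.
    static boolean isIdStart(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    static boolean isTypeChar(int c) {
        return c == '$' || c == '!' || c == '&';
    }

    // la1 and la2 are the next two characters of lookahead; -1 means end
    // of input. In an ANTLR 3 lexer these would be input.LA(1) and
    // input.LA(2) (an assumption about how the sketch gets wired in).
    static boolean checkIsTypeChar(int la1, int la2) {
        // A type character must not be followed by the start of an ID,
        // otherwise "foo!bar" would lex as "foo!" followed by "bar".
        return isTypeChar(la1) && !isIdStart(la2);
    }

    // Convenience overload for trying the logic on a plain string;
    // pos is the index of the character right after the identifier body.
    static boolean checkIsTypeChar(String text, int pos) {
        int la1 = pos < text.length() ? text.charAt(pos) : -1;
        int la2 = pos + 1 < text.length() ? text.charAt(pos + 1) : -1;
        return checkIsTypeChar(la1, la2);
    }
}
```

With this sketch, the '!' in "foo!bar" is left for the operator rule, while the '!' in "foo! " or at the very end of the input is taken as the type character. Note that it applies the same test to '$' and '&'; if only '!' is ambiguous in your language, the check could be narrowed accordingly.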

However, I suspect that you need to make them separate tokens, and change the type if possible, or use syntactic predicates in the parser when you cannot change the type in the lexer. 

Jim







