[antlr-interest] Keywords as identifiers in ANTLR 3.0

Wed Aug 29 05:28:01 PDT 2007

At 09:51 29/08/2007, Ayende Rahien wrote:
>I know that the question was raised before, and I checked the wiki 
>for the explanation on it, but I can't seem to follow the 
>solutions there and can't get it to work.
[...]
>// Multi word keywords
>ORDER_BY    :    ORDER WS+ BY;
>GROUP_BY    :    GROUP WS+ BY;
>
>IDENTIFIER
>     : ID_LETTER+
>     ;
>
>fragment
>AS :    'as';
>fragment
>ORDER    : 'order';
>fragment
>GROUP    : 'group';
>fragment
>BY        : 'by';

ANTLR currently can't "see" other lexer tokens when it's making 
predictions about which lexer rule to pick (actually that's not 
quite right, but it's hard to explain).  Since your ORDER_BY rule 
is public and your ORDER rule is a fragment, when faced with the 
input "order by order", ANTLR will correctly decide the first bit 
is an ORDER_BY, then see the next bit starting out as "order" 
again and start trying to generate another ORDER_BY, finally 
throwing an exception when it can't find the "by".

The workaround for this is a bit ugly, but it's fairly 
straightforward once you get your head around it.  Basically where 
there's a "common stem" ambiguity in the grammar you need to merge 
rules together and provide type-change alternatives for the 
failure cases.  In this case:

ORDER_BY : ORDER
            (  (WS+ BY) => WS+ BY
            |  /*nothing*/ { $type = IDENTIFIER; }
            );

In other words: match 'order' first.  Then look ahead to see if 
the next bit is whitespace followed by 'by' -- if so, then the 
rule matches and becomes an ORDER_BY token.  Otherwise we take the 
alternate path, and return the 'order' we already matched (and 
nothing else) as an IDENTIFIER instead.
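
The GROUP_BY rule in the quoted grammar has the same shape, so it 
presumably needs the same treatment.  A sketch, assuming GROUP, BY, 
WS and IDENTIFIER are defined as in the original grammar:

```antlr
// Match 'group' first, then commit to GROUP_BY only if whitespace
// and 'by' follow; otherwise emit the bare 'group' as an IDENTIFIER.
GROUP_BY : GROUP
            (  (WS+ BY) => WS+ BY
            |  /*nothing*/ { $type = IDENTIFIER; }
            );
```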

(There are other alternatives; another common approach would be to 
ditch the ORDER_BY rule entirely and put some extra code in 
IDENTIFIER that recognises if the id it just matched looked like a 
keyword, and then change the type accordingly.)
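
A minimal sketch of that second approach, assuming the default Java 
target and a combined grammar.  The single-word keyword 'from' and 
its FROM token are hypothetical, picked only for illustration; a 
multi-word keyword such as "order by" would still need lookahead 
like the merged rule above.

```antlr
// Hypothetical: FROM is declared as a token type only, with no
// lexer rule of its own, so it cannot compete with IDENTIFIER.
tokens { FROM; }

IDENTIFIER
    : ID_LETTER+
      // After matching, check the text and retype it if it is
      // exactly the keyword.
      { if ("from".equals(getText())) $type = FROM; }
    ;
```

With more than a handful of keywords, a lookup table in the lexer's 
members section would probably read better than a chain of ifs.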

Note that both of these will only help you in cases where the 
keywords are lexically distinct from identifiers in a context-free 
manner -- which is the case here, since "order by" is a keyword, 
but both "order" and "by" alone are merely identifiers.  If you 
had a case where e.g. "select" could be a statement keyword in one 
context and an identifier in another, then you have to do some 
more work.  Again, there are a couple of different approaches 
here; the one I prefer is just to make the parser accept both 
IDENTIFIER and SELECT tokens as an 'identifier'.
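
As a rough sketch of that last approach (the SELECT lexer rule is 
hypothetical, since the quoted grammar doesn't define one):

```antlr
// Hypothetical keyword rule, listed before IDENTIFIER so that it is
// preferred when the input is exactly 'select'.
SELECT : 'select';

// Parser rule: wherever the language allows a name, accept either
// token type.
identifier
    : IDENTIFIER
    | SELECT
    ;
```

The other parser rules would then refer to 'identifier' rather than 
to IDENTIFIER directly, and use SELECT on its own only where the 
statement keyword is expected.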