[antlr-interest] Keywords as identifiers in ANTLR 3.0
Gavin Lambert
antlr at mirality.co.nz
Wed Aug 29 05:28:01 PDT 2007
At 09:51 29/08/2007, Ayende Rahien wrote:
>I know that the question was raised before, and I checked the wiki
>for the explanation on it, but I can't seem to follow the
>solutions there and can't get it to work.
[...]
>// Multi word keywords
>ORDER_BY : ORDER WS+ BY;
>GROUP_BY : GROUP WS+ BY;
>
>IDENTIFIER
> : ID_LETTER+
> ;
>
>fragment
>AS : 'as';
>fragment
>ORDER : 'order';
>fragment
>GROUP : 'group';
>fragment
>BY : 'by';
ANTLR currently can't "see" other lexer tokens when it's making
predictions about which lexer rule to pick (actually that's not
quite right, but it's hard to explain). Since your ORDER_BY rule
is public and your ORDER rule is a fragment, when faced with the
input "order by order", ANTLR will correctly decide the first bit
is an ORDER_BY, then see the next bit starting out as "order"
again and start trying to generate another ORDER_BY, finally
throwing an exception when it can't find the "by".
The workaround for this is a bit ugly, but it's fairly
straightforward once you get your head around it. Basically where
there's a "common stem" ambiguity in the grammar you need to merge
rules together and provide type-change alternatives for the
failure cases. In this case:
    ORDER_BY : ORDER
        ( (WS+ BY) => WS+ BY
        | /*nothing*/ { $type = IDENTIFIER; }
        );
In other words: match 'order' first. Then look ahead to see if
the next bit is whitespace followed by 'by' -- if so, then the
rule matches and becomes an ORDER_BY token. Otherwise we take the
alternate path, and return the 'order' we already matched (and
nothing else) as an IDENTIFIER instead.
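The same merge should carry over to the GROUP_BY rule from the quoted grammar. A minimal sketch, assuming the GROUP and BY fragments are as quoted and that WS is defined elsewhere in the grammar (it is not shown in the question):

```antlr
GROUP_BY : GROUP
    ( (WS+ BY) => WS+ BY                    // "group", whitespace, then "by"
    | /*nothing*/ { $type = IDENTIFIER; }   // bare "group": retag as IDENTIFIER
    );
```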
(There are other alternatives; another common approach would be to
ditch the ORDER_BY rule entirely and put some extra code in
IDENTIFIER that recognises if the id it just matched looked like a
keyword, and then change the type accordingly.)
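A rough sketch of that second approach, limited to a single-word keyword. It assumes the Java target, uses AS purely as a stand-in keyword, and relies on the quoted fragment rule AS to supply the token type. A multi-word keyword such as "order by" would need extra lookahead that is not shown here.

```antlr
IDENTIFIER
    : ID_LETTER+
      {
        // Assumed Java target: getText() returns the text matched so far.
        // Retag the token if it spells a keyword; otherwise it stays IDENTIFIER.
        if ("as".equals(getText())) { $type = AS; }
      }
    ;
```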
Note that both of these will only help you in cases where the
keywords are lexically distinct from identifiers in a context-free
manner -- which is the case here, since "order by" is a keyword,
but both "order" and "by" alone are merely identifiers. If you
had a case where, e.g., "select" could be a statement in one context
and an identifier in a different context, then you would have to do
some more work. Again, there are a couple of different approaches
here; the one I prefer is just to make the parser accept both
IDENTIFIER and SELECT tokens as an 'identifier'.
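A minimal sketch of that last approach, assuming a SELECT lexer token exists and using a hypothetical parser rule named "identifier":

```antlr
// Parser rule: reference this wherever a name is allowed,
// instead of referencing the IDENTIFIER token directly.
identifier
    : IDENTIFIER
    | SELECT    // keyword accepted as a plain name in this position
    ;
```

Statement rules would keep matching SELECT directly, so the lexer never has to know which context it is in.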