[antlr-interest] Lex Matching Issues
Ron Hunter-Duvar
ron.hunter-duvar at oracle.com
Mon Jul 19 12:33:27 PDT 2010
Cid Dennis wrote:
> So I am new to ANTLR and have created a grammar but found a strange issue. Because of the structure of the language I am parsing there can be tokens that match reserved works as variables but only when they are in a sub rule that does not use the reserved word.
>
> ...
Hi Cid,
I had to deal with a similar situation, and neither of the solutions
suggested in the wiki worked for me
(http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741), and I
haven't seen any other solutions described in the email. After a lot of
experimentation I came up with a hybrid solution.
One approach described is to define identifier like this:
identifier : KEY1 | KEY2 | ... | ID ;
The reason that this doesn't work is that where you have a production
that says "( KEY1 | identifier ) "Antlr reports a conflict and disables
the second alternative unless it can resolve the conflict with its
extended lookahead. So depending on which order you put them either KEY1
or identifier gets disabled, neither of which gives you a working
parser. The only way you would be able to make this work would be to
define a different identifier production for every different situation,
with the conflicting keywords removed. That might work for a handful of
different keywords, but doesn't scale.
The other suggested solution is to create productions like this:
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
This is even worse than the other solution. I don't think there is any
way to make this work at all. The problem is that the semantic predicate
doesn't get applied when doing looking ahead in other productions, only
when actually trying to make the keyXXX productions. So everywhere you
have "( key1 | identifier )", to Antlr this looks like "( ID | ID )".
This completely throws the lookahead for a loop. In many cases it will
actually generate a parser without warnings, but the actual lookahead
it's doing is just wrong. For example, I had a case something like
"(keyA keyB LEFT_BRACKET) | (keyC keyD keyE)". To Antler's lookahead
analysis this said "(ID ID LEFT_BRACKET)|(ID ID ID)", so it generated
lookahead that basically said "if (LT(3) == LEFT_BRACKET) option1 else
option2". So it was taking other things like "C D (" and trying to parse
them as "A B (", then of course throwing run time exceptions on the
failed semantic predicates.
The hybrid approach I came up with was to define the keywords as
separate token types and define an identifier rule like in the first
approach, but then also create keyword productions similar to the second
approach:
keyIF: {input.LT(1).getText().equals("if")}? IF ;
I know at first glance the semantic predicate seems pointless. But it
causes Antlr to ignore the conflicts that it would have reported
otherwise, resulting in a working grammar.
There are just two caveats with this approach. First, it can also
suppress reporting of real conflicts, causing bugs in your parser. To
avoid this it's a good idea to first develop and test the grammar
without the semantic predicates in the keyword rules and without the
keywords as alternatives in the identifier rule. Once you have the
grammar working this way with no conflicts, then it's safe to put the
other rules that handle the non-reserved keywords.
Second, all those semantic predicates can cause the size of the
generated parser to blow up. It can cause syntactic predicates to have a
lot of special states and make the DFAs huge. This may only be a problem
with the Java runtime, where the JVM imposes extreme size limits on
class size (64KB). With large grammars or complex lookahead situations
it can result in an uncompilable Java class being generated.
Hope this helps,
Ron
--
Ron Hunter-Duvar | Software Developer V | 403-272-6580
Oracle Service Engineering
Gulf Canada Square 401 - 9th Avenue S.W., Calgary, AB, Canada T2P 3C5
All opinions expressed here are mine, and do not necessarily represent
those of my employer.
More information about the antlr-interest
mailing list