[antlr-interest] Optional keyword causes ambiguity in parser
Gavin Lambert
antlr at mirality.co.nz
Fri Apr 18 03:45:40 PDT 2008
At 21:11 18/04/2008, Ramon Verbruggen wrote:
>Generally, if I have a largish grammar in perfect working order,
>and I want to make one keyword optional, but this causes ANTLR
to
>have ambiguity in the parser (Input such as 'Identifier' can be
>matched in multiple ways), how would you solve this?
[...]
>> statementBody: statementList returnStatement? EOF;
>> returnStatement: 'return' expression ';'*;
>> // <= 'return' should be optional
>> statementList: (statement ';'*)*;
>> statement: addressable;
>> expression: addressable( '*' addressable)*;
>> addressable: Identifier ( '.' Identifier '()' )*;
>> Identifier: ('a'..'z')+;
Think of what decisions ANTLR has to make (assuming you've made it
optional), and you'll see why it thinks it's ambiguous.
Let's take as input "foo;", which is an Identifier followed by a
';'.
We start out with a statementBody, which then has a statementList
optionally followed by a returnStatement.
- statementList: statement followed by ';'s. An addressable is a
statement. An Identifier is an addressable. Valid path.
- returnStatement: can be just an expression if 'return' is
optional. An Identifier is an expression. Valid path.
statementBody is allowed to have zero or more statements followed
by zero or one returnStatement. But how is it to know whether the
Identifier it just saw is a statement or a
returnStatement? Answer: it can't.
Now, you've said that you want the "return" keyword to be
optional. One way you could do that would be to permit an
expression as a valid statement:
statementBody: statementList returnStatement? EOF;
returnStatement: 'return' expression;
statementList: (statement ';'*)*;
statement: expression;
expression: addressable( '*' addressable)*;
addressable: Identifier ( '.' Identifier '()' )*;
Identifier: ('a'..'z')+;
There are two downsides to this, of course; the first is that
you've widened the possible inputs (which may not be acceptable),
and the second is that it isn't very easy to pick off the final
expression statement to give it special handling, if you want to.
Another thing you could do is to remove 'addressable' from the
list of possible statements. So long as anything that can match
'expression' cannot also match 'statement', the ambiguity is gone.
Yet another thing is to re-examine the parentheses in
'addressable'. Is 'Identifier' really supposed to be an
addressable by itself? Should the parentheses be outside the loop
instead (so that 'foo' isn't an addressable, but 'foo()' is)?
If you can't change the input language in this way, then you'll
have to resolve the ambiguity with sempreds. The main one (since
it's in a loop) is the 'addressable' in 'statement' -- you have to
add a sempred telling it to fail any addressables that should be
interpreted as return statements instead. Presumably they're
distinct enough that you can tell the difference. (If you can't
tell the difference, then you probably shouldn't be trying to
remove that keyword.)
Beyond these suggestions, I'm not really sure. I've usually been
working with DSLs that I'm in full control of, so I've tended to
designed languages that are easy to parse :)
More information about the antlr-interest
mailing list