[antlr-interest] Optional keyword causes ambiguity in parser

Gavin Lambert antlr at mirality.co.nz
Fri Apr 18 03:45:40 PDT 2008


At 21:11 18/04/2008, Ramon Verbruggen wrote:
 >Generally, if I have a largish grammar in perfect working order, 

 >and I want to make one keyword optional, but this causes ANTLR 
to
 >have ambiguity in the parser (Input such as 'Identifier' can be
 >matched in multiple ways), how would you solve this?
[...]
 >> statementBody:	statementList returnStatement? EOF;
 >> returnStatement:	'return' expression ';'*;
 >> // <= 'return' should be optional
 >> statementList:	(statement ';'*)*;
 >> statement:	addressable;
 >> expression:	addressable( '*' addressable)*;
 >> addressable:	Identifier ( '.' Identifier '()' )*;
 >> Identifier:	('a'..'z')+;

Think of what decisions ANTLR has to make (assuming you've made it 
optional), and you'll see why it thinks it's ambiguous.

Let's take as input "foo;", which is an Identifier followed by a 
';'.

We start out with a statementBody, which then has a statementList 
optionally followed by a returnStatement.

- statementList: statement followed by ';'s.  An addressable is a 
statement.  An Identifier is an addressable.  Valid path.
- returnStatement: can be just an expression if 'return' is 
optional.  An Identifier is an expression.  Valid path.

statementBody is allowed to have zero or more statements followed 
by zero or one returnStatement.  But how is it to know whether the 
Identifier it just saw is a statement or a 
returnStatement?  Answer: it can't.

Now, you've said that you want the "return" keyword to be 
optional.  One way you could do that would be to permit an 
expression as a valid statement:
   statementBody:	statementList returnStatement? EOF;
   returnStatement: 'return' expression;
   statementList:	(statement ';'*)*;
   statement:	expression;
   expression:	addressable( '*' addressable)*;
   addressable:	Identifier ( '.' Identifier '()' )*;
   Identifier:	('a'..'z')+;

There are two downsides to this, of course; the first is that 
you've widened the possible inputs (which may not be acceptable), 
and the second is that it isn't very easy to pick off the final 
expression statement to give it special handling, if you want to.

Another thing you could do is to remove 'addressable' from the 
list of possible statements.  So long as anything that can match 
'expression' cannot also match 'statement', the ambiguity is gone.

Yet another thing is to re-examine the parentheses in 
'addressable'.  Is 'Identifier' really supposed to be an 
addressable by itself?  Should the parentheses be outside the loop 
instead (so that 'foo' isn't an addressable, but 'foo()' is)?

If you can't change the input language in this way, then you'll 
have to resolve the ambiguity with sempreds.  The main one (since 
it's in a loop) is the 'addressable' in 'statement' -- you have to 
add a sempred telling it to fail any addressables that should be 
interpreted as return statements instead.  Presumably they're 
distinct enough that you can tell the difference.  (If you can't 
tell the difference, then you probably shouldn't be trying to 
remove that keyword.)

Beyond these suggestions, I'm not really sure.  I've usually been 
working with DSLs that I'm in full control of, so I've tended to 
designed languages that are easy to parse :)



More information about the antlr-interest mailing list