[antlr-interest] Identifiers with Spaces

Fri Nov 26 18:38:36 PST 2010

Greetings!

On Fri, 2010-11-26 at 23:31 +0100, Michael Bosch wrote:
> Hi,
> 
> I am trying to parse a language where identifiers can contain
> spaces but otherwise spaces need to be ignored.  I have a problem
> getting the ANTLR tokenizer to do this.  My problem can be
> reproduced with the following grammar:
> 
> grammar test2;
> s	:	ID ' ';
> ID	:	'a' (' ' 'a')*;
> 
> No warnings / errors about ambiguities are reported but the
> tokenizer fails on inputs "a " and "a a ".
> 
> When generating the code it turns out that the decision to enter
> / repeat the (' ' 'a') part is based only on a one character
> lookahead.  A two character lookahead would fix my problem.
> 
> My understanding was that ANTLR was using unbounded lookahead as
> needed to resolve such decisions and would be able to recognize
> any regular language with no trouble.
> 
> Trying to understand the problem better created a grammar where
> the parser should behave just like the lexer in the test2
> grammar.  I did this by converting lexer rules to parser rules,
> adding a token rule that combines all tokens and creating a
> tokenstream that matches any number of tokens just to simulate
> the repeated getting of tokens from the lexer:
> 
> grammar test3;
> tokenstream
> 	:	token*;
> token	:	id | ' ';
> id	:	'a' (' ' 'a')*;
> 
> Compiling grammar test3 reports an ambiguity causing some
> transition to be disabled.  The resulting parser behaves
> different from the test2 lexer:
> 
> - Any input with leading space makes the parser match nothing
> - Everything else parses just as intended, e.g. "a a a  " is
>   grouped as "a a a", " ", " ".
> 
> My questions are:
> 
> - Is there a pragmatic solution for my original identifiers with
>   spaces language (Preferably one that is target language independent)?

dis-allow blank(s) as the very last character of an identifier.

i suggest something like this (untested):

ID : ID_HEAD ( ' '* ID_TAIL )* ;
fragment ID_HEAD : LETTER ;
fragment ID_TAIL : LETTER | DIGIT | '_' ;

and of course ID_TAIL above should include all the non-blank characters
you intend to be valid.

basically we are saying that within an ID, a blank(s) may appear but if
it does appear it must be followed by at least one ID_TAIL character.

> - Why is the lexer for test2 only using a 1 character lookahead?

because that is all that is needed to disambiguate the situation. recall
that the lexer operates without any knowledge of parsing context. so, to
the lexer, (assuming a rule like ID:LETTER(' '|LETTER)*) "a " is clearly
an ID and not an 'a' followed by ' '.

> - How does ANTLR resolve ambiguities in the lexer? Apparently
>   keywords are always preferred over general identifiers but I have
>   not found an explanation why this is the case.

currently, ANTLR lexers greedily suck up all input that matches the
longest possible lexical Token. however, when 2 tokens match the exact
same input sequence, then the first lexer rule in the grammar wins.
implicit lexer rules (e.g. those lexer rules created for the 'quoted'
keyword strings in the parser rules) are placed first in the set of
lexer rules. so "a" would be the keyword 'a' and "ab" would be an ID
rather than a 'a' followed by an ID.

> - Why is the behavior of the parser in test3 different than the
>   lexer in test2?

moving lexer rules into the parser changes the ball-game.

hope this helps....
   -jbb