[antlr-interest] Identifiers with Spaces

Fri Nov 26 20:42:42 PST 2010

Michael:

There are workarounds for your specific problem, but in general I would suggest revising your approach. Building spaces into the ID token is going to have problems with some common typos, e.g. two spaces where one was intended. It is also going to have problems with spaces in other contexts. What you are trying to do is generally better addressed during semantic analysis than during lexical analysis. I suggest the following approach:

id_sequence : ID ID* ;

where ID is whatever you allow in an identifier between spaces. Then, during semantic analysis, wherever you find an id_sequence, in effect treat the first ID as a function that takes the rest of the id_sequence as an argument and returns an "identifier". This analysis can be performed recursively for each ID in the sequence. The implementation is straightforward but tedious, and of course left to the student.
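
For illustration only, here is a minimal sketch of that idea in Python rather than in ANTLR-generated code. It is not ANTLR API: the ID_PATTERN regex and the tokenize and combine helpers are hypothetical names, and the rule for what counts as an ID between spaces is an assumption. The tokenizer drops whitespace, and combine folds the resulting id_sequence recursively, with the first ID taking the identifier built from the rest.

```python
import re

# Assumption: an ID is a letter or underscore followed by letters, digits, or underscores.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def tokenize(text):
    """Return the ID tokens in text, ignoring any whitespace between them."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace never becomes part of a token
            continue
        match = ID_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def combine(ids):
    """Recursively fold an id_sequence into a single identifier string."""
    if not ids:
        raise ValueError("id_sequence needs at least one ID")
    head, rest = ids[0], ids[1:]
    if not rest:
        return head
    # The first ID is applied to the identifier built from the remaining IDs.
    return head + " " + combine(rest)
```

Because the spaces are discarded before the identifier is rebuilt, "a a" and "a  a" produce the same identifier, which is the double-space case mentioned above.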

On Nov 26, 2010, at 3:31 PM, Michael Bosch wrote:

> Hi,
> 
> I am trying to parse a language where identifiers can contain
> spaces but otherwise spaces need to be ignored.  I have a problem
> getting the ANTLR tokenizer to do this.  My problem can be
> reproduced with the following grammar:
> 
> grammar test2;
> s	:	ID ' ';
> ID	:	'a' (' ' 'a')*;
> 
> No warnings / errors about ambiguities are reported but the
> tokenizer fails on inputs "a " and "a a ".
> 
> When generating the code it turns out that the decision to enter
> / repeat the (' ' 'a') part is based only on a one character
> lookahead.  A two character lookahead would fix my problem.
> 
> My understanding was that ANTLR was using unbounded lookahead as
> needed to resolve such decisions and would be able to recognize
> any regular language with no trouble.
> 
> Trying to understand the problem better, I created a grammar where
> the parser should behave just like the lexer in the test2
> grammar.  I did this by converting lexer rules to parser rules,
> adding a token rule that combines all tokens and creating a
> tokenstream that matches any number of tokens just to simulate
> the repeated getting of tokens from the lexer:
> 
> grammar test3;
> tokenstream
> 	:	token*;
> token	:	id | ' ';
> id	:	'a' (' ' 'a')*;
> 
> Compiling grammar test3 reports an ambiguity causing some
> transition to be disabled.  The resulting parser behaves
> differently from the test2 lexer:
> 
> - Any input with leading space makes the parser match nothing
> - Everything else parses just as intended, e.g. "a a a  " is
>  grouped as "a a a", " ", " ".
> 
> My questions are:
> 
> - Is there a pragmatic solution for my original identifiers with
>  spaces language (Preferably one that is target language independent)?
> - Why is the lexer for test2 only using a 1 character lookahead?
> - How does ANTLR resolve ambiguities in the lexer? Apparently
>  keywords are always preferred over general identifiers but I have
>  not found an explanation why this is the case.
> - Why is the behavior of the parser in test3 different than the
>  lexer in test2?
> 
> Michael