[antlr-interest] Identifiers with Spaces

Fri Nov 26 14:31:56 PST 2010

Hi,

I am trying to parse a language where identifiers can contain
spaces but otherwise spaces need to be ignored.  I have a problem
getting the ANTLR tokenizer to do this.  My problem can be
reproduced with the following grammar:

grammar test2;
s	:	ID ' ';
ID	:	'a' (' ' 'a')*;

No warnings / errors about ambiguities are reported but the
tokenizer fails on inputs "a " and "a a ".

Looking at the generated code, it turns out that the decision to
enter / repeat the (' ' 'a') part is based on only one character
of lookahead.  Two characters of lookahead would fix my problem.
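
As far as I can tell, the loop behaves like the following
hand-written Python sketch.  This is not ANTLR output; match_id,
LexError and the lookahead parameter are names I made up purely
for illustration.

```python
class LexError(Exception):
    """Raised when the simulated lexer rule cannot match."""


def match_id(text, pos, lookahead):
    """Imitate the rule  ID : 'a' (' ' 'a')*  starting at offset pos.

    lookahead is the number of characters (1 or 2) inspected when
    deciding whether to enter / repeat the (' ' 'a') part.
    Returns the offset just past the matched identifier.
    """
    if text[pos:pos + 1] != 'a':
        raise LexError("expected 'a' at offset %d" % pos)
    pos += 1
    while True:
        # The loop decision: compare only the first `lookahead`
        # characters of " a" with the upcoming input.
        if text[pos:pos + lookahead] != ' a'[:lookahead]:
            break
        # Once the loop is entered, both characters must match.
        if text[pos:pos + 2] != ' a':
            raise LexError("expected ' a' at offset %d" % pos)
        pos += 2
    return pos
```

With lookahead=1 the input "a " raises LexError, because the loop
commits on seeing the space alone.  With lookahead=2 the same
input returns 1, leaving the trailing space for the parser.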

My understanding was that ANTLR uses unbounded lookahead as
needed to resolve such decisions and is able to recognize any
regular language without trouble.

To understand the problem better, I created a grammar in which
the parser should behave just like the lexer in the test2
grammar.  I did this by converting the lexer rules to parser
rules, adding a token rule that combines all tokens, and creating
a tokenstream rule that matches any number of tokens, to simulate
the repeated fetching of tokens from the lexer:

grammar test3;
tokenstream
	:	token*;
token	:	id | ' ';
id	:	'a' (' ' 'a')*;

Compiling grammar test3 reports an ambiguity that causes some
transition to be disabled.  The resulting parser behaves
differently from the test2 lexer:

- Any input with a leading space makes the parser match nothing.
- Everything else parses just as intended, e.g. "a a a  " is
  grouped as "a a a", " ", " ".
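
For reference, the grouping I am after is what a backtracking
regular-expression matcher produces.  A minimal Python sketch
(not ANTLR; TOKEN and tokenize are names I made up for
illustration):

```python
import re

# Either an identifier  a( a)*  or a single space.
TOKEN = re.compile(r"a(?: a)*| ")


def tokenize(text):
    """Split text into tokens, taking the regex match at each offset."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("no token at offset %d" % pos)
        tokens.append(m.group())
        pos = m.end()
    return tokens
```

Here tokenize("a a a  ") gives ["a a a", " ", " "] and
tokenize("a ") gives ["a", " "], which is the behavior I would
like from the test2 lexer.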

My questions are:

- Is there a pragmatic solution for my original identifiers with
  spaces language (preferably one that is independent of the
  target language)?
- Why does the lexer for test2 use only one character of lookahead?
- How does ANTLR resolve ambiguities in the lexer? Apparently
  keywords are always preferred over general identifiers, but I
  have not found an explanation of why this is the case.
- Why is the behavior of the parser in test3 different from that
  of the lexer in test2?

Michael