[antlr-interest] Handling optional spaces

Sun Oct 7 21:27:05 PDT 2007

I had a look at the generated code. It's a bug, IMO. I'm surprised there 
wasn't a warning emitted.

CLIFFS:
1. It should have griped at you.
2. You need to switch to the "ANTLR paradigm" to get around whitespace 
issues.

The "reason" for the difference is that you are using a combined 
parser/lexer grammar. So the first case generated a parser that expects 
to see Token#1, Token#2, Token#1 on its input (assuming that OptSpace = 
Token#1, and ID = Token#2).

The "inline" version generated code that handled the detection of 
optional spaces in place. As a result, it was expecting {do some work} 
Token#2 {do some work}.

The second form (a bare Token#2) is what the lexer was actually giving 
it, because your OptSpace can match an empty string. Given an empty 
string, the lexer has the choice of doing nothing or generating 
OptSpace. It chooses to "do nothing" and gets on with processing the "a".
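
If you do want the parser to see spaces as real tokens, one option (a 
sketch only, not the only way to do it) is to make the lexer rule match 
exactly one space, so it can never match the empty string, and to put 
the repetition in the parser rule instead. The names Expr and SPACE are 
made up for illustration; they are not from your grammar.

```antlr
grammar Expr;

// Parser rule: the optionality lives here, not in the lexer.
expr  : SPACE* ID SPACE* ;

// Lexer rules: each one consumes at least one character.
ID    : ('a'..'z'|'A'..'Z')+ ;
SPACE : ' ' ;
```

With this shape the lexer only emits SPACE when there is an actual space 
in the input, so "a" tokenizes to a lone ID and should still match expr.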

The approach "recommended" by ANTLR seems to be to do a "positive 
recognition" of whitespace, and then throw it away or hide it. Hence 
you'll see definitions like

WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;

This recognizes that WS is a token separate from other tokens (so the 
Lexer knows to stop working on those tokens and work on this one) but 
then once the token is recognized, the skip() chucks it in the trash.
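
Put together with your rules, a minimal sketch of that approach might 
look like the following. The grammar name Expr is made up, and skip() 
assumes the default Java target; setting $channel=HIDDEN; in the action 
instead would be the "hide it" variant.

```antlr
grammar Expr;

// The parser never mentions whitespace; it only sees ID tokens.
expr : ID ;

ID   : ('a'..'z'|'A'..'Z')+ ;

// Matches one or more whitespace characters, then discards the token.
WS   : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
```

With a grammar like this, "a", " a" and "a  " should all reach the 
parser as a single ID token.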

=Austin

Justin Crites wrote:
> This is the full grammar that fails to parse "a" 
> (MismatchedTokenException):
>  
> expr     :    OptSpace ID OptSpace;
> ID  :   ('a'..'z'|'A'..'Z')+ ;
> OptSpace :    ' '*;
>
> This is the full grammar that succeeds:
>
> expr     :    ' '* ID ' '*;
> ID  :   ('a'..'z'|'A'..'Z')+ ;
>
> These grammars are identical except that in the latter I have replaced 
> OptSpace with its definition in the rule "expr".
>
> In my mind these grammars should behave identically -- I would expect 
> the grammar specification to follow a "substitution rule" such that if 
> I have a rule A : X; then I can replace instances of "A" in other 
> rules with simply "X" and get identical behavior.  However, even 
> though OptSpace : ' '*; the rule
>
> expr : OptSpace ID OptSpace;
>
> behaves differently than:
>
> expr : ' '* ID ' '*;  // substituting ' '* for OptSpace
>
> Does this clarify my question?  Thank you.
>
> -- 
> Justin Crites