[antlr-interest] Handling optional spaces

Mon Oct 8 19:57:52 PDT 2007

Justin,

What you are saying is that your language is 99% whitespace-independent. 
That immediately tells me that the default behavior should be to 
recognize whitespace separately and move it out of the way.

The requirement to know about whitespace separating an identifier and an 
opening parenthesis can be satisfied in two ways:

1- Put the whitespace on the hidden channel, then check that channel to 
see if whitespace came between two tokens:

s_expression
    : '(' ID s_args ')'
    ;

s_args
    : (s_expression
      | {input.LA(1) == ID
         && input.get(input.index() + 1).getChannel() == HIDDEN}? => ID
        // **** LOOK HERE *****
      | c_expression
      )*
    ;

2- Alternatively, you could treat the problem as being lexical. This 
makes it simpler to write (meaning your hands don't get dirty with the 
internals of ANTLR) but gives a different result:

ID : ('a'..'z') + ;

FUNCTION_CALL: ID '(' ;

You would then have to compensate in your parser for the 'unmatched' 
parenthesis:

c_function_call: FUNCTION_CALL args ')' ;
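
To see what "a different result" means at the token level, here is a minimal hand-written Java sketch of the lexical approach. It is not ANTLR-generated code and uses no ANTLR classes. The class name FunctionCallLexerSketch and the token names LPAREN and RPAREN are assumptions made for illustration, and whitespace and /* */ comments are simply dropped rather than put on a channel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; it only models the ID and FUNCTION_CALL rules.
class FunctionCallLexerSketch {
    // Returns the token type names for src; whitespace and comments are dropped.
    static List<String> lex(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ') {
                i++;                                  // whitespace separates tokens
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);
                if (end < 0) throw new IllegalArgumentException("unterminated comment");
                i = end + 2;                          // a comment separates tokens too
            } else if (c >= 'a' && c <= 'z') {
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') i++;
                if (i < src.length() && src.charAt(i) == '(') {
                    i++;                              // ID immediately followed by '('
                    out.add("FUNCTION_CALL");
                } else {
                    out.add("ID");
                }
            } else if (c == '(') {
                out.add("LPAREN");
                i++;
            } else if (c == ')') {
                out.add("RPAREN");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("(z f(g))"));
        System.out.println(lex("(z f (g))"));
    }
}
```

In this sketch "(z f(g))" lexes to LPAREN ID FUNCTION_CALL ID RPAREN RPAREN, while "(z f (g))" lexes to LPAREN ID ID LPAREN ID RPAREN RPAREN, so the parser can tell the two apart from token types alone.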

=====

The second approach is "easier" to do, in that you won't wind up having 
to debug the generated Java code when something goes awry. But if you've 
been around C for very long you'll remember the old "preprocessor 
tricks" that were used for things like token-pasting, etc.

How would you parse this code?

  (z f/*look, no spaces! */(g))

Option 1 treats the comment as a separator (a space-type token), while 
option 2 recognizes that there are three tokens (ID, COMMENT, LPAREN). 
If you want that input to invoke the C function f(g), you could change 
option 1 to look for a space token specifically, rather than just any 
off-channel token. The change to option 2 is obvious (insert an 
optional COMMENT) and hideous.
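
The difference between "any off-channel token" and "a space token" in option 1 can be modeled with a small self-contained Java sketch. Everything in it is an assumption for illustration: the Tok class, the channel numbers, and the helper names followedByHidden and followedByWhitespace are hypothetical stand-ins, not ANTLR classes or generated code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a token buffer with channels; these are not ANTLR classes.
class HiddenChannelSketch {
    static final int DEFAULT = 0;
    static final int HIDDEN = 99;   // arbitrary channel number for this sketch

    static final class Tok {
        final String type;
        final int channel;
        Tok(String type, int channel) { this.type = type; this.channel = channel; }
    }

    // Keeps whitespace and comments in the buffer, but marks them HIDDEN.
    static List<Tok> lex(String src) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '(') {
                out.add(new Tok("LPAREN", DEFAULT));
                i++;
            } else if (c == ')') {
                out.add(new Tok("RPAREN", DEFAULT));
                i++;
            } else if (c == ' ') {
                while (i < src.length() && src.charAt(i) == ' ') i++;
                out.add(new Tok("WS", HIDDEN));
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);
                if (end < 0) throw new IllegalArgumentException("unterminated comment");
                i = end + 2;
                out.add(new Tok("COMMENT", HIDDEN));
            } else if (c >= 'a' && c <= 'z') {
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') i++;
                out.add(new Tok("ID", DEFAULT));
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    // Option 1 as written: any off-channel token right after position i separates.
    static boolean followedByHidden(List<Tok> toks, int i) {
        return i + 1 < toks.size() && toks.get(i + 1).channel == HIDDEN;
    }

    // Stricter variant: only a whitespace token right after position i separates.
    static boolean followedByWhitespace(List<Tok> toks, int i) {
        return i + 1 < toks.size() && toks.get(i + 1).type.equals("WS");
    }

    public static void main(String[] args) {
        for (String src : new String[] {"f(g)", "f (g)", "f/*c*/(g)"}) {
            List<Tok> toks = lex(src);
            System.out.println(src + "  hidden=" + followedByHidden(toks, 0)
                + "  ws=" + followedByWhitespace(toks, 0));
        }
    }
}
```

For "f/*c*/(g)" the first check reports a separator and the second does not, which is the choice described above. Both helpers look only one token ahead, so an input like "f/*c*/ (g)" would need a loop over all consecutive hidden tokens.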

=Austin

Justin Crites wrote:
> Austin,
>
> Thanks for taking the time to look at it and explain.  I am not sure 
> how I should fit whitespace into my grammar, though.  Hopefully your 
> generosity will continue long enough to allow me to explain :-)
>
> I have a C-like grammar with standard expressions like "3 + (4 * 5) + 
> f(10)".  What I want to do is allow _optional_ whitespace in many 
> places in expressions, but not everywhere.
>
> Specifically, in a function call "f(x,y,z)" I do not want to allow 
> space between the name of a function and the opening parenthesis.  For 
> example, "f(x)" is valid but "f (x)" is not. [1]  Most other 
> constructs in the language allow unlimited whitespace anywhere.  In 
> fact, this is true for _every_ other construct except that one place. [2]
>
> This is different from other constructs in my language like implicit 
> function calls where any amount of space is valid ("3+1", "3 + 1",
> " 3    +     1", etc).
>
> My initial approach was to include, explicitly, the OptionalWhitespace 
> production everywhere whitespace would be allowed.  This complicates 
> things for a variety of reasons. [3]
>
> For example, my grammar might look like:
>
>      expr : term OptSpace operator OptSpace expr
>      term : ...
>      call : id '(' OptSpace expr OptSpace ')'      // no OptSpace after id
>
> Do you have any advice for me on how I could accomplish the handling 
> of whitespace properly?
>
>
>
> [1] The reason I have made this language design choice is because I am 
> trying to support S-expression-style function calls.  For example, f(a 
> b) == (f a b).  I believe I can fit C-style calls and S-expressions 
> together with some restrictions on whitespace.   (z f (g)) is 
> different than (z f(g) )
>
> [2] Yes, this does make me reconsider from a language design 
> perspective.  But, I suspect there are many such cases in other 
> successful languages, and so would prefer not to disqualify this 
> feature based on the clumsiness or difficulty of the grammar alone.
>
> [3] One complexity seems to be ensuring that the optional space is 
> absorbed by precisely the production following or preceding it, which 
> otherwise leads to ambiguity.
>
> On 10/7/07, Austin Hastings <Austin_Hastings at yahoo.com> wrote:
>
>     I had a look at the generated code. It's a bug, IMO. I'm surprised
>     there wasn't a warning emitted.
>
>     CLIFFS:
>     1. It should have griped at you.
>     2. You need to change into the "antlr paradigm" to get around
>     whitespace issues.
>
>
>     The "reason" for the difference is that you are doing a combined
>     parser/lexer. So the first case generated a parser that expects to see
>     Token#1, Token#2, Token#1 on its input (assuming that OptSpace =
>     Token#1, and ID = Token#2).
>
>     The "inline" version generated code that handled the detection of
>     optional spaces in place. As a result, it was expecting {do some work}
>     Token #2 {do some work}.
>
>     The second form was what the lexer was giving it, because your
>     OptSpace could match an empty string. Given an empty string, the
>     lexer has the choice of doing nothing, or generating OptSpace. It
>     chooses to "do nothing" and get on with processing the "a".
>
>     The approach "recommended" by Antlr seems to be to do a "positive
>     recognition" of white space, and then throw it away or hide it. Hence
>     you'll see definitions like
>
>     WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
>
>     This recognizes that WS is a token separate from other tokens (so the
>     Lexer knows to stop working on those tokens and work on this one) but
>     then once the token is recognized, the skip() chucks it in the trash.
>
>     =Austin
>
>
>
>     Justin Crites wrote:
>     > This is the full grammar that fails to parse "a"
>     > (MismatchedTokenException):
>     >
>     > expr     :    OptSpace ID OptSpace;
>     > ID  :   ('a'..'z'|'A'..'Z')+ ;
>     > OptSpace :    ' '*;
>     >
>     > This is the full grammar that succeeds:
>     >
>     > expr     :    ' '* ID ' '*;
>     > ID  :   ('a'..'z'|'A'..'Z')+ ;
>     >
>     > These grammars are identical except that in the latter I have
>     > replaced OptSpace with its definition in the rule "expr".
>     >
>     > In my mind these grammars should behave identically -- I would
>     > expect the grammar specification to follow a "substitution rule"
>     > such that if I have a rule A : X; then I can replace instances of
>     > "A" in other rules with simply "X" and get identical behavior.
>     > However, even though OptSpace : ' '*; the rule
>     >
>     > expr : OptSpace ID OptSpace
>     >
>     > behaves differently than:
>     >
>     > expr : ' '* ID ' '*;  // substituting ' '* for OptSpace
>     >
>     > Does this clarify my question?  Thank you.
>     >
>     > --
>     > Justin Crites
>     >
>
> -- 
> Justin Crites
>
> E-Mail: <mailto:jcrites at gmail.com>
> IM: <aim:Xiphoris>
> WWW: <http://xiphoris.com>