[antlr-interest] Handling optional spaces

Mon Oct 8 18:03:16 PDT 2007

Austin,

Thanks for taking the time to look at it and explain.  I am not sure how I
should fit whitespace into my grammar, though.  Hopefully your generosity
will continue long enough to allow me to explain :-)

I have a C-like grammar with standard expressions like "3 + (4 * 5) +
f(10)".  What I want to do is allow _optional_ whitespace in many places in
expressions, but not everywhere.

Specifically, in a function call "f(x,y,z)" I do not want to allow space
between the name of a function and the opening parenthesis.  For example,
"f(x)" is valid but "f (x)" is not. [1]  Most other constructs in the
language allow unlimited whitespace anywhere.  In fact, this is true for
_every_ other construct except that one place. [2]

This is different from other constructs in my language like implicit
function calls where any amount of space is valid ("3+1", "3 + 1", " 3
+     1", etc).

My initial approach was to include an explicit OptSpace production
everywhere whitespace is allowed.  This complicates the grammar for a
variety of reasons. [3]

For example, my grammar might look like:

     expr : term (OptSpace operator OptSpace expr)? ;
     term : ... ;
     call : id '(' OptSpace expr OptSpace ')' ;    // no OptSpace after id
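
To make the intent of these rules concrete, here is a minimal hand-written
recognizer sketch in Python (not ANTLR).  It is only an illustration under
assumptions: the names Recognizer and accepts are made up, the operator set
is a guess, and the alternatives for "term" are assumed because that rule is
elided above.  Space is matched explicitly wherever it is allowed, and the
only place it is never consumed is between an identifier and '('.

```python
import re


class Recognizer:
    """Scannerless recognizer: whitespace is matched only where allowed."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def opt_space(self):
        # OptSpace: zero or more spaces or tabs; always succeeds.
        while self.pos < len(self.text) and self.text[self.pos] in " \t":
            self.pos += 1

    def match(self, pattern):
        # Try to match a regex at the current position; advance on success.
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            return False
        self.pos = m.end()
        return True

    def expr(self):
        # expr : term (OptSpace operator OptSpace expr)?
        if not self.term():
            return False
        save = self.pos
        self.opt_space()
        if self.match(r"[+\-*/]"):  # assumed operator set
            self.opt_space()
            return self.expr()
        self.pos = save  # no operator: hand the space back to the caller
        return True

    def term(self):
        # Assumed: term : NUMBER | call | ID | '(' OptSpace expr OptSpace ')'
        if self.match(r"[0-9]+"):
            return True
        if self.match(r"[A-Za-z_]\w*"):
            # call : id '(' ... with NO OptSpace between id and '('
            if self.pos < len(self.text) and self.text[self.pos] == "(":
                return self.parens()
            return True
        return self.parens()

    def parens(self):
        # '(' OptSpace expr OptSpace ')'
        if not self.match(r"\("):
            return False
        self.opt_space()
        if not self.expr():
            return False
        self.opt_space()
        return self.match(r"\)")


def accepts(text):
    """Return True if the whole input is a valid expression."""
    r = Recognizer(text)
    r.opt_space()
    ok = r.expr()
    r.opt_space()
    return ok and r.pos == len(text)
```

In this sketch accepts("f(x)") succeeds and accepts("f (x)") fails, while
"3+1" and "3 +     1" are both accepted.  Handing the space back in expr is
one way to make sure each run of space is consumed by exactly one rule.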

Do you have any advice for me on how I could accomplish the handling of
whitespace properly?

[1] I made this language design choice because I am trying to support
S-expression-style function calls.  For example, f(a b) == (f a b).  I
believe I can fit C-style calls and S-expressions together with some
restrictions on whitespace: (z f (g)) is different from (z f(g) ).
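
As an illustration of footnote [1], here is a small Python sketch of a
reader in which an atom immediately followed by '(' becomes a call, while
an atom followed by space and then '(' stays a separate element.  The
function names (parse, _form, _items, _skip), the atom syntax, and the
nested-list output are assumptions made for this example, not part of my
actual grammar.

```python
import re

# Assumed atom syntax: identifiers or unsigned integers.
ATOM = re.compile(r"[A-Za-z_]\w*|[0-9]+")


def parse(text):
    """Parse one form and return nested lists of atom strings."""
    form, pos = _form(text, _skip(text, 0))
    if _skip(text, pos) != len(text):
        raise ValueError("trailing input")
    return form


def _skip(text, pos):
    # Advance past any whitespace and return the new position.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def _items(text, pos):
    # pos is just past '('; read whitespace-separated forms up to ')'.
    items = []
    pos = _skip(text, pos)
    while pos < len(text) and text[pos] != ")":
        item, pos = _form(text, pos)
        items.append(item)
        pos = _skip(text, pos)
    if pos >= len(text):
        raise ValueError("missing ')'")
    return items, pos + 1


def _form(text, pos):
    if pos < len(text) and text[pos] == "(":
        return _items(text, pos + 1)
    m = ATOM.match(text, pos)
    if m is None:
        raise ValueError("expected an atom or '(' at position %d" % pos)
    pos = m.end()
    if pos < len(text) and text[pos] == "(":
        # '(' directly after the atom: C-style call, f(a b) reads as (f a b).
        args, pos = _items(text, pos + 1)
        return [m.group()] + args, pos
    return m.group(), pos
```

With this reader, parse("(z f (g))") yields three elements, z, f and the
list (g), whereas parse("(z f(g) )") yields two, z and the call (f g).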

[2] Yes, this does make me reconsider from a language design perspective.
But, I suspect there are many such cases in other successful languages, and
so would prefer not to disqualify this feature based on the clumsiness or
difficulty of the grammar alone.

[3] One complication is ensuring that each run of optional space is
absorbed by exactly one production, either the one preceding it or the one
following it; otherwise the grammar becomes ambiguous.

On 10/7/07, Austin Hastings <Austin_Hastings at yahoo.com> wrote:
>
> I had a look at the generated code. It's a bug, IMO. I'm surprised there
> wasn't a warning emitted.
>
> CLIFFS:
> 1. It should have griped at you.
> 2. You need to change into the "antlr paradigm" to get around whitespace
> issues.
>
>
> The "reason" for the difference is that you are doing a combined
> parser/lexer. So the first case generated a parser that expects to see
> Token#1, Token#2, Token#1 on its input (assuming that OptSpace =
> Token#1, and ID = Token#2).
>
> The "inline" version generated code that handled the detection of
> optional spaces in place. As a result, it was expecting {do some work}
> Token #2 {do some work}.
>
> The second form was what the lexer was giving it, because your OptSpace
> could match an empty string. Given an empty string, the lexer has the
> choice of doing nothing, or generating OptSpace. It chooses to "do
> nothing" and get on with processing the "a".
>
> The approach "recommended" by Antlr seems to be to do a "positive
> recognition" of white space, and then throw it away or hide it. Hence
> you'll see definitions like
>
> WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
>
> This recognizes that WS is a token separate from other tokens (so the
> Lexer knows to stop working on those tokens and work on this one) but
> then once the token is recognized, the skip() chucks it in the trash.
>
> =Austin
>
>
>
> Justin Crites wrote:
> > This is the full grammar that fails to parse "a"
> > (MismatchedTokenException):
> >
> > expr     :    OptSpace ID OptSpace;
> > ID  :   ('a'..'z'|'A'..'Z')+ ;
> > OptSpace :    ' '*;
> >
> > This is the full grammar that succeeds:
> >
> > expr     :    ' '* ID ' '*;
> > ID  :   ('a'..'z'|'A'..'Z')+ ;
> >
> > These grammars are identical except that in the latter I have replaced
> > OptSpace with its definition in the rule "expr".
> >
> > In my mind these grammars should behave identically -- I would expect
> > the grammar specification to follow a "substitution rule" such that if
> > I have a rule A : X; then I can replace instances of "A" in other
> > rules with simply "X" and get identical behavior.  However, even
> > though OptSpace : ' '*; the rule
> >
> > expr : OptSpace ID OptSpace;
> >
> > behaves differently than:
> >
> > expr : ' '* ID ' '*;  // substituting ' '* for OptSpace
> >
> > Does this clarify my question?  Thank you.
> >
> > --
> > Justin Crites
>
>

-- 
Justin Crites

E-Mail: <mailto:jcrites at gmail.com>
IM: <aim:Xiphoris>
WWW: <http://xiphoris.com>