[antlr-interest] Handling optional spaces
Justin Crites
jcrites at gmail.com
Mon Oct 8 18:03:16 PDT 2007
Austin,
Thanks for taking the time to look at it and explain. I am not sure how I
should fit whitespace into my grammar, though. Hopefully your generosity
will continue long enough to allow me to explain :-)
I have a C-like grammar with standard expressions like "3 + (4 * 5) +
f(10)". What I want to do is allow _optional_ whitespace in many places in
expressions, but not everywhere.
Specifically, in a function call "f(x,y,z)" I do not want to allow space
between the name of a function and the opening parenthesis. For example,
"f(x)" is valid but "f (x)" is not. [1] Every other construct in the
language allows unlimited whitespace anywhere; the function call is the one
exception. [2]
By contrast, other constructs in my language, such as implicit function
calls, accept any amount of space: "3+1", "3 + 1", " 3 + 1", etc.
My initial approach was to include, explicitly, the OptionalWhitespace
production everywhere whitespace would be allowed. This complicates things
for a variety of reasons. [3]
For example, my grammar might look like:
expr : term OptSpace operator OptSpace expr ;
term : ... ;
call : id '(' OptSpace expr OptSpace ')' ; // no OptSpace after id
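
As a rough illustration of an alternative to threading OptSpace through every rule, here is a small Python sketch (not ANTLR, and not a tested solution for my grammar). It discards whitespace in the tokenizer, in the spirit of the skip() rule quoted below, but emits a distinct token for a '(' that directly follows an identifier. The token names (CALL_LPAREN and so on) and the tokenize function are hypothetical, invented for this example.

```python
import re

# Hypothetical token names; none of these come from the grammar in this thread.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)            # one or more whitespace characters, never empty
  | (?P<ID>[A-Za-z]+)
  | (?P<NUM>\d+)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<OP>[+*,])
""", re.VERBOSE)


def tokenize(text):
    """Drop whitespace, but mark a '(' that directly follows an identifier."""
    tokens = []
    pos = 0
    prev_kind = None  # kind of the previous raw token, whitespace included
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind == "LPAREN" and prev_kind == "ID":
            kind = "CALL_LPAREN"  # no space between the name and '(': a call
        if kind != "WS":
            tokens.append((kind, m.group()))  # whitespace never reaches the parser
        prev_kind = m.lastgroup
        pos = m.end()
    return tokens
```

Under these assumptions, tokenize("f(x)") puts a CALL_LPAREN after the ID, while tokenize("f (x)") produces a plain LPAREN, so a parser could accept only the former as a call. Inputs such as "3+1" and " 3  +  1 " tokenize identically, so no other rule needs to mention whitespace.
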
Do you have any advice for me on how I could accomplish the handling of
whitespace properly?
[1] I made this language design choice because I am trying to support
S-expression-style function calls. For example, f(a b) == (f a b). I believe
I can fit C-style calls and S-expressions together with some restrictions on
whitespace: (z f (g)) is different from (z f(g)).
[2] Yes, this does make me reconsider from a language design perspective.
But, I suspect there are many such cases in other successful languages, and
so would prefer not to disqualify this feature based on the clumsiness or
difficulty of the grammar alone.
[3] One complexity seems to be ensuring that the optional space is absorbed
by precisely the production following or preceding it, which otherwise leads
to ambiguity.
On 10/7/07, Austin Hastings <Austin_Hastings at yahoo.com> wrote:
>
> I had a look at the generated code. It's a bug, IMO. I'm surprised there
> wasn't a warning emitted.
>
> CLIFFS:
> 1. It should have griped at you.
> 2. You need to change into the "antlr paradigm" to get around whitespace
> issues.
>
>
> The "reason" for the difference is that you are doing a combined
> parser/lexer. So the first case generated a parser that expects to see
> Token#1, Token#2, Token#1 on its input (assuming that OptSpace =
> Token#1, and ID = Token#2).
>
> The "inline" version generated code that handled the detection of
> optional spaces in place. As a result, it was expecting {do some work}
> Token #2 {do some work}.
>
> The second form was what the lexer was giving it, because your OptSpace
> could match an empty string. Given an empty string, the lexer has the
> choice of doing nothing, or generating OptSpace. It chooses to "do
> nothing" and get on with processing the "a".
>
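
A toy Python model of the explanation above, for illustration only; it is not ANTLR's generated code. It assumes, as described, that the lexer never emits a zero-length token. The names lex, TOKEN_RULE_LEXER, INLINE_LEXER, parse_with_token_rule and parse_inline are hypothetical.

```python
import re

# Lexer rules for the failing grammar: OptSpace can match the empty string.
TOKEN_RULE_LEXER = [("OptSpace", re.compile(r" *")),
                    ("ID", re.compile(r"[A-Za-z]+"))]
# Lexer rules for the working grammar: ' ' is an implicit one-character token.
INLINE_LEXER = [("SPACE", re.compile(r" ")),
                ("ID", re.compile(r"[A-Za-z]+"))]


def lex(text, rules):
    """Toy lexer that never emits a zero-length token."""
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos:  # ignore empty matches ("do nothing")
                tokens.append(kind)
                pos = m.end()
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return tokens


def parse_with_token_rule(tokens):
    """Models expr : OptSpace ID OptSpace; which needs exactly three tokens."""
    return tokens == ["OptSpace", "ID", "OptSpace"]


def parse_inline(tokens):
    """Models expr : ' '* ID ' '*; where the parser itself loops over spaces."""
    i = 0
    while i < len(tokens) and tokens[i] == "SPACE":
        i += 1
    if i == len(tokens) or tokens[i] != "ID":
        return False
    i += 1
    while i < len(tokens) and tokens[i] == "SPACE":
        i += 1
    return i == len(tokens)
```

In this model, lexing "a" with the first rule set yields only an ID token, so parse_with_token_rule rejects it, while the inline form accepts "a" with or without surrounding spaces.
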
> The approach "recommended" by Antlr seems to be to do a "positive
> recognition" of white space, and then throw it away or hide it. Hence
> you'll see definitions like
>
> WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
>
> This recognizes that WS is a token separate from other tokens (so the
> Lexer knows to stop working on those tokens and work on this one) but
> then once the token is recognized, the skip() chucks it in the trash.
>
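
The "recognize, then discard" idea can be sketched outside ANTLR as well. The Python below is an illustration under stated assumptions, not a translation of any real ANTLR output: the whitespace pattern uses '+' so it can never match the empty string, and a matched run is dropped rather than handed to the parser. The function name tokenize_skipping_ws and the token pattern are made up for this example.

```python
import re

WS = re.compile(r"[ \t\r\n]+")                   # '+' means the match is never empty
TOKEN = re.compile(r"[A-Za-z]+|\d+|[-+*/(),]")   # hypothetical non-whitespace tokens


def tokenize_skipping_ws(text):
    """Recognize whitespace as its own token, then throw it away."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:
            pos = ws.end()  # analogous to skip(): matched, but not emitted
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens
```

With this approach "3+1" and "  3 \t+\n1 " produce the same token list, so the parser rules never need to mention whitespace.
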
> =Austin
>
>
>
> Justin Crites wrote:
> > This is the full grammar that fails to parse "a"
> > (MismatchedTokenException):
> >
> > expr : OptSpace ID OptSpace;
> > ID : ('a'..'z'|'A'..'Z')+ ;
> > OptSpace : ' '*;
> >
> > This is the full grammar that succeeds:
> >
> > expr : ' '* ID ' '*;
> > ID : ('a'..'z'|'A'..'Z')+ ;
> >
> > These grammars are identical except that in the latter I have replaced
> > OptSpace with its definition in the rule "expr".
> >
> > In my mind these grammars should behave identically -- I would expect
> > the grammar specification to follow a "substitution rule" such that if
> > I have a rule A : X; then I can replace instances of "A" in other
> > rules with simply "X" and get identical behavior. However, even
> > though OptSpace : ' '*; the rule
> >
> > expr : OptSpace ID OptSpace;
> >
> > behaves differently than:
> >
> > expr : ' '* ID ' '*; // substituting ' '* for OptSpace
> >
> > Does this clarify my question? Thank you.
> >
> > --
> > Justin Crites
>
>
--
Justin Crites
E-Mail: <mailto:jcrites at gmail.com>
IM: <aim:Xiphoris>
WWW: <http://xiphoris.com>