[antlr-interest] Handling optional spaces

Mon Oct 8 20:03:41 PDT 2007

Option 1 is wrong. Sorry. You'll need to check for the following paren as well.

Maybe

(ID '(')? {input.get($ID.index + 1).getChannel() == HIDDEN}? ID

instead?
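
To make the intent of that check concrete, here is a rough sketch in plain Python rather than in ANTLR. It assumes a token buffer that keeps off-channel tokens in place, and it treats an ID as a C-style call only when the very next token in the buffer is the paren. The Token class, the channel numbers, and classify_id are all made up for illustration; none of this is the ANTLR runtime API.

```python
from dataclasses import dataclass

# Channel numbers are illustrative assumptions, not taken from the runtime.
DEFAULT, HIDDEN = 0, 99


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT


def classify_id(tokens, i):
    """Classify the ID at tokens[i] as 'call' or 'plain'.

    'call'  : the very next token in the full buffer is '('.
    'plain' : anything else, including hidden whitespace before '('.
    """
    if tokens[i].type != "ID":
        raise ValueError("tokens[i] is not an ID")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if nxt is not None and nxt.channel == DEFAULT and nxt.type == "LPAREN":
        return "call"
    return "plain"


# f(g): the ID is directly followed by '('
direct = [Token("ID", "f"), Token("LPAREN", "("),
          Token("ID", "g"), Token("RPAREN", ")")]

# f (g): hidden whitespace sits between the ID and '('
spaced = [Token("ID", "f"), Token("WS", " ", HIDDEN), Token("LPAREN", "("),
          Token("ID", "g"), Token("RPAREN", ")")]
```

With these two buffers, classify_id(direct, 0) reports a call and classify_id(spaced, 0) reports a plain ID, which is the distinction the predicate has to make.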

=Austin

Austin Hastings wrote:
> Justin,
>
> What you are saying is that your language is 99% 
> whitespace-independent. That immediately tells me that the default 
> behavior should be to recognize whitespace separately and move it out 
> of the way.
>
> The requirement to know about whitespace separating an identifier and 
> an opening parenthesis can be satisfied in two ways:
>
> 1- Put the whitespace on the hidden channel, then check that channel 
> to see if whitespace came between two tokens:
>
> s_expression
>    : '(' ID s_args ')'
>    ;
>
> s_args
>    : (s_expression
>      | {input.LA(1) == ID && input.get(input.index() + 1).getChannel() == HIDDEN}? => ID   // **** LOOK HERE *****
>      | c_expression
>      )*
>    ;
>
> 2- Alternatively, you could treat the problem as being lexical. This 
> makes it simpler to write (meaning your hands don't get dirty with the 
> internals of Antlr) but gives a different result:
>
> ID : ('a'..'z') + ;
>
> FUNCTION_CALL: ID '(' ;
>
> You would then have to compensate in your parser for the 'unmatched' 
> parenthesis:
>
> c_function_call: FUNCTION_CALL args ')' ;
>
> =====
>
> The second approach is "easier" to do, in that you won't wind up 
> having to debug the generated java code when something goes awry. But 
> if you've been around C for very long you'll remember the old 
> "preprocessor tricks" that used to be used for things like 
> token-pasting, etc.
>
> How would you parse this code?
>
>  (z f/*look, no spaces! */(g))
>
> Option 1 treats the comment as a separator (space-type-token), while 
> option 2 recognizes that there are three tokens (ID, COMMENT, LPAREN). 
> If you want that to invoke the f(g) C function, you could change 
> option 1 to look for a space token, rather than just any off-channel 
> token. The change to option 2 is obvious (insert optional COMMENT) and 
> hideous.
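
A quick way to see what the lexical approach does with that input is a toy regex tokenizer. This is only a sketch in Python, not an ANTLR lexer: it picks the first alternative that matches rather than using ANTLR's prediction, and the COMMENT, WS, LPAREN and RPAREN names and patterns are assumptions added for the illustration. FUNCTION_CALL is listed before ID so that "f(" is taken as one token when the paren follows immediately.

```python
import re

# Order matters here: FUNCTION_CALL is tried before ID so "f(" wins over "f".
TOKEN_SPEC = [
    ("COMMENT",       r"/\*.*?\*/"),
    ("WS",            r"[ \t\r\n]+"),
    ("FUNCTION_CALL", r"[a-z]+\("),
    ("ID",            r"[a-z]+"),
    ("LPAREN",        r"\("),
    ("RPAREN",        r"\)"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def token_types(text):
    """Return the token type names for text, dropping whitespace tokens."""
    types = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":   # whitespace is recognized, then discarded
            types.append(m.lastgroup)
        pos = m.end()
    return types
```

Under these assumptions "(z f(g))" yields a FUNCTION_CALL token, while both "(z f (g))" and the comment example yield a plain ID followed later by LPAREN. The comment case is the one that would need the extra optional COMMENT handling.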
>
> =Austin
>
> Justin Crites wrote:
>> Austin,
>>
>> Thanks for taking the time to look at it and explain.  I am not sure 
>> how I should fit whitespace into my grammar, though.  Hopefully your 
>> generosity will continue long enough to allow me to explain :-)
>>
>> I have a C-like grammar with standard expressions like
>> "3 + (4 * 5) + f(10)".  What I want to do is allow _optional_
>> whitespace in many places in expressions, but not everywhere.
>>
>> Specifically, in a function call "f(x,y,z)" I do not want to allow 
>> space between the name of a function and the opening parenthesis.  
>> For example, "f(x)" is valid but "f (x)" is not. [1]  Most other 
>> constructs in the language allow unlimited whitespace anywhere.  In 
>> fact, this is true for _every_ other construct except that one place. 
>> [2]
>>
>> This is different from other constructs in my language like implicit 
>> function calls where any amount of space is valid ("3+1", "3 + 1",
>> " 3    +     1", etc).
>>
>> My initial approach was to include, explicitly, the 
>> OptionalWhitespace production everywhere whitespace would be 
>> allowed.  This complicates things for a variety of reasons. [3]
>>
>> For example, my grammar might look like:
>>
>>      expr : term OptSpace operator OptSpace expr
>>      term : ...
>>      call : id '(' OptSpace expr OptSpace ')'      // no OptSpace after id
>>
>> Do you have any advice for me on how I could accomplish the handling 
>> of whitespace properly?
>>
>>
>>
>> [1] The reason I have made this language design choice is because I 
>> am trying to support S-expression-style function calls.  For example, 
>> f(a b) == (f a b).  I believe I can fit C-style calls and 
>> S-expressions together with some restrictions on whitespace.
>> (z f (g)) is different than (z f(g) )
>>
>> [2] Yes, this does make me reconsider from a language design 
>> perspective.  But, I suspect there are many such cases in other 
>> successful languages, and so would prefer not to disqualify this 
>> feature based on the clumsiness or difficulty of the grammar alone.
>>
>> [3] One complexity seems to be ensuring that the optional space is 
>> absorbed by precisely the production following or preceding it, which 
>> otherwise leads to ambiguity.
>>
>> On 10/7/07, Austin Hastings <Austin_Hastings at yahoo.com> wrote:
>>
>>     I had a look at the generated code. It's a bug, IMO. I'm surprised
>>     there wasn't a warning emitted.
>>
>>     CLIFFS:
>>     1. It should have griped at you.
>>     2. You need to change into the "antlr paradigm" to get around
>>     whitespace issues.
>>
>>
>>     The "reason" for the difference is that you are doing a combined
>>     parser/lexer. So the first case generated a parser that expects to
>>     see Token#1, Token#2, Token#1 on its input (assuming that OptSpace =
>>     Token#1, and ID = Token#2).
>>
>>     The "inline" version generated code that handled the detection of
>>     optional spaces in place. As a result, it was expecting {do some
>>     work} Token#2 {do some work}.
>>
>>     The second form was what the lexer was giving it, because your
>>     OptSpace could match an empty string. Given an empty string, the
>>     lexer has the choice of doing nothing, or generating OptSpace. It
>>     chooses to "do nothing" and get on with processing the "a".
>>
>>     The approach "recommended" by Antlr seems to be to do a "positive
>>     recognition" of white space, and then throw it away or hide it.
>>     Hence you'll see definitions like
>>
>>     WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
>>
>>     This recognizes that WS is a token separate from other tokens (so
>>     the Lexer knows to stop working on those tokens and work on this
>>     one) but then once the token is recognized, the skip() chucks it in
>>     the trash.
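
The "recognize it, then throw it away" idea can be sketched outside ANTLR as well. The following Python is only an illustration under my own assumptions (the lex function and the three patterns are made up, and this is not how the generated lexer is structured). It shows a whitespace rule that needs at least one character and is then discarded, next to an OptSpace-style pattern that matches the empty string and so never gives the lexer anything to emit.

```python
import re

ID = re.compile(r"[A-Za-z]+")
WS = re.compile(r"[ \t\r\n]+")   # one or more: never matches the empty string
OPT_SPACE = re.compile(r" *")    # zero or more: happily matches nothing


def lex(text):
    """Return ID tokens, recognizing whitespace positively and dropping it."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:
            pos = ws.end()       # recognized, then thrown away (the skip step)
            continue
        ident = ID.match(text, pos)
        if ident is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(("ID", ident.group()))
        pos = ident.end()
    return tokens
```

Here lex("a") and lex("  a  ") both produce the single ID token, so the parser never has to mention whitespace. OPT_SPACE.match("a") succeeds but consumes nothing, which is the empty-match situation described above.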
>>
>>     =Austin
>>
>>
>>
>>     Justin Crites wrote:
>>     > This is the full grammar that fails to parse "a"
>>     > (MismatchedTokenException):
>>     >
>>     > expr     :    OptSpace ID OptSpace;
>>     > ID  :   ('a'..'z'|'A'..'Z')+ ;
>>     > OptSpace :    ' '*;
>>     >
>>     > This is the full grammar that succeeds:
>>     >
>>     > expr     :    ' '* ID ' '*;
>>     > ID  :   ('a'..'z'|'A'..'Z')+ ;
>>     >
>>     > These grammars are identical except that in the latter I have
>>     > replaced OptSpace with its definition in the rule "expr".
>>     >
>>     > In my mind these grammars should behave identically -- I would
>>     > expect the grammar specification to follow a "substitution rule"
>>     > such that if I have a rule A : X; then I can replace instances of
>>     > "A" in other rules with simply "X" and get identical behavior.
>>     > However, even though OptSpace : ' '*; the rule
>>     >
>>     > expr : OptSpace ID OptSpace
>>     >
>>     > behaves differently than:
>>     >
>>     > expr : ' '* ID ' '*;  // substituting ' '* for OptSpace
>>     >
>>     > Does this clarify my question?  Thank you.
>>     >
>>     > --
>>     > Justin Crites
>>     >
>>     
>>
>>
>>
>>
>> -- 
>> Justin Crites
>>
>> E-Mail: <mailto:jcrites at gmail.com>
>> IM: <aim:Xiphoris>
>> WWW: <http://xiphoris.com>