[antlr-interest] Handling optional spaces
Austin Hastings
Austin_Hastings at Yahoo.com
Mon Oct 8 20:03:41 PDT 2007
Option 1 as written is wrong. Sorry. The predicate also needs to check for the following paren.
Maybe
(ID '(')? {input.get($ID.index + 1).getChannel() == HIDDEN}? ID
instead?
=Austin
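
To make the corrected check concrete, here is a minimal hand-written sketch in plain Java. It does not use the ANTLR runtime: the Tok class, the tokenize() helper and the HIDDEN constant are stand-ins invented for illustration. It assumes off-channel tokens stay in the token buffer, so an ID starts a function call only when the very next buffer slot holds the '('.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN = 99; // stand-in value for the hidden channel

    enum Type { ID, LPAREN, RPAREN, WS }

    static final class Tok {
        final Type type;
        final String text;
        final int channel;
        Tok(Type type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    // Tokenize lowercase identifiers, parentheses and whitespace.
    // Whitespace stays in the buffer but is tagged with the HIDDEN channel.
    static List<Tok> tokenize(String src) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int start = i;
            if (c == '(') {
                out.add(new Tok(Type.LPAREN, "(", DEFAULT_CHANNEL));
                i++;
            } else if (c == ')') {
                out.add(new Tok(Type.RPAREN, ")", DEFAULT_CHANNEL));
                i++;
            } else if (Character.isWhitespace(c)) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
                out.add(new Tok(Type.WS, src.substring(start, i), HIDDEN));
            } else if (c >= 'a' && c <= 'z') {
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') i++;
                out.add(new Tok(Type.ID, src.substring(start, i), DEFAULT_CHANNEL));
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    // True when slot i holds an ID and slot i + 1 holds '(',
    // which means no hidden token sits between the two.
    static boolean isCallStart(List<Tok> toks, int i) {
        return toks.get(i).type == Type.ID
                && i + 1 < toks.size()
                && toks.get(i + 1).type == Type.LPAREN;
    }

    public static void main(String[] args) {
        System.out.println(isCallStart(tokenize("(z f(g))"), 3));  // true
        System.out.println(isCallStart(tokenize("(z f (g))"), 3)); // false
    }
}
```

In both inputs slot 3 is the identifier f. Only the first input has the paren in slot 4, so only that one is treated as a C-style call.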
Austin Hastings wrote:
> Justin,
>
> What you are saying is that your language is 99%
> whitespace-independent. That immediately tells me that the default
> behavior should be to recognize whitespace separately and move it out
> of the way.
>
> The requirement to know about whitespace separating an identifier and
> an opening parenthesis can be satisfied in two ways:
>
> 1- Put the whitespace on the hidden channel, then check that channel
> to see if whitespace came between two tokens:
>
> s_expression
> : '(' ID s_args ')'
> ;
>
> s_args
> : (s_expression
> | {input.LA(1) == ID
>     && input.get(input.index() + 1).getChannel() == HIDDEN}? => ID // **** LOOK HERE *****
> | c_expression
> )*
> ;
>
> 2- Alternatively, you could treat the problem as being lexical. This
> makes it simpler to write (meaning your hands don't get dirty with the
> internals of Antlr) but gives a different result:
>
> ID : ('a'..'z') + ;
>
> FUNCTION_CALL: ID '(' ;
>
> You would then have to compensate in your parser for the 'unmatched'
> parenthesis:
>
> c_function_call: FUNCTION_CALL args ')' ;
>
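
The lexical approach above can be sketched by hand in plain Java. This is not ANTLR output, and the lex() helper and its "NAME:text" strings are invented for illustration. It assumes whitespace is simply skipped and that an identifier immediately followed by '(' becomes one FUNCTION_CALL token that swallows the paren.

```java
import java.util.ArrayList;
import java.util.List;

final class FunctionCallLexerSketch {
    // Returns token names; ID and FUNCTION_CALL carry their text after a colon.
    static List<String> lex(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // whitespace is recognized and thrown away
            } else if (c == '(') {
                out.add("LPAREN");
                i++;
            } else if (c == ')') {
                out.add("RPAREN");
                i++;
            } else if (c >= 'a' && c <= 'z') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') i++;
                String name = src.substring(start, i);
                if (i < src.length() && src.charAt(i) == '(') {
                    i++; // the paren is consumed as part of FUNCTION_CALL
                    out.add("FUNCTION_CALL:" + name);
                } else {
                    out.add("ID:" + name);
                }
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("(z f(g))"));
        // [LPAREN, ID:z, FUNCTION_CALL:f, ID:g, RPAREN, RPAREN]
        System.out.println(lex("(z f (g))"));
        // [LPAREN, ID:z, ID:f, LPAREN, ID:g, RPAREN, RPAREN]
    }
}
```

The first token stream has one more RPAREN than LPAREN, which is the 'unmatched' parenthesis the parser rule has to absorb.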
> =====
>
> The second approach is "easier" to do, in that you won't wind up
> having to debug the generated java code when something goes awry. But
> if you've been around C for very long you'll remember the old
> "preprocessor tricks" that used to be used for things like
> token-pasting, etc.
>
> How would you parse this code?
>
> (z f/*look, no spaces! */(g))
>
> Option 1 treats the comment as a separator (space-type-token), while
> option 2 recognizes that there are three tokens (ID, COMMENT, LPAREN).
> If you want that to invoke the f(g) C function,
> you could change option 1 to look for a space token, rather than just
> any off-channel token. The change to option 2 is obvious (insert
> optional COMMENT) and hideous.
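
The change to option 1 described above (look for a space token rather than any off-channel token) might look like this in a hand-written Java sketch. Nothing here comes from the ANTLR runtime: Tok, tokenize() and isCallStart() are hypothetical helpers, and the sketch assumes both WS and COMMENT tokens stay in the buffer as off-channel tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class SpaceVersusCommentSketch {
    enum Type { ID, LPAREN, RPAREN, WS, COMMENT }

    static final class Tok {
        final Type type;
        final String text;
        Tok(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Tokenize identifiers, parentheses, whitespace and /* ... */ comments.
    static List<Tok> tokenize(String src) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int start = i;
            if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);
                if (end < 0) throw new IllegalArgumentException("unterminated comment");
                i = end + 2;
                out.add(new Tok(Type.COMMENT, src.substring(start, i)));
            } else if (c == '(') {
                out.add(new Tok(Type.LPAREN, "("));
                i++;
            } else if (c == ')') {
                out.add(new Tok(Type.RPAREN, ")"));
                i++;
            } else if (Character.isWhitespace(c)) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
                out.add(new Tok(Type.WS, src.substring(start, i)));
            } else if (c >= 'a' && c <= 'z') {
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') i++;
                out.add(new Tok(Type.ID, src.substring(start, i)));
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    // True when the ID in slot i reaches '(' by stepping over comments only.
    // Any WS token in between makes it a plain identifier.
    static boolean isCallStart(List<Tok> toks, int i) {
        if (toks.get(i).type != Type.ID) return false;
        int j = i + 1;
        while (j < toks.size() && toks.get(j).type == Type.COMMENT) j++;
        return j < toks.size() && toks.get(j).type == Type.LPAREN;
    }

    public static void main(String[] args) {
        System.out.println(isCallStart(tokenize("(z f/*look, no spaces! */(g))"), 3)); // true
        System.out.println(isCallStart(tokenize("(z f /*c*/(g))"), 3));                // false
    }
}
```

With this variant a comment no longer separates f from its paren, but a single space still does.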
>
> =Austin
>
> Justin Crites wrote:
>> Austin,
>>
>> Thanks for taking the time to look at it and explain. I am not sure
>> how I should fit whitespace into my grammar, though. Hopefully your
>> generosity will continue long enough to allow me to explain :-)
>>
>> I have a C-like grammar with standard expressions like "3 + (4 * 5) +
>> f(10)". What I want to do is allow _optional_ whitespace in many
>> places in expressions, but not everywhere.
>>
>> Specifically, in a function call "f(x,y,z)" I do not want to allow
>> space between the name of a function and the opening parenthesis.
>> For example, "f(x)" is valid but "f (x)" is not. [1] Most other
>> constructs in the language allow unlimited whitespace anywhere. In
>> fact, this is true for _every_ other construct except that one place.
>> [2]
>>
>> This is different from other constructs in my language like implicit
>> function calls where any amount of space is valid ("3+1", "3 + 1", "
>> 3 + 1", etc).
>>
>> My initial approach was to include, explicitly, the
>> OptionalWhitespace production everywhere whitespace would be
>> allowed. This complicates things for a variety of reasons. [3]
>>
>> For example, my grammar might look like:
>>
>> expr : term OptSpace operator OptSpace expr
>> term : ...
>> call : id '(' OptSpace expr OptSpace ')' // no OptSpace after id
>>
>> Do you have any advice for me on how I could accomplish the handling
>> of whitespace properly?
>>
>>
>>
>> [1] The reason I have made this language design choice is because I
>> am trying to support S-expression-style function calls. For example,
>> f(a b) == (f a b). I believe I can fit C-style calls and
>> S-expressions together with some restrictions on whitespace. (z f
>> (g)) is different than (z f(g) )
>>
>> [2] Yes, this does make me reconsider from a language design
>> perspective. But, I suspect there are many such cases in other
>> successful languages, and so would prefer not to disqualify this
>> feature based on the clumsiness or difficulty of the grammar alone.
>>
>> [3] One complexity seems to be ensuring that the optional space is
>> absorbed by precisely the production following or preceding it, which
>> otherwise leads to ambiguity.
>>
>> On 10/7/07, Austin Hastings <Austin_Hastings at yahoo.com> wrote:
>>
>> I had a look at the generated code. It's a bug, IMO. I'm surprised
>> there wasn't a warning emitted.
>>
>> CLIFFS:
>> 1. It should have griped at you.
>> 2. You need to change into the "antlr paradigm" to get around
>> whitespace issues.
>>
>>
>> The "reason" for the difference is that you are doing a combined
>> parser/lexer. So the first case generated a parser that expects to
>> see Token#1, Token#2, Token#1 on its input (assuming that OptSpace =
>> Token#1, and ID = Token#2).
>>
>> The "inline" version generated code that handled the detection of
>> optional spaces in place. As a result, it was expecting {do some
>> work} Token#2 {do some work}.
>>
>> The second form was what the lexer was giving it, because your
>> OptSpace could match an empty string. Given an empty string, the
>> lexer has the choice of doing nothing, or generating OptSpace. It
>> chooses to "do nothing" and get on with processing the "a".
>>
>> The approach "recommended" by Antlr seems to be to do a "positive
>> recognition" of white space, and then throw it away or hide it.
>> Hence you'll see definitions like
>>
>> WS : (' ' | '\t' | '\r' | '\n')+ { skip(); } ;
>>
>> This recognizes that WS is a token separate from other tokens (so
>> the Lexer knows to stop working on those tokens and work on this
>> one) but then once the token is recognized, the skip() chucks it in
>> the trash.
>>
>> =Austin
>>
>>
>>
>> Justin Crites wrote:
>> > This is the full grammar that fails to parse "a"
>> > (MismatchedTokenException):
>> >
>> > expr : OptSpace ID OptSpace;
>> > ID : ('a'..'z'|'A'..'Z')+ ;
>> > OptSpace : ' '*;
>> >
>> > This is the full grammar that succeeds:
>> >
>> > expr : ' '* ID ' '*;
>> > ID : ('a'..'z'|'A'..'Z')+ ;
>> >
>> > These grammars are identical except that in the latter I have
>> > replaced OptSpace with its definition in the rule "expr".
>> >
>> > In my mind these grammars should behave identically -- I would
>> > expect the grammar specification to follow a "substitution rule"
>> > such that if I have a rule A : X; then I can replace instances of
>> > "A" in other rules with simply "X" and get identical behavior.
>> > However, even though OptSpace : ' '*; the rule
>> >
>> > expr : OptSpace ID OptSpace
>> >
>> > behaves differently than:
>> >
>> > expr : ' '* ID ' '*; // substituting ' '* for OptSpace
>> >
>> > Does this clarify my question? Thank you.
>> >
>> > --
>> > Justin Crites
>> >
>>
>>
>>
>>
>>
>> --
>> Justin Crites
>>
>> E-Mail: <mailto:jcrites at gmail.com>
>> IM: <aim:Xiphoris>
>> WWW: <http://xiphoris.com>
>>
>
>
>