[antlr-interest] Optional spaces question

Thomas Thomsen thomas at t-t.dk
Thu Jan 19 02:22:02 PST 2012


Thanks a lot Eric for your detailed answer. I have been looking through the
generated code in the debugger, but I easily get lost in the method calls
and iterations. The problem is that my grammar is already quite large and
complex by now. But I could of course isolate my current problem in a small
testing grammar. I think I'll do that: thanks for your advice.

I am already generating and using an AST tree, and I like your suggestion
about directing the whitespace tokens onto the AST tree. I would still
prefer to put the whitespace tokens on the hidden channel, so that they do
not clutter the parser grammar, but if they could somehow be "revived" in
the AST tree... This reminds me of the article "Preserving Whitespace
During Translation" (http://www.antlr.org/article/whitespace/index.html),
where the parser copies the hidden tokens into the tree nodes (actually
into special tree nodes of type CommonASTWithHiddenTokens). Since I would
also very much like to be able to translate between versions of my DSL
language (so that users can auto-translate if I change the syntax), this
might be the way to go?
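
To make the idea concrete, here is a minimal sketch of carrying hidden tokens along with the visible ones, so that the original text can be reproduced later. It is plain Python, not ANTLR code: the Token and Node classes, the channel numbers, and the helper names attach_hidden and to_text are all made up for illustration, and a flat list of nodes stands in for the tree. It also stores only the hidden tokens that come before each visible token, which is a simplification.

```python
from dataclasses import dataclass, field

# Channel numbers are arbitrary choices for this sketch.
DEFAULT, HIDDEN = 0, 1


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT


@dataclass
class Node:
    token: Token
    # Hidden tokens (e.g. whitespace) that appeared just before this token.
    hidden_before: list = field(default_factory=list)


def attach_hidden(tokens):
    """Build one Node per visible token, each carrying the hidden tokens before it.

    Returns (nodes, trailing), where trailing holds hidden tokens after the
    last visible token.
    """
    nodes, pending = [], []
    for tok in tokens:
        if tok.channel == HIDDEN:
            pending.append(tok)
        else:
            nodes.append(Node(tok, pending))
            pending = []
    return nodes, pending


def to_text(nodes, trailing=()):
    """Reproduce the original text, whitespace included, from the nodes."""
    parts = []
    for node in nodes:
        parts.extend(t.text for t in node.hidden_before)
        parts.append(node.token.text)
    parts.extend(t.text for t in trailing)
    return "".join(parts)
```

With something like this, the whitespace stays out of the parser rules but is still available to whatever walks the nodes afterwards.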

Thanks also for your tips regarding stream rewriting and syntactic
predicates.

Best regards,

-Thomas



2012/1/18 Eric <researcher0x00 at gmail.com>

>
>
> On Wed, Jan 18, 2012 at 9:08 AM, Eric <researcher0x00 at gmail.com> wrote:
>
>>
>>
>> On Wed, Jan 18, 2012 at 8:17 AM, Thomas Thomsen <thomas at t-t.dk> wrote:
>>
>>> I am pretty new to ANTLR, doing a DSL language. I like ANTLR a lot, but I
>>> am struggling with a problem regarding optional whitespace. My problem is
>>> that I need to distinguish between "f(x)" and "f  (x)" -- note the space
>>> between "f" and "(x)" in the latter (I am putting whitespace on the hidden
>>> channel, and I want to continue to do that). The former is a function call,
>>> the latter something different.
>>>
>>> I found a post on this list from 2007 ("Handling optional spaces") which
>>> addresses the exact same question. One suggestion was to have the lexer
>>> absorb the left parenthesis if there is no space in between:
>>>
>>> ID : ('a'..'z') + ;
>>> FUNCTION_CALL: ID '(' ;
>>>
>>> Then the lexer would return "f(" as a FUNCTION_CALL token if there is no
>>> space in between. This works, but it is not too pretty and complicates
>>> things elsewhere in my code. The other suggestion was to check the hidden
>>> channel for whitespace-tokens by means of Java code (actually C# in my
>>> case). But since I am not yet too familiar with the inner workings of
>>> ANTLR, this scares me a bit.
>>>
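
The hidden-channel check mentioned above can be sketched without any ANTLR internals. The snippet below is plain Python with a made-up Token tuple and a hypothetical helper is_function_call. It assumes the full token list, hidden tokens included, can be indexed directly; how to get at that list in the ANTLR runtime is not shown here.

```python
from collections import namedtuple

# Hypothetical token shape for this sketch; not the ANTLR Token class.
Token = namedtuple("Token", "type text channel")
DEFAULT, HIDDEN = 0, 1


def is_function_call(tokens, i):
    """True if tokens[i] is an ID immediately followed by '('.

    The list includes hidden tokens, so any whitespace between the ID and
    the parenthesis shows up as tokens[i + 1] and makes the check fail.
    """
    if tokens[i].type != "ID" or i + 1 >= len(tokens):
        return False
    return tokens[i + 1].type == "LPAREN"
```

For "f(x)" the token after the ID is the parenthesis, so the check succeeds. For "f  (x)" it is a hidden whitespace token, so the check fails.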
>>> So I was thinking of a third strategy: Have a simple preprocessor look
>>> through the input file, and if a letter is directly followed by a left
>>> parenthesis, put some special character in between. So the preprocessor
>>> transforms "f(x)" into "f&(x)", where "&" is a (glue) character not used
>>> elsewhere in the grammar. And afterwards, it would be much easier to
>>> distinguish between "f&(x)" and "f  (x)" in ANTLR.
>>>
>>> Is this question or strategy completely stupid for some reason?
>>>
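
A rough sketch of such a preprocessor, in plain Python. The glue character "&" comes from the description above; the function name add_glue and the choice of ASCII letters as "a letter" are assumptions made for illustration.

```python
import re

GLUE = "&"  # assumed not to be used anywhere else in the grammar


def add_glue(source: str) -> str:
    """Insert GLUE between a letter and a '(' that directly follows it."""
    # \1 puts the matched letter back; GLUE and '(' are then appended.
    return re.sub(r"([A-Za-z])\(", r"\1" + GLUE + "(", source)
```

For example, add_glue("f(x)") gives "f&(x)", while "f  (x)" is left unchanged. One limitation of a purely textual pass like this is that it also rewrites matches inside string literals and comments, if the language has them.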
>>
>> Personally, I think avoiding the inner workings of ANTLR because it is
>> scary is a bad trait to pick up.
>>
>> When I started using ANTLR I spent lots of hours learning how it worked
>> by using the debugger. While I am not an expert at everything ANTLR, I
>> don't fear it.
>>
>> One thing I have learned is that while the lexer and parser are probably
>> capable of determining if an input is acceptable, that doesn't mean that
>> the lexer and parser should do all of the work of accepting the input.
>>
>> If you think of accepting an input as
>> 1. Use the lexer to convert the input to tokens.
>> 2. Use the parser to accept unambiguous input.
>> 3. Use tree manipulation to validate and accept valid input.
>> then you can let the parser pass input that may not be valid but that is
>> unambiguous onto the next step and sort out the meaning and validity there.
>>
>> For me, once the input is converted to a tree, it is easier to analyze
>> and manipulate because you can
>> 1. search backward and forward
>> 2. change the structure of the branches
>> 3. change the info in the nodes
>> 4. add and remove nodes and branches
>>
>> Hope this sheds some light on the problem.
>>
>> Eric
>>
>>
>
> Another option, though I don't use it, would be the stream rewrite API. You
> should be able to pick up the tokens from the lexer with the space not on the
> hidden channel. Then, when you see the pattern ID SPACE LEFT_PAREN, you could
> rewrite it to SOMETHING_DIFFERENT before passing it on to the parser. If you
> don't want the parser to see a SPACE token, you could also use the stream
> rewrite to remove them.
>
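
As an illustration of the token-rewriting idea, here is a small sketch in plain Python. It does not use the ANTLR stream rewrite API; the Token tuple and the rewrite function are made up. The pattern is taken to be an ID, a SPACE, and then the opening parenthesis, as in "f  (x)". As one possible reading of "rewrite it to SOMETHING_DIFFERENT", the sketch retags the ID and keeps the parenthesis. It also assumes the lexer emits a run of spaces as a single SPACE token.

```python
from collections import namedtuple

# Hypothetical token shape for this sketch; not the ANTLR Token class.
Token = namedtuple("Token", "type text")


def rewrite(tokens):
    """Retag ID in 'ID SPACE LEFT_PAREN' and drop every SPACE token."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (tok.type == "ID"
                and i + 2 < len(tokens)
                and tokens[i + 1].type == "SPACE"
                and tokens[i + 2].type == "LEFT_PAREN"):
            # Mark the identifier so the parser can tell the two forms apart.
            out.append(Token("SOMETHING_DIFFERENT", tok.text))
            i += 2  # skip the ID and the SPACE; LEFT_PAREN is handled next
        elif tok.type == "SPACE":
            i += 1  # the parser never sees SPACE tokens
        else:
            out.append(tok)
            i += 1
    return out
```

The parser then receives a stream with no SPACE tokens, in which "f(x)" starts with ID and "f  (x)" starts with SOMETHING_DIFFERENT.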
> Additionally, once the tree is available after the parser, one can create
> tables, cross references and other data structures to assist in the final
> goal. There is no requirement limiting one to using only the tree.
>
> One way to make a grammar easier to write is to make the rules less
> stringent. If you think of an input value as a dog, but don't know how to
> define a dog using grammar rules, try creating a rule for animals and then
> sort out whether the animal is a dog once you have the tree.
>
> Or in your case, I would avoid putting the space onto the hidden channel
> and pass the space all the way back to the tree and then sort it out there.
>
> A third option might be to try using Syntactic Predicates, but again I
> suspect that you will have to pass the SPACE to the parser, which requires
> the parser rules to deal with spaces everywhere.
>
> Eric
>
>>>
>>> Best regards, and thanks for all the good work on ANTLR,
>>>
>>> -Thomas Thomsen
>>>
>>>
>>
>>
>