[antlr-interest] combine tokens in rewrite rule

Fri Nov 9 16:15:09 PST 2007

However, what a person asks for, and what they need are not necessarily 
the same thing. ;-) 

It doesn't have to be an imaginary token, but it usually is, because 
there won't be a lexer-defined token to use with the rewrite, given that 
you are parsing the construct rather than lexing it. 

So, you are parsing the elements of some complicated reference or 
variable or class etc, and you need '.' in other places in your parser, 
and you also need to look at the individual pieces of the id. When you 
send the reference to the tree parser, you want to tag it with something 
to introduce it as a reference. Hence you would usually use an imaginary 
token as the place holder for the reference and have it introduce the 
individual pieces of the reference, which can then be looked up to find 
out if they are enumerations, objects and so on, such that the tree 
parser deals with them accordingly. 

If you pass the whole thing in as one token from the lexer, then you 
will probably end up splitting the token text anyway, so you can look up 
the context. However, if you never need to do this, then a lexical 
solution probably does work for you. Trying to apply context within the 
lexer rules though is definitely not something you should be doing by 
choice.

Now keep in mind that there are always 18 ways to skin a cat, and that's 
just the way I do such things; it's whatever floats your boat in the end 
:-)

Check the wiki or book for the rewrite syntax, but you can set the text 
of a token when you rewrite it.

So, your options are:

1) Lexical, if there is no need to do anything with the different 
components (maybe you are formatting and don't need to know what it is, 
for instance);
2) Declare a local String variable and, as you get each ID text, append 
it, then rewrite with that as the token text. You would do this to get 
simpler lexer rules or to avoid some lexing ambiguity, say, because 
putting the text back together is otherwise kind of redundant (I seem to 
think that this was why the first question received the answer it did);
3) Rewrite to the place-holding imaginary token followed by each of the 
name components. If you can work out the type or context at this stage 
in the parse, then you might write one of a number of imaginaries, but 
if you have to parse the whole thing before you can work out types, then 
you would use one token and resolve the types in the next phase.
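
Option 2 might be sketched roughly as follows. This is only an 
illustration, assuming ANTLR v3 with the Java target; the grammar name 
QualifiedName, the imaginary token QID, the rule name qid and the labels 
head and part are made up for the example and do not come from the 
referenced thread.

```
grammar QualifiedName;

options {
    output = AST;
}

tokens {
    QID;    // imaginary token: the lexer never produces it
}

qid
@init {
    // Local accumulator for the component text (Java target assumed).
    StringBuilder name = new StringBuilder();
}
    : head=ID { name.append($head.text); }
      ( DOT part=ID { name.append('.').append($part.text); } )*
      // Rewrite to a single node whose token text is the joined name.
      -> QID[name.toString()]
    ;

DOT : '.' ;
ID  : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* ;
WS  : (' ' | '\t' | '\r' | '\n' | '\f')+ { $channel = HIDDEN; } ;
```

Because DOT stays an ordinary token here, the parser is still free to 
use '.' elsewhere, which is the main thing this buys you over matching 
the whole dotted name in the lexer.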

So, for option 3:

tokens
{
	REFERENCE;
}

id:
   i+=ID (DOT i+=ID)* -> ^(REFERENCE $i+)
;

Or perhaps, if you have context, something like:

id:
   v=ID (DOT r+=ID)*

	-> {lookup($v) == OBJECT}? ^(OBJECT $v $r*)
	// -> ... and so on for the other types
;
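
If the full dotted name is still wanted later, the tree parser can put 
it back together from a ^(REFERENCE ID+) tree. A minimal sketch, again 
assuming ANTLR v3 with the Java target; the grammar name 
ReferenceWalker, the tokenVocab name Reference, the rule name and the 
label part are hypothetical:

```
tree grammar ReferenceWalker;

options {
    tokenVocab   = Reference;    // hypothetical name of the parser grammar
    ASTLabelType = CommonTree;
}

reference returns [String name]
@init {
    // Collects the text of each ID child, separated by dots.
    StringBuilder sb = new StringBuilder();
}
    : ^( REFERENCE
         ( part=ID
           {
               if (sb.length() > 0) sb.append('.');
               sb.append($part.text);
           }
         )+
       )
      { $name = sb.toString(); }
    ;
```

The same place is where each component could be looked up for its type 
before deciding what to do with it.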

Jim

-----Original Message-----
From: Curtis Clauson [mailto:NOSPAM at TheSnakePitDev.com] 
Sent: Friday, November 09, 2007 3:32 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] combine tokens in rewrite rule

I understand and agree with AST sub-trees for fully qualified 
identifiers in complex languages. However, that is not what either the 
referenced thread's poster or this thread's original poster asked for.
They asked for one fully qualified name as one token in one node, which 
would be a lexical solution. I did not want to try to redefine their 
language.

The solution you gave in the referenced thread specified
  1 Declare an "imaginary" token in the tokens {} section.
  2 Accumulate the text of the individual IDs and dots ('.') in some 
    unspecified manner.
  3 Rewrite the rule as the imaginary token set to the concatenated text 
    in some unspecified manner.

What confuses me is why use an "imaginary" token, precisely how steps 2 
and 3 are performed, and how such a solution would differ from using 
lexical fragments as I demonstrated.

Could you provide a concrete example grammar? You got me all curious 
now. :)

Thanks
-- Curtis

Jim Idle wrote:
> But that is a lexical solution. When '.' is used in many places it is 
> quite often a better bet to have the parser determine the pieces of a 
> valid reference and in many cases you need the individual components 
> because the meanings change according to context.
> 
> For instance x.y could be an enumeration, or a property reference or 
> something else.
> 
> All that needs to be done is to take the .text of each element of the 
> ID and concatenate them. To be honest, I would probably not even do 
> that in the parser, but in the tree parser, where you probably have the 
> contextual information available (and may well not have in the 
> parser). Then the rewrite would be something like ->^(IDEXPR $ids+ ) or 
> some such.
> 
> Jim
> 
> -----Original Message-----
> From: Curtis Clauson [mailto:NOSPAM at TheSnakePitDev.com] 
> Sent: Friday, November 09, 2007 1:56 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] combine tokens in rewrite rule
> 
> I must admit, I am somewhat confused by the answer given in the 
> referenced thread. Doesn't the use of fragment lexer rules solve this?
> 
> For example, the grammar below will parse this input
> <<
> name
> name.subName1
> name.subName1.subName2.subName3
>  >>
> into a tree that has three ID nodes under the root nil node, each with 
> the complete qualified ID text as a single token. Is this what you 
> mean, or have I missed something?
> 
> (tested with ANTLR v3.0.1 and ANTLRWorks v1.1.4)
> ----------
> grammar ABer1;
> 
> options {
>      output = AST;
> }
> 
> 
> unit: ID+;
> 
> 
> ID: UnqualifiedID ('.' UnqualifiedID)*;
> WS: (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;};
> 
> 
> fragment UnqualifiedID     : UnqualifiedIDFirst (UnqualifiedIDRest)*;
> fragment UnqualifiedIDFirst: 'a'..'z' | 'A'..'Z' | '_';
> fragment UnqualifiedIDRest : 'a'..'z' | 'A'..'Z' | '_' | '0'..'9';
> ----------
> 
> I hope that helps.
> -- Curtis
> 
> 
> Adrian Ber wrote:
>> Hi all,
>>
>> I want to find a way to combine multiple tokens in a single one.
>> I've searched the archive and found this thread: 
>> http://www.antlr.org/pipermail/antlr-interest/2007-January/019161.html.
>> Does any of you have a short sample code on how to do it?