[antlr-interest] combine tokens in rewrite rule
Jim Idle
jimi at temporal-wave.com
Fri Nov 9 16:15:09 PST 2007
However, what a person asks for and what they need are not necessarily
the same thing. ;-)
It doesn't have to be an imaginary token, but it usually is, because
there won't be a lexer-defined token to use with the rewrite, given that
you are parsing the construct rather than lexing it.
So, you are parsing the elements of some complicated reference,
variable, class and so on; you need '.' in other places in your parser,
and you also need to look at the individual pieces of the id. When you
send the reference to the tree parser, you want to tag it with something
that introduces it as a reference. Hence you would usually use an
imaginary token as the placeholder for the reference and have it
introduce the individual pieces of the reference, which can then be
looked up to find out whether they are enumerations, objects and so on,
so that the tree parser deals with them accordingly.
If you pass the whole thing in as one token from the lexer, then you
will probably end up splitting the token text anyway, so you can look up
the context. However, if you never need to do this, then a lexical
solution probably does work for you. Trying to apply context within the
lexer rules though is definitely not something you should be doing by
choice.
Now keep in mind that there are always 18 ways to skin a cat, and that's
just the way I do such things; it's whatever floats your boat in the end
:-)
Check the wiki or book for the rewrite syntax, but you can set the text
of a token when you rewrite it.
So, your options are:
1) Lexical, if there is no need to do anything with the different
components (maybe you are formatting and don't need to know what it is,
for instance).
2) Declare a local String variable and, as you get each ID text, append
it, then rewrite with that as the token text. You would do this to keep
the lexer rules simpler or to avoid some lexing ambiguity, say;
otherwise, putting the text back together is kind of redundant (I seem
to think that this was why the first question received the answer it
did).
3) Rewrite the placeholder imaginary token followed by each of the name
components. If you can work out the type or context at this stage in the
parse, then you might write one of a number of imaginaries, but if you
have to parse the whole thing before you can work out types, then you
would use one token and resolve the types in the next phase.
So:
tokens
{
    REFERENCE;    // imaginary token: never produced by the lexer
}
id:
    i+=ID (DOT i+=ID)* -> ^(REFERENCE $i+)
    ;
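Or perhaps, if you have context (and a lookup() helper of your own that
returns a token type), something like
id:
    v=ID (DOT r+=ID)*
    -> {lookup($v) == OBJECT}? ^(OBJECT $v $r*)
    // -> and so on for the other kinds, ending with an unpredicated default:
    -> ^(REFERENCE $v $r*)
    ;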
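Option 2 above could be sketched roughly as follows. This is only a
minimal sketch, assuming the Java target; the grammar name QualifiedName,
the imaginary token QID and the labels a and b are made up for
illustration and do not come from the earlier thread. The rule collects
the text of each ID in a local StringBuilder and then rewrites to a
single node whose text is the concatenation.

```antlr
grammar QualifiedName;

options {
    output = AST;
}

tokens {
    QID;    // hypothetical imaginary token: no lexer rule produces it
}

unit: qid+ ;

// Accumulate the pieces of the name, then emit ONE node carrying the
// joined text instead of a subtree of IDs.
qid
@init { StringBuilder sb = new StringBuilder(); }
    : a=ID { sb.append($a.text); }
      ( DOT b=ID { sb.append('.').append($b.text); } )*
      -> QID[sb.toString()]
    ;

ID  : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* ;
DOT : '.' ;
WS  : (' ' | '\t' | '\r' | '\n' | '\f')+ { $channel = HIDDEN; } ;
```

For the input name.subName1.subName2 this should yield a single QID node
with the text "name.subName1.subName2", while DOT stays an ordinary token
that other parser rules can still use. If line and column information
matters later on, the form QID[$a, sb.toString()] should, as far as I
know, copy it from the first ID token.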
Jim
-----Original Message-----
From: Curtis Clauson [mailto:NOSPAM at TheSnakePitDev.com]
Sent: Friday, November 09, 2007 3:32 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] combine tokens in rewrite rule
I understand and agree with AST sub-trees for fully qualified
identifiers in complex languages. However, that is not what either the
poster of the referenced thread or this thread's original poster asked
for.
They asked for one fully qualified name as one token in one node, which
would be a lexical solution. I did not want to try to redefine their
language.
The solution you gave in the referenced thread specified:
1. Declare an "imaginary" token in the tokens {} section.
2. Accumulate the text of the individual IDs and dots ('.') in some
unspecified manner.
3. Rewrite the rule as the imaginary token set to the concatenated text
in some unspecified manner.
What confuses me is why use an "imaginary" token, precisely how steps 2
and 3 are performed, and how such a solution would differ from using
lexical fragments as I demonstrated.
Could you provide a concrete example grammar? You got me all curious
now. :)
Thanks
-- Curtis
Jim Idle wrote:
> But that is a lexical solution. When '.' is used in many places it is
> quite often a better bet to have the parser determine the pieces of a
> valid reference and in many cases you need the individual components
> because the meanings change according to context.
>
> For instance x.y could be an enumeration, or a property reference or
> something else.
>
> All that needs to be done is to take the .text of each element of the
> ID and concatenate them. To be honest, I would probably not even do
> that in the parser, but in the tree parser, where you probably have
> the contextual information available (and may well not have in the
> parser). Then the rewrite would be something like ->^(IDEXPR $ids+ )
> or some such.
>
> Jim
>
> -----Original Message-----
> From: Curtis Clauson [mailto:NOSPAM at TheSnakePitDev.com]
> Sent: Friday, November 09, 2007 1:56 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] combine tokens in rewrite rule
>
> I must admit, I am somewhat confused by the answer given in the
> referenced thread. Doesn't the use of fragment lexer rules solve this?
>
> For example, the grammar below will parse this input
> <<
> name
> name.subName1
> name.subName1.subName2.subName3
> >>
> into a tree that has three ID nodes under the root nil node, each with
> the complete qualified ID text as a single token. Is this what you
> mean, or have I missed something?
>
> (tested with AntLR v3.0.1 and ANTLRWorks v1.1.4)
> ----------
> grammar ABer1;
>
> options {
> output = AST;
> }
>
>
> unit: ID+;
>
>
> ID: UnqualifiedID ('.' UnqualifiedID)*;
> WS: (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;};
>
>
> fragment UnqualifiedID : UnqualifiedIDFirst (UnqualifiedIDRest)*;
> fragment UnqualifiedIDFirst: 'a'..'z' | 'A'..'Z' | '_';
> fragment UnqualifiedIDRest : 'a'..'z' | 'A'..'Z' | '_' | '0'..'9';
> ----------
>
> I hope that helps.
> -- Curtis
>
>
> Adrian Ber wrote:
>> Hi all,
>>
>> I want to find a way to combine multiple tokens in a single one.
>> I've searched the archive and found this thread:
>> http://www.antlr.org/pipermail/antlr-interest/2007-January/019161.html.
>> Does any of you have a short sample code on how to do it?