[antlr-interest] Fragments and setText appear to not work at all - not even with code from the book
Harald Mueller
harald_m_mueller at gmx.de
Wed Nov 21 01:47:50 PST 2007
Hello - me again - aren't "fragment tokens" created incorrectly as well?
Here's a very tiny addition to the grammar, which seems to show that actually the Text property is "highly ignored" - but maybe I misread the code:
If you replace the line
: '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'
from the book's example with
: '{' ( x=CODE[stripCurlies] | ~('{'|'}') )* '}'
you get the following code (Java - C# is very similar):
switch (alt1) {
case 1 :
// SetTextTrouble.g:39:13: x= CODE[stripCurlies]
{
int xStart26 = getCharIndex();
mCODE(stripCurlies);
x = new CommonToken(input, Token.INVALID_TOKEN_TYPE, Token.DEFAULT_CHANNEL, xStart26, getCharIndex()-1);
}
break;
The new CommonToken gets
* the input (CharStream)
* xStart26 - which is the position in the stream before calling mCODE()
and
* getCharIndex()-1 - which is presumably the current position.
Internally, CommonToken simply stores these parameters and reads the text from the input lazily in its Text property (in C# - I did not look into the Java code here). So any call to setText(...) ($text = ... in C#) inside the fragment would never show up in x, would it? I (at least ...) would expect it to ...
:-(
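To make the suspected mechanism concrete, here is a minimal, self-contained Java sketch. It does not use the ANTLR runtime: IndexToken and ToyLexer are hypothetical stand-ins I made up for illustration, under the assumption that the label token is built purely from the start and stop indices while setText(...) only changes a separate text field in the lexer.

```java
public class LazyTokenTextDemo {

    /** Hypothetical token: remembers only the input and an inclusive index range. */
    static final class IndexToken {
        private final String input;
        private final int start;
        private final int stop;

        IndexToken(String input, int start, int stop) {
            this.input = input;
            this.start = start;
            this.stop = stop;
        }

        /** The text is computed lazily from the input, never from the lexer's text. */
        String getText() {
            return input.substring(start, stop + 1);
        }
    }

    /** Hypothetical toy lexer with one recursive rule resembling CODE. */
    static final class ToyLexer {
        private final String input;
        private int pos = 0;
        private String text; // what a setText(...) call in a rule action overwrites

        ToyLexer(String input) {
            this.input = input;
        }

        int getCharIndex() {
            return pos;
        }

        String getOverriddenText() {
            return text;
        }

        /** Matches '{' ( CODE | ~('{'|'}') )* '}' and then strips the curlies. */
        void mCODE() {
            int start = pos;
            expect('{');
            while (pos < input.length() && input.charAt(pos) != '}') {
                if (input.charAt(pos) == '{') {
                    mCODE();
                } else {
                    pos++;
                }
            }
            expect('}');
            // Plays the role of setText(...) in the rule action.
            text = input.substring(start + 1, pos - 1);
        }

        private void expect(char c) {
            if (pos >= input.length() || input.charAt(pos) != c) {
                throw new IllegalStateException("expected '" + c + "' at " + pos);
            }
            pos++;
        }
    }

    /** Mimics the generated code for x=CODE[...]: the token is built from indices only. */
    static String labelText(String input) {
        ToyLexer lexer = new ToyLexer(input);
        int xStart = lexer.getCharIndex();
        lexer.mCODE();
        IndexToken x = new IndexToken(input, xStart, lexer.getCharIndex() - 1);
        return x.getText();
    }

    /** The text the rule action set, which the index-based token never sees. */
    static String overriddenText(String input) {
        ToyLexer lexer = new ToyLexer(input);
        lexer.mCODE();
        return lexer.getOverriddenText();
    }

    public static void main(String[] args) {
        System.out.println(labelText("{b c}"));
        System.out.println(overriddenText("{b c}"));
    }
}
```

In this sketch the label's text still contains the curlies, although the rule action stripped them, because the token never consults the overridden text.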
Regards
Harald
P.S. I leave my original posting attached below for reference - I hope that's ok.
> Hi -
>
> I'm quite sure that the code generated for (lexer) fragments is wrong. Not
> even the example on page 105 in Terence's book works as one would assume
> (but maybe we have to argue about what someone *would* assume). At least the
> behavior is totally different from ANTLR2, and there is no easy way to
> rewrite certain ANTLR2 lexer grammars as ANTLR3.
>
> Here is the example from p.105 extended to be runnable in Java:
>
> // -----------------------------------------------------------------
> grammar SetTextTrouble;
>
> @parser::header {
> import org.antlr.runtime.*;
> }
>
> @parser::members {
> private static void run(String s) throws Exception {
> System.out.print(s + " ==> ");
> ANTLRStringStream input = new ANTLRStringStream(s);
> SetTextTroubleLexer lexer = new SetTextTroubleLexer(input);
> CommonTokenStream tokens = new CommonTokenStream(lexer);
> SetTextTroubleParser p = new SetTextTroubleParser(tokens);
> p.a();
> }
>
> public static void main(String[] args) throws Exception {
> run("{ a { b c }}");
> // run("{ {2}}");
> // run("{{2}}");
>
> }
> }
>
> // Parser
>
> a : m=MAIN { System.out.println(m.getText()); };
>
> // BEGIN verbatim copy from book p.105
> fragment
> CODE[boolean stripCurlies]
> : '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'
> {
> if ( stripCurlies ) {
> setText(getText().substring(1, getText().length()));
> //C#: $text = $text.Substring(1, getText().length()-1);
> }
> }
> ;
> // Another rule would invoke CODE via CODE[false] or CODE[true].
> // END verbatim copy from book p.105
>
> MAIN : CODE[true]
> ;
>
> // -----------------------------------------------------------------
>
> The result of this is:
>
> { a { b c }} ==> a { b c }
>
> One sees that the curlies are NOT stripped from the inner fragment - i.e.,
> the call to setText is a no-op. One can guess the reason if one looks into
> the generated code: The recursive call is
>
> switch (alt1) {
> case 1 :
> // SetTextTrouble.g:37:13: CODE[stripCurlies]
> {
> mCODE(stripCurlies);
>
> }
> break;
>
> No-one cares for the fact that the text has changed, it seems. I have some
> examples (of more complex grammars) where one can see that the text of the
> fragment is put into a temporary token simply using an index from BEFORE
> the fragment call and the character position (getCharIndex()?) after the
> call - so each change of the fragment's text appears to be completely
> bypassed.
>
> For reasons I do not know, the whole thing works on the outermost level -
> even though the code looks like this:
>
> if ( token==null && ruleNestingLevel==1 ) {
>
> emit(_type,_line,_charPosition,_channel,_start,getCharIndex()-1);
> }
>
> Also here, nothing about text ... but probably emit internally honors
> changes to the complete symbol's text.
>
> The whole problem is very unfortunate because in ANTLR2, the following
> worked flawlessly as expected:
>
> protected
> NAME
> : '\''!
> (NAME_CHARACTER)*
> (GENERIC_TAIL!)? // We cut off the "generic tail", e.g. 'Stack`1' becomes 'Stack'
> '\''!
> ;
>
> protected
> METHODNAME
> : // empty
> | ':'
> ':'
> ( NAME | DIRECTIVE )
> ;
>
> FULLNAME
> : n1:NAME // simple name or namespace name
> ('.' n2:NAME)? // classname if namespaced name
> n3:NESTEDNAME // nested classnames
> n4:METHODNAME // method name (DIRECTIVE if .ctor, .cctor etc.)
> {
> $setToken(CreateNameToken(n1,n2,n3,n4));
> }
> ;
>
>
> Here, "protected" NAME is a fragment which wants to pass up a stripped
> text - using the exclamation marks !, it was easy to strip off some characters
> from the fragment. I have no idea how to write this (straightforwardly -
> not with any hacks using internal variables) in ANTLR3.
>
> Regards
> Harald M.
>