[antlr-interest] Fragments and setText appear to not work at all - not even with code from the book

Wed Nov 21 01:47:50 PST 2007

Hello - me again - aren't "fragment tokens" created wrongly as well?

Here's a very tiny addition to the grammar, which seems to show that actually the Text property is "highly ignored" - but maybe I misread the code:

If you replace the line

     : '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'

from the book's example with

     : '{' ( x=CODE[stripCurlies] | ~('{'|'}') )* '}'

you get the following code (Java - C# is very similar):

                switch (alt1) {
                case 1 :
                    // SetTextTrouble.g:39:13: x= CODE[stripCurlies]
                    {
                    int xStart26 = getCharIndex();
                    mCODE(stripCurlies);
                    x = new CommonToken(input, Token.INVALID_TOKEN_TYPE, Token.DEFAULT_CHANNEL, xStart26, getCharIndex()-1);

                    }
                    break;

The new CommonToken gets
* the input (CharStream)
* xStart26 - which is the position in the stream before calling mCODE()
and
* getCharIndex()-1 - which is presumably the index of the last character consumed.

Internally, CommonToken simply stores these parameters and reads the text from the input lazily, in the Text property (in C# - I did not look into the Java code here). So any calls to setText(...) ($text = ... in C#) would never show up in x, would they? - but I (at least ...) would expect them to ...
:-(
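
To make the suspicion concrete, here is a minimal stand-alone sketch in plain Java. It does not use the ANTLR runtime: LazyToken, SketchLexer and matchCode are hypothetical names invented for this illustration, and the real CommonToken and generated lexer may differ in detail. It only models the pattern described above, i.e. the text override is stored on the lexer, while the labelled token is built from start/stop indices and reads the raw input.

```java
class LazyTokenSketch {

    /** Hypothetical stand-in for a token that stores only indices into the input. */
    static final class LazyToken {
        private final String input;
        private final int start; // index of first character, inclusive
        private final int stop;  // index of last character, inclusive

        LazyToken(String input, int start, int stop) {
            this.input = input;
            this.start = start;
            this.stop = stop;
        }

        /** Text is computed lazily from the raw input; lexer-side overrides are never consulted. */
        String getText() {
            return input.substring(start, stop + 1);
        }
    }

    /** Hypothetical stand-in for the lexer: setText only changes the lexer's own state. */
    static final class SketchLexer {
        final String input;
        int pos = 0;
        String text; // what setText(...) last stored

        SketchLexer(String input) {
            this.input = input;
        }

        void setText(String t) {
            text = t;
        }

        /**
         * Mimics the CODE fragment with stripCurlies == true.
         * Assumes the input at pos starts with '{' and the curlies are balanced.
         */
        void matchCode() {
            int start = pos;
            int depth = 0;
            do {
                char c = input.charAt(pos++);
                if (c == '{') depth++;
                else if (c == '}') depth--;
            } while (depth > 0);
            setText(input.substring(start + 1, pos - 1)); // strip the outer curlies
        }
    }

    /** Returns { text stored on the lexer, text seen through the labelled token }. */
    static String[] demo(String s) {
        SketchLexer lexer = new SketchLexer(s);
        int xStart = lexer.pos;                                // like xStart26 = getCharIndex()
        lexer.matchCode();                                     // like mCODE(stripCurlies)
        LazyToken x = new LazyToken(s, xStart, lexer.pos - 1); // like new CommonToken(input, ..., xStart26, getCharIndex()-1)
        return new String[] { lexer.text, x.getText() };
    }

    public static void main(String[] args) {
        String[] r = demo("{ b c }");
        System.out.println("lexer text: [" + r[0] + "]");
        System.out.println("token text: [" + r[1] + "]");
    }
}
```

In this model the lexer ends up holding " b c " while x still yields "{ b c }", which is the behavior I suspect in the generated code. Whether the real runtime behaves the same way would have to be checked against its source.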

Regards
Harald

P.S. I leave my original posting attached below for reference - I hope that's ok.

> Hi - 
> 
> I'm quite sure that the code generated for (lexer) fragments is wrong. Not
> even the example on page 105 in Terence's book works as one would assume
> (but maybe we have to argue about what someone *would* assume). At least the
> behavior is totally different from ANTLR2, and there is no easy way to
> rewrite certain ANTLR2 lexer grammars as ANTLR3.
> 
> Here is the example from p.105 extended to be runnable in Java:
> 
> // -----------------------------------------------------------------
> grammar SetTextTrouble;
> 
> @parser::header {
>   import org.antlr.runtime.*;
> }
> 
> @parser::members {
>   private static void run(String s) throws Exception {
>     System.out.print(s + " ==> ");
>     ANTLRStringStream input = new ANTLRStringStream(s);
>     SetTextTroubleLexer lexer = new SetTextTroubleLexer(input);
>     CommonTokenStream tokens = new CommonTokenStream(lexer);   
>     SetTextTroubleParser p = new SetTextTroubleParser(tokens);
>     p.a();
>   }
> 
>   public static void main(String[] args) throws Exception {
>     run("{ a { b c }}");
>     // run("{ {2}}");
>     // run("{{2}}");
>     
>   }
> }
> 
> // Parser
> 
> a : m=MAIN { System.out.println(m.getText()); };
> 
> // BEGIN verbatim copy from book p.105
> fragment
> CODE[boolean stripCurlies]
>     : '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'
>         {
>         if ( stripCurlies ) {
>             setText(getText().substring(1, getText().length()));
>             //C#: $text = $text.Substring(1, getText().length()-1);
>         }
>         }
>     ;
> // Another rule would invoke CODE via CODE[false] or CODE[true].
> // END verbatim copy from book p.105
> 
> MAIN : CODE[true]
>      ;
> 
> // -----------------------------------------------------------------
> 
> The result of this is:
> 
> { a { b c }} ==> a { b c }
> 
> One sees that the curlies are NOT stripped from the inner fragment - i.e.,
> the call to setText is a no-op. One can guess the reason if one looks into
> the generated code: The recursive call is
> 
>                switch (alt1) {
>             	case 1 :
>             	    // SetTextTrouble.g:37:13: CODE[stripCurlies]
>             	    {
>             	    mCODE(stripCurlies); 
>             	    
>             	    }
>             	    break;
>  
> No-one cares for the fact that the text has changed, it seems. I have some
> examples (of more complex grammars) where one can see that the text of the
> fragment is put into a temporary token simply using an index from BEFORE
> the fragment call and the character position (getCharIndex()?) after the
> call - so each change of the fragment's text appears to be completely
> bypassed.
> 
> For reasons I do not know, the whole thing works on the outermost level -
> even though the code looks like this:
> 
>                     if ( token==null && ruleNestingLevel==1 ) {
>                         emit(_type,_line,_charPosition,_channel,_start,getCharIndex()-1);
>                     }
> 
> Also here, nothing about text ... but probably emit internally honors
> changes to the complete symbol's text.
> 
> The whole problem is very unfortunate because in ANTLR2, the following
> worked flawlessly as expected:
> 
> protected
> NAME
> 	: '\''!
> 	  (NAME_CHARACTER)*
> 	  (GENERIC_TAIL!)?   // We cut off the "generic tail", e.g. 'Stack`1' becomes 'Stack'
> 	  '\''!
> 	;
> 
> protected
> METHODNAME
>    :          // empty
>    | ':'
>      ':' 
>      ( NAME | DIRECTIVE )
>    ;
> 
> FULLNAME
>    : n1:NAME            // simple name or namespace name
>      ('.' n2:NAME)?     // classname if namespaced name
>      n3:NESTEDNAME      // nested classnames
>      n4:METHODNAME      // method name (DIRECTIVE if .ctor, .cctor etc.)
>      {
>        $setToken(CreateNameToken(n1,n2,n3,n4));
>      } 
>    ;
>    
> 
> Here, "protected" NAME is a fragment which wants to pass up a stripped
> text - using the exclamation marks !, it was easy to strip off some characters
> from the fragment. I have no idea how to write this (straightforwardly -
> not with any hacks using internal variables) in ANTLR3.
> 
> Regards
> Harald M.
> 
