[antlr-interest] Fragments and setText appear to not work at all - not even with code from the book

Wed Nov 21 02:06:39 PST 2007

My copy of the book has something totally different on page 105, so I 
can't comment on your book reference. Anyway, what you describe is not a 
bug: it was a deliberate design decision for the sake of performance. By 
default, creating a token sets pointers into the character stream for the 
beginning and end of the token; it does not copy a string into the token. 
Calling setText() at the end of a token definition overrides this 
behavior. In other words, a setText() call that you make in a fragment 
rule is irrelevant.
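
To make the pointer idea concrete, here is a minimal Java sketch. It is 
not the ANTLR runtime code: SketchToken and LazyTokenTextSketch are 
hypothetical names, and the model assumes that an enclosing token is built 
purely from start/stop indices, as described above. It only illustrates 
why a text change recorded for a nested fragment would not reach the 
token that is finally emitted, while a change made on that token does.

```java
// Hypothetical, simplified model of index-based token text.
// These classes are illustrative only; they are not the ANTLR runtime API.
final class SketchToken {
    private final String input;  // shared character buffer, never copied
    private final int start;     // index of the first character (inclusive)
    private final int stop;      // index of the last character (inclusive)
    private String override;     // stays null until setText() is called

    SketchToken(String input, int start, int stop) {
        this.input = input;
        this.start = start;
        this.stop = stop;
    }

    // Replaces the buffer-derived text for this token only.
    void setText(String text) {
        this.override = text;
    }

    // Returns the override if present, else the slice of the buffer.
    String getText() {
        return override != null ? override : input.substring(start, stop + 1);
    }
}

final class LazyTokenTextSketch {
    public static void main(String[] args) {
        String input = "{ a { b c }}";

        // Temporary token for the nested block "{ b c }" at indices 4..10.
        SketchToken inner = new SketchToken(input, 4, 10);
        inner.setText(" b c ");  // the change lives on this object only

        // The enclosing token is built from indices alone, so it never
        // sees the override stored on the inner token.
        SketchToken outer = new SketchToken(input, 0, input.length() - 1);
        System.out.println("[" + outer.getText() + "]");  // [{ a { b c }}]

        // An override on the enclosing token itself does take effect.
        String full = outer.getText();
        outer.setText(full.substring(1, full.length() - 1));
        System.out.println("[" + outer.getText() + "]");  // [ a { b c }]
    }
}
```

In this model the only way to change the emitted text is to set it on the 
enclosing token, which corresponds to calling setText() in the 
non-fragment rule.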

PS. Thanks for noticing the problem with the date on my machine. It 
seems to reset itself randomly. Plays havoc with anti-virus updates.

Harald Mueller wrote:
> Hi - 
>
> I'm quite sure that the code generated for (lexer) fragments is wrong. Not even the example on page 105 of Terence's book works as one would assume (but maybe we have to argue about what someone *would* assume). At least the behavior is totally different from ANTLR2, and there is no easy way to rewrite certain ANTLR2 lexer grammars for ANTLR3.
>
> Here is the example from p.105 extended to be runnable in Java:
>
> // -----------------------------------------------------------------
> grammar SetTextTrouble;
>
> @parser::header {
>   import org.antlr.runtime.*;
> }
>
> @parser::members {
>   private static void run(String s) throws Exception {
>     System.out.print(s + " ==> ");
>     ANTLRStringStream input = new ANTLRStringStream(s);
>     SetTextTroubleLexer lexer = new SetTextTroubleLexer(input);
>     CommonTokenStream tokens = new CommonTokenStream(lexer);   
>     SetTextTroubleParser p = new SetTextTroubleParser(tokens);
>     p.a();
>   }
>
>   public static void main(String[] args) throws Exception {
>     run("{ a { b c }}");
>     // run("{ {2}}");
>     // run("{{2}}");
>     
>   }
> }
>
> // Parser
>
> a : m=MAIN { System.out.println(m.getText()); };
>
> // BEGIN verbatim copy from book p.105
> fragment
> CODE[boolean stripCurlies]
>     : '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'
>         {
>         if ( stripCurlies ) {
>             setText(getText().substring(1, getText().length()));
>             //C#: $text = $text.Substring(1, getText().length()-1);
>         }
>         }
>     ;
> // Another rule would invoke CODE via CODE[false] or CODE[true].
> // END verbatim copy from book p.105
>
> MAIN : CODE[true]
>      ;
>
> // -----------------------------------------------------------------
>
> The result of this is:
>
> { a { b c }} ==> a { b c }
>
> One sees that the curlies are NOT stripped from the inner fragment - i.e., the call to setText is a no-op. One can guess the reason if one looks into the generated code: The recursive call is
>
>                switch (alt1) {
>             	case 1 :
>             	    // SetTextTrouble.g:37:13: CODE[stripCurlies]
>             	    {
>             	    mCODE(stripCurlies); 
>             	    
>             	    }
>             	    break;
>  
> No-one cares for the fact that the text has changed, it seems. I have some examples (of more complex grammars) where one can see that the text of the fragment is put into a temporary token simply using an index from BEFORE the fragment call and the character position (getCharIndex()?) after the call - so each change of the fragment's text appears to be completely bypassed.
>
> For reasons I do not know, the whole thing works on the outermost level - even though the code looks like this:
>
>                     if ( token==null && ruleNestingLevel==1 ) {
>                         emit(_type,_line,_charPosition,_channel,_start,getCharIndex()-1);
>                     }
>
> Also here, nothing about text ... but probably emit internally honors changes to the complete symbol's text.
>
> The whole problem is very unfortunate because in ANTLR2, the following worked flawlessly as expected:
>
> protected
> NAME
> 	: '\''!
> 	  (NAME_CHARACTER)*
> 	  (GENERIC_TAIL!)?   // We cut off the "generic tail", e.g. 'Stack`1' becomes 'Stack'
> 	  '\''!
> 	;
>
> protected
> METHODNAME
>    :          // empty
>    | ':'
>      ':' 
>      ( NAME | DIRECTIVE )
>    ;
>
> FULLNAME
>    : n1:NAME            // simple name or namespace name
>      ('.' n2:NAME)?     // classname if namespaced name
>      n3:NESTEDNAME      // nested classnames
>      n4:METHODNAME      // method name (DIRECTIVE if .ctor, .cctor etc.)
>      {
>        $setToken(CreateNameToken(n1,n2,n3,n4));
>      } 
>    ;
>    
>
> Here, "protected" NAME is a fragment which wants to pass up a stripped text - using the exclamation marks !, it was easy to strip off some characters from the fragment. I have no idea how to write this (straightforwardly - not with any hacks using internal variables) in ANTLR3.
>
> Regards
> Harald M.
>
>