[antlr-interest] Fragments and setText appear to not work at all - not even with code from the book

Harald Mueller harald_m_mueller at gmx.de
Wed Nov 21 00:53:44 PST 2007


Hi - 

I'm quite sure that the code generated for (lexer) fragments is wrong. Not even the example on page 105 of Terence's book works as one would assume (though maybe we have to argue about what someone *would* assume). At the very least, the behavior is totally different from ANTLR2, and there is no easy way to rewrite certain ANTLR2 lexer grammars for ANTLR3.

Here is the example from p.105 extended to be runnable in Java:

// -----------------------------------------------------------------
grammar SetTextTrouble;

@parser::header {
  import org.antlr.runtime.*;
}

@parser::members {
  private static void run(String s) throws Exception {
    System.out.print(s + " ==> ");
    ANTLRStringStream input = new ANTLRStringStream(s);
    SetTextTroubleLexer lexer = new SetTextTroubleLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);   
    SetTextTroubleParser p = new SetTextTroubleParser(tokens);
    p.a();
  }

  public static void main(String[] args) throws Exception {
    run("{ a { b c }}");
    // run("{ {2}}");
    // run("{{2}}");
    
  }
}

// Parser

a : m=MAIN { System.out.println(m.getText()); };

// BEGIN verbatim copy from book p.105
fragment
CODE[boolean stripCurlies]
    : '{' ( CODE[stripCurlies] | ~('{'|'}') )* '}'
        {
        if ( stripCurlies ) {
            setText(getText().substring(1, getText().length()));
            //C#: $text = $text.Substring(1, getText().length()-1);
        }
        }
    ;
// Another rule would invoke CODE via CODE[false] or CODE[true].
// END verbatim copy from book p.105

MAIN : CODE[true]
     ;

// -----------------------------------------------------------------

The result of this is:

{ a { b c }} ==> a { b c }

One sees that the curlies are NOT stripped from the inner fragment, i.e., the call to setText there appears to be a no-op. One can guess the reason by looking at the generated code. The recursive call is:

    switch (alt1) {
        case 1 :
            // SetTextTrouble.g:37:13: CODE[stripCurlies]
            {
            mCODE(stripCurlies);

            }
            break;
 
Nothing here takes into account that the text has changed, it seems. I have some examples (from more complex grammars) where one can see that the text of the fragment is put into a temporary token built simply from the index BEFORE the fragment call and the character position (getCharIndex()?) after the call, so every change to the fragment's text appears to be completely bypassed.

For reasons I do not know, the whole thing works on the outermost level - even though the code looks like this:

                    if ( token==null && ruleNestingLevel==1 ) {
                        emit(_type,_line,_charPosition,_channel,_start,getCharIndex()-1);
                    }

Here too, nothing refers to the text, but emit probably honors changes to the complete symbol's text internally.
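
A toy model may make the reported output more concrete. The sketch below is not ANTLR's generated code. It assumes that getText() and setText() act on a single text field for the whole token under construction, and that a labeled fragment reference takes its text from the input indices before and after the call. The class and method names (FragmentTextModel, code, labeledCode, mainToken) are invented for illustration, and the input is assumed to have balanced curlies.

```java
public class FragmentTextModel {
    private final String input;
    private int pos = 0;          // index of the next character to read
    private int tokenStart = 0;   // start index of the token being matched
    private String textOverride;  // set by setText(); null means "use the input range"

    FragmentTextModel(String input) { this.input = input; }

    // Assumption: the text always belongs to the whole token, never to one fragment.
    String getText() {
        return textOverride != null ? textOverride : input.substring(tokenStart, pos);
    }

    void setText(String t) { textOverride = t; }

    // Hypothetical stand-in for the fragment rule CODE[stripCurlies].
    void code(boolean strip) {
        pos++;                                        // consume '{'
        while (input.charAt(pos) != '}') {
            if (input.charAt(pos) == '{') code(strip); // recursive fragment call
            else pos++;                                // any other character
        }
        pos++;                                        // consume '}'
        if (strip) {
            // Same action as in the grammar: drops the first character only.
            setText(getText().substring(1, getText().length()));
        }
    }

    // Stand-in for a labeled fragment reference: its text is rebuilt from the
    // indices before and after the call, so setText() inside has no effect on it.
    String labeledCode(boolean strip) {
        int start = pos;
        code(strip);
        return input.substring(start, pos);
    }

    // Stand-in for the outermost rule MAIN : CODE[true].
    String mainToken() {
        tokenStart = pos;
        textOverride = null;
        code(true);
        return getText();
    }

    public static void main(String[] args) {
        System.out.println("[" + new FragmentTextModel("{ a { b c }}").mainToken() + "]");
        System.out.println("[" + new FragmentTextModel("{x}").labeledCode(true) + "]");
    }
}
```

Under these assumptions the model prints [a { b c }] for the test input, the same text as in the result above, and the labeled call returns {x} unchanged even though setText was called inside the fragment.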

The whole problem is very unfortunate because in ANTLR2, the following worked flawlessly as expected:

protected
NAME
	: '\''!
	  (NAME_CHARACTER)*
	  (GENERIC_TAIL!)?   // We cut off the "generic tail", e.g. 'Stack`1' becomes 'Stack'
	  '\''!
	;

protected
METHODNAME
   :          // empty
   | ':'
     ':' 
     ( NAME | DIRECTIVE )
   ;

FULLNAME
   : n1:NAME            // simple name or namespace name
     ('.' n2:NAME)?     // classname if namespaced name
     n3:NESTEDNAME      // nested classnames
     n4:METHODNAME      // method name (DIRECTIVE if .ctor, .cctor etc.)
     {
       $setToken(CreateNameToken(n1,n2,n3,n4));
     } 
   ;
   

Here, the "protected" rule NAME is a fragment that passes a stripped text up to its caller: with the exclamation marks (!), it was easy to strip some characters from the fragment. I have no idea how to write this in ANTLR3 in a straightforward way, i.e., without hacks that use internal variables.
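
For reference, the transformation that NAME performs on its own text can be written down as plain Java. This is only a sketch of the intended result, not an ANTLR3 solution: the helper name stripName is invented, and since GENERIC_TAIL is not shown above, the sketch assumes it is a backtick followed by digits.

```java
public class NameText {
    // Hypothetical helper: removes the surrounding single quotes and an
    // assumed generic tail (a backtick followed only by digits).
    static String stripName(String raw) {
        String s = raw;
        if (s.length() >= 2 && s.startsWith("'") && s.endsWith("'")) {
            s = s.substring(1, s.length() - 1);       // drop the quotes
        }
        int tick = s.lastIndexOf('`');
        if (tick >= 0 && tick < s.length() - 1
                && s.substring(tick + 1).chars().allMatch(Character::isDigit)) {
            s = s.substring(0, tick);                 // drop the generic tail
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripName("'Stack`1'"));
        System.out.println(stripName("'Stack'"));
    }
}
```

Both calls in main print Stack.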

Regards
Harald M.


