[antlr-interest] antlrworks confused by imaginary tokens?

Fri Mar 16 05:22:02 PDT 2012

Hi,

here are some hints on defining imaginary and real tokens in an ANTLR grammar.

The ANTLR grammar for *.g files prescribes a fixed order for the sections of a 
grammar file, so your grammar must follow that order. See this short excerpt 
from the ANTLR grammar for ANTLR grammar files:

grammarDef
    :   DOC_COMMENT?
        (   'lexer'  {gtype=LEXER_GRAMMAR;}    // pure lexer
        |   'parser' {gtype=PARSER_GRAMMAR;}   // pure parser
        |   'tree'   {gtype=TREE_GRAMMAR;}     // a tree parser
        |            {gtype=COMBINED_GRAMMAR;} // merged parser/lexer
        )
        g='grammar' id ';' optionsSpec? tokensSpec? attrScope* action*
        rule+
        EOF
        -> ^( {adaptor.create(gtype,$g)}
                  id DOC_COMMENT? optionsSpec? tokensSpec? attrScope* action* rule+
                )
    ;

As you can see, tokensSpec follows optionsSpec and comes before any attribute 
scopes, actions and rules.
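
As a rough illustration of that order, here is a minimal combined grammar. The 
grammar name Demo and all rule and token names in it are made up for this 
example, and I have not run it through the tool:

grammar Demo;

options {
	output = AST;		// 1. options section
}

tokens {
	LIST;			// 2. tokens section: imaginary token
	COMMA = ',';		//    real token with a single literal
}

@header {
	package demo;		// 3. actions
}

// 4. rules (parser and lexer) come last
list	: INT (COMMA INT)* -> ^(LIST INT+) ;

INT	: ('0'..'9')+ ;
WS	: (' '|'\t'|'\r'|'\n')+ { skip(); } ;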

The tokens section should look like the following example:

tokens {
	VIRTUAL_TOKEN1;
	VIRTUAL_TOKEN2;
	REAL_TEXT = 'TEXT';	// Only a single char / string allowed!
	REAL_INFO = 'INFO';
}

Please note that there is an equals sign ("=") between the token name and the 
token text! This is a deviation from the syntax of a lexer rule, which uses a 
colon (":") to define a token.

Please also note that the tokens section allows only a single string or char 
literal per token. If you need something like a keyword which may have an 
abbreviated form, then you must use a lexer rule like this:

KW_IDENT: ('IDENT' | 'IDENTICAL');

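Your posted tokens section violates both rules: it uses ":" instead of "=" 
for the real tokens, and it contains a full lexer rule (WS, with alternatives 
and an action) where only a single literal is allowed.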
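
A sketch of how the posted section might be rearranged under these rules. It 
is shortened and not tested against your grammar. The comments containing 
"..." stand for the entries I left out:

tokens {
	// Imaginary tokens for AST rewrite ops: name only
	IDENTIFIER_PATH;
	INVOCATION;
	STATEMENT_BLOCK;
	// ... remaining imaginary tokens ...

	// Real tokens: name = single literal
	CLOSE_PAREN = ')';
	OPEN_PAREN  = '(';
	AMPERSAND   = '@';
	COLON       = ':';
	INJECT      = '<-';
	// ... remaining literal tokens ...
}

// ... parser rules ...

// WS is more than a single literal, so it stays a lexer rule
WS : (' '|'\t'|'\f'|'\n'|'\r')+ { skip(); } ;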

I hope that helps,
	Stefan

PS: You can find the ANTLR grammar for grammar files in the source 
distribution. Look for the file 
./antlr-3.4/tool/src/main/antlr3/org/antlr/grammar/v3/ANTLRv3.g

On 16.03.2012 05:03:44, Michael Roberts wrote:
> I've been happily hacking on my little grammar using antlrworks.
>  Everything was going swimmingly until I introduced a section of imaginary
> tokens for use in rewrite rules.  For some reason, antlr/antlrworks really
> wanted this section of imaginary tokens at the start of the file, directly
> behind the options section.  It didn't seem to like it in other places, and
> would not recognize the imaginary tokens otherwise.
> 
> However, oddly, it didn't like it if I defined my regular tokens inside the
> tokens sections and refused to recognize them, flagging mismatched token
> exceptions all over the place.  So, accepting defeat, I moved these
> non-imaginary tokens back to the end of the file, where they'd previously
> been living.  No missing tokens, everything generates fine now.
> 
> However, when I attempt to debug my parser, the generated test code
> references the first non-imaginary token it finds as the top level
> construct, in my case CLOSE_PAREN, and not my top-level compilationUnit
> production (which is ahead of it in the file).  Thus:
> 
> public class __Test__ {
> 
>     public static void main(String args[]) throws Exception {
>         JLG2Lexer lex = new JLG2Lexer(new ANTLRFileStream(
>             "C:\\src\\Core\\src\\org\\veve\\reflect\\interpreter\\output\\__Test___input.txt",
>             "UTF8"));
>         CommonTokenStream tokens = new CommonTokenStream(lex);
> 
>         JLG2Parser g = new JLG2Parser(tokens, 49100, null);
>         try {
>             g.CLOSE_PAREN();   // <-- BAD, was expecting to see compilationUnit here ...
>         } catch (RecognitionException e) {
>             e.printStackTrace();
>         }
>     }
> }
> 
> So, my main question is ..  why doesn't this form of token definition
> (below) work:
> 
> 
> tokens
> {
> 
> // Imaginary tokens for AST rewrite ops
> IDENTIFIER_PATH;
> INVOCATION;
> STATEMENT_BLOCK;
> AMPERSAND_INVOCATION;
> INVOCATION_STAT;
> OBJECT;
> ARRAY;
> ELEMENT_STAT;
> MEMBERS;
> PAIR;
> PAIR_LIST;
> METHOD_INVOCATION;
> NEW_COMMAND;
> STRING;
> NUMBER;
> ARRAY;
> BOOLEAN;
> NULL;
> PATH;
> 
> // Real, defined tokens
> CLOSE_PAREN : ')';
> AMPERSAND : '@';
> WS       :           (' '|'\t'|'\f'|'\n'|'\r')+{ skip(); };
> COLON : ':';
> EQUALS : '=';
> INJECT : '<-';
> COMMA : ',';
> SLASH : '/';
> OPEN_PAREN :    '(' ;
> OPEN_BRACE   : '{';
> CLOSE_BRACE
> :   '}';
> DOT
> : '.';
> SEMI_COLON
> : ';';
> BLOCK :   '|' ;
> }
> 
> is the token section just for imaginary tokens, then, and, if not how do I
> define regular tokens in it .. and, in essence, what could I possibly be
> doing to so confuse the test jig generator code so that it's generating
> something silly?
> 
> MR
> 