[antlr-interest] antlrworks confused by imaginary tokens?

Fri Mar 16 15:10:53 PDT 2012

Related question.  If I define all of the tokens in the tokens section,
what's the recommended method for dealing with whitespace?  It was my
understanding that ANTLR would use the WS token to process whitespace.
Clearly, this won't work inside the tokens section:

WS       :           (' '|'\t'|'\f'|'\n'|'\r')+{ skip(); };
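
For comparison, here is a minimal sketch of a layout that ANTLR 3 should
accept, assuming the Java target. Imaginary tokens and literal-only tokens
go in the tokens section, and WS stays an ordinary lexer rule after the
parser rules. The grammar name Sketch, the ID rule, and the body of
compilationUnit are hypothetical and are not taken from the real grammar.

```antlr
grammar Sketch;                  // hypothetical grammar name

options { output = AST; }        // needed for the -> rewrite below

tokens {
    STATEMENT_BLOCK;             // imaginary token: name only, no text
    OPEN_PAREN  = '(';           // real token: '=' and a single literal
    CLOSE_PAREN = ')';
}

// Parser rules come first, so the start rule is the first rule in the file.
compilationUnit
    :   OPEN_PAREN ID* CLOSE_PAREN -> ^(STATEMENT_BLOCK ID*)
    ;

// Anything beyond a single literal must be a lexer rule, written with ':'.
ID  :   ('a'..'z' | 'A'..'Z')+ ;
WS  :   (' ' | '\t' | '\f' | '\n' | '\r')+ { skip(); } ;
```

Keeping WS as a lexer rule is what lets it carry the skip() action; an
entry in the tokens section can only name a token or bind it to a literal.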

MR

On Fri, Mar 16, 2012 at 5:22 AM, Stefan Mätje <Stefan.Maetje at esd-electronics.com> wrote:

> Hi,
>
> Here are some hints on defining imaginary and real tokens in an ANTLR
> grammar.
>
> The ANTLR grammar for *.g files prescribes a certain order of sections in
> the grammar file, so your grammar file must follow this order. See this
> short excerpt from the ANTLR grammar for ANTLR grammar files:
>
> grammarDef
>    :   DOC_COMMENT?
>        (   'lexer'  {gtype=LEXER_GRAMMAR;}    // pure lexer
>        |   'parser' {gtype=PARSER_GRAMMAR;}   // pure parser
>        |   'tree'   {gtype=TREE_GRAMMAR;}     // a tree parser
>        |            {gtype=COMBINED_GRAMMAR;} // merged parser/lexer
>        )
>        g='grammar' id ';' optionsSpec? tokensSpec? attrScope* action*
>        rule+
>        EOF
>        -> ^( {adaptor.create(gtype,$g)}
>                  id DOC_COMMENT? optionsSpec? tokensSpec? attrScope* action* rule+
>                )
>    ;
>
> You see the tokensSpec follows the optionsSpec and so on ...
>
> The tokens section should look like the following example:
>
> tokens {
>        VIRTUAL_TOKEN1;
>        VIRTUAL_TOKEN2;
>        REAL_TEXT = 'TEXT';     // Only a single char / string allowed!
>        REAL_INFO = 'INFO';
> }
>
> Please note that there is an equals sign ("=") between the token name and
> the token text. This is a deviation from the syntax of a lexer rule, which
> uses a colon (":") to define a token.
>
> Please also note that only a single string or char literal is allowed in
> the tokens section. If you need something like a keyword that may have an
> abbreviated form, then you must use a lexer rule like this:
>
> KW_IDENT: ('IDENT' | 'IDENTICAL');
>
> Measured against these rules, the tokens section you posted violates both:
> it uses ":" instead of "=", and it contains a full lexer rule (WS) rather
> than only single literals.
>
> I hope that helps,
>        Stefan
>
> PS: You can find the ANTLR grammar for grammar files in the source
> distribution. Look for the file
> ./antlr-3.4/tool/src/main/antlr3/org/antlr/grammar/v3/ANTLRv3.g
>
>
> On 16.03.2012 05:03:44, Michael Roberts wrote:
> > I've been happily hacking on my little grammar using antlrworks.
> > Everything was going swimmingly until I introduced a section of imaginary
> > tokens for use in rewrite rules.  For some reason, antlr/antlrworks really
> > wanted this section of imaginary tokens at the start of the file, directly
> > after the options section.  It didn't seem to like it in other places, and
> > would not recognize the imaginary tokens otherwise.
> >
> > However, oddly, it didn't like it when I defined my regular tokens inside
> > the tokens section, and it refused to recognize them, flagging mismatched
> > token exceptions all over the place.  So, accepting defeat, I moved these
> > non-imaginary tokens back to the end of the file, where they'd previously
> > been living.  No missing tokens, everything generates fine now.
> >
> > However, when I attempt to debug my parser, the generated test code
> > references the first non-imaginary token it finds as the top level
> > construct, in my case CLOSE_PAREN, and not my top-level compilationUnit
> > production (which is ahead of it in the file).  Thus:
> >
> > public class __Test__ {
> >
> >     public static void main(String args[]) throws Exception {
> >         JLG2Lexer lex = new JLG2Lexer(new ANTLRFileStream(
> >             "C:\\src\\Core\\src\\org\\veve\\reflect\\interpreter\\output\\__Test___input.txt",
> >             "UTF8"));
> >         CommonTokenStream tokens = new CommonTokenStream(lex);
> >
> >         JLG2Parser g = new JLG2Parser(tokens, 49100, null);
> >         try {
> >             g.CLOSE_PAREN();   // <-- BAD, was expecting to see compilationUnit here ...
> >         } catch (RecognitionException e) {
> >             e.printStackTrace();
> >         }
> >     }
> > }
> >
> > So, my main question is: why doesn't this form of token definition
> > (below) work?
> >
> >
> > tokens
> > {
> >
> > // Imaginary tokens for AST rewrite ops
> > IDENTIFIER_PATH;
> > INVOCATION;
> > STATEMENT_BLOCK;
> > AMPERSAND_INVOCATION;
> > INVOCATION_STAT;
> > OBJECT;
> > ARRAY;
> > ELEMENT_STAT;
> > MEMBERS;
> > PAIR;
> > PAIR_LIST;
> > METHOD_INVOCATION;
> > NEW_COMMAND;
> > STRING;
> > NUMBER;
> > ARRAY;
> > BOOLEAN;
> > NULL;
> > PATH;
> >
> > // Real, defined tokens
> > CLOSE_PAREN : ')';
> > AMPERSAND : '@';
> > WS       :           (' '|'\t'|'\f'|'\n'|'\r')+{ skip(); };
> > COLON : ':';
> > EQUALS : '=';
> > INJECT : '<-';
> > COMMA : ',';
> > SLASH : '/';
> > OPEN_PAREN :    '(' ;
> > OPEN_BRACE   : '{';
> > CLOSE_BRACE : '}';
> > DOT : '.';
> > SEMI_COLON : ';';
> > BLOCK :   '|' ;
> > }
> >
> > Is the tokens section just for imaginary tokens, then? If not, how do I
> > define regular tokens in it? And, in essence, what could I possibly be
> > doing to so confuse the test jig generator code that it generates
> > something silly?
> >
> > MR
> >