[antlr-interest] newbie needs help

Thu Jan 21 13:25:24 PST 2010

Greetings!

On Thu, 2010-01-21 at 20:20 +0100, Hugo wrote:
> I started using ANTLR to parse a specific file format.
> The problem is that I don't know how to write my grammar correctly.
> 
> The file has the following format.
> It contains multiple lines, and each line can take one of the following forms:
> 
> One or more hexadecimal bytes, with or without spaces between them
> ex: A0 A4 B5 77
> or: A0
> 
> A single variable identifier of the form VAR_XXX
> ex: VAR_MY_VARIABLE
> 
> Or a combination of the two previous formats
> ex:
> A0 A4B5 VAR_MY_VARIABLE 77 98 VAR_MY_VARIABLE2
> or
> VAR_MY_VARIABLE AA BB
> or
> AA BB VAR_MY_VARIABLE
> 
> 
> What I want to do is build an AST.

Attached please find a grammar file that does *almost* what I think you
are trying to do.

It does not have a MULTIPLE_BYTES_DEF node because the grouping of a
collection of single_byte instances into a multibyte is ambiguous.
Consider

11 22 33 44 55 66 77 88

Is this 8 single bytes? 1 single byte and a 7-long multi? 4 multi
pairs? A triple, a single and a quad?
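
To make the ambiguity concrete, here is a small Java sketch (not part of the attached grammar; the class and method names are made up for illustration) that lists every way a run of bytes can be split into consecutive, non-empty groups. A run of n bytes has 2^(n-1) such splits, so a grammar that allows both single and multi groupings has no basis for choosing one of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ByteGroupings {

    // Returns every way to split `bytes` into consecutive, non-empty groups.
    static List<List<List<String>>> groupings(List<String> bytes) {
        List<List<List<String>>> result = new ArrayList<>();
        if (bytes.isEmpty()) {
            // Exactly one way to split nothing: the empty grouping.
            result.add(new ArrayList<>());
            return result;
        }
        // Choose the length of the first group, then split the remainder.
        for (int len = 1; len <= bytes.size(); len++) {
            List<String> first = bytes.subList(0, len);
            for (List<List<String>> rest : groupings(bytes.subList(len, bytes.size()))) {
                List<List<String>> grouping = new ArrayList<>();
                grouping.add(first);
                grouping.addAll(rest);
                result.add(grouping);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the 4 groupings of a 3-byte run, one per line.
        for (List<List<String>> g : groupings(Arrays.asList("11", "22", "33"))) {
            System.out.println(g);
        }
        // Prints 128 for the 8-byte example.
        System.out.println(
            groupings(Arrays.asList("11", "22", "33", "44", "55", "66", "77", "88")).size());
    }
}
```

Even the 3-byte case already has four readings: three singles, a single plus a pair, a pair plus a single, or one triple.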

I kind of expect you want it to be a single 8-long multi, i.e. any run of
single bytes becomes a multi. But that is a question of your language's
semantics, and getting a parser to do semantics isn't always possible.

If you really need the MULTIPLE_BYTES_DEF node, you might be best served
by parsing with something like my grammar (i.e. the parser produces only
BYTES_DEF nodes) and then writing a tree walker that transforms the AST
resulting from the parse into a new AST that contains the requisite
MULTIPLE_BYTES_DEF nodes, i.e. scan for and collapse sequences of
consecutive EXPR_DEF nodes that have BYTES_DEF children into a single
EXPR_DEF node containing a single MULTIPLE_BYTES_DEF child.
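
The collapsing step described above might look roughly like the following. It is only a sketch under stated assumptions: `AstNode` is a hypothetical stand-in for ANTLR's CommonTree so that the example runs without the ANTLR jar, the input is the nested EXPR_DEF chain that the attached grammar builds for one line (each EXPR_DEF holds its item plus the rest of the line, ending at NEWLINE), and a lone byte is left as a plain BYTES_DEF while runs of two or more become a MULTIPLE_BYTES_DEF. That last rule is my guess at the intended semantics, based on the original grammar requiring at least two bytes per multi.

```java
import java.util.ArrayList;
import java.util.List;

final class CollapseByteRuns {

    // Hypothetical minimal tree node; a real tree walker would operate on
    // org.antlr.runtime.tree.CommonTree instead.
    static final class AstNode {
        final String label;
        final List<AstNode> children = new ArrayList<>();

        AstNode(String label, AstNode... kids) {
            this.label = label;
            for (AstNode k : kids) children.add(k);
        }

        // LISP-style rendering: leaves print their label, others "(label kids)".
        String toStringTree() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (AstNode c : children) sb.append(' ').append(c.toStringTree());
            return sb.append(')').toString();
        }
    }

    // Helpers that build the nested shape ^(EXPR_DEF ^(KIND text) tail).
    static AstNode byteExpr(String hex, AstNode tail) {
        return new AstNode("EXPR_DEF", new AstNode("BYTES_DEF", new AstNode(hex)), tail);
    }

    static AstNode varExpr(String name, AstNode tail) {
        return new AstNode("EXPR_DEF", new AstNode("DEFINE_VAR_DEF", new AstNode(name)), tail);
    }

    // Walks one line's EXPR_DEF chain and returns a flat list of EXPR_DEF
    // nodes in which each run of 2+ consecutive bytes has been collapsed.
    static List<AstNode> collapse(AstNode chain) {
        List<AstNode> out = new ArrayList<>();
        List<AstNode> run = new ArrayList<>();   // pending BYTES_DEF nodes
        for (AstNode n = chain; n.label.equals("EXPR_DEF"); n = n.children.get(1)) {
            AstNode head = n.children.get(0);
            if (head.label.equals("BYTES_DEF")) {
                run.add(head);
            } else {
                flush(run, out);                 // a variable ends the current run
                out.add(new AstNode("EXPR_DEF", head));
            }
        }
        flush(run, out);                         // the NEWLINE leaf ends the last run
        return out;
    }

    // Emits the pending run (if any) as one EXPR_DEF and clears it.
    private static void flush(List<AstNode> run, List<AstNode> out) {
        if (run.isEmpty()) return;
        AstNode payload = run.size() == 1
            ? run.get(0)
            : new AstNode("MULTIPLE_BYTES_DEF", run.toArray(new AstNode[0]));
        out.add(new AstNode("EXPR_DEF", payload));
        run.clear();
    }

    public static void main(String[] args) {
        // Chain for the line "A0 A4 VAR_X 77".
        AstNode line = byteExpr("A0", byteExpr("A4",
            varExpr("VAR_X", byteExpr("77", new AstNode("NEWLINE")))));
        for (AstNode n : collapse(line)) System.out.println(n.toStringTree());
        // (EXPR_DEF (MULTIPLE_BYTES_DEF (BYTES_DEF A0) (BYTES_DEF A4)))
        // (EXPR_DEF (DEFINE_VAR_DEF VAR_X))
        // (EXPR_DEF (BYTES_DEF 77))
    }
}
```

If you would rather have every run become a multi, including a run of one, drop the `run.size() == 1` special case in `flush`.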

> 
> And the problem is that I don't know how to do this with ANTLR. The tool
> always tells me that multiple rules can be applied with my grammar.
> 
> Please help me solve my problem.
> 
> Here is my grammar:
> 
> stmts               : bytes+ ;
> 
> 
> bytes : multiple_byte bytes? -> ^(EXPR_DEF multiple_byte  bytes? )
> 
> | define_expression bytes? -> ^(EXPR_DEF define_expression bytes? )
> 
> | NEWLINE ;
> 
> define_expression : define_var -> ^(DEFINE_VAR_DEF define_var) ;
> 
> define_var : DEFINE_VARIABLE ;
> multiple_byte : single_byte (single_byte)+ -> ^(MULTIPLE_BYTES_DEF
> single_byte single_byte+) ;
> 
> 
> single_byte : byte_digit -> ^(BYTES_DEF byte_digit) ;
> 
> byte_digit : BYTE_DIGIT ;
> 
> DEFINE_VARIABLE :
> 'VAR_'('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
> 
> BYTE_DIGIT :('0'..'9'| 'A'..'F'|'a'..'f')('0'..'9'| 'A'..'F'|'a'..'f') ;
> 
> // Ignore whitespace, tab and escape sequence
> WS : (' '|'\t'|'\\\r\n')+ {$channel = HIDDEN;} ;
> 
> // a new line
> NEWLINE : '\r'? '\n' ;
> 
> thanks a lot

hope this helps...
   -jbb

-------------- next part --------------
grammar Test;

options {
   output = AST;
   ASTLabelType = CommonTree;
}

tokens {
   EXPR_DEF;
   DEFINE_VAR_DEF;
   BYTES_DEF;
}

@members {
   // Sample input lines, taken from the examples in the original question.
   private static final String [] x = new String[]{
      "A0\n",
      "A0 A4 B5 77\n",
      "VAR_MY_VARIABLE\n",
      "A0 A4B5 VAR_MY_VARIABLE 77 98 VAR_MY_VARIABLE2\n",
      "VAR_MY_VARIABLE AA BB\n",
      "AA BB VAR_MY_VARIABLE\n"
   };

   public static void main(String [] args) {
      for( int i = 0; i < x.length; ++i ) {
         try {
            System.out.println("about to parse:`"+x[i]+"`");
            TestLexer lexer = new TestLexer(new ANTLRStringStream(x[i]));
            CommonTokenStream tokens = new CommonTokenStream(lexer);

            TestParser parser = new TestParser(tokens);
            TestParser.stmts_return p_result = parser.stmts();

            CommonTree ast = p_result.tree;
            if( ast == null ) {
               System.out.println("resultant tree: is NULL");
            } else {
               System.out.println("resultant tree: " + ast.toStringTree());
            }
            System.out.println();
         } catch(Exception e) {
            e.printStackTrace();
         }
      }
   }
}

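// One or more lines, then end of input; EOF is kept out of the AST.
stmts : bytes+ EOF!;

// A line is a right-recursive chain: each byte or variable becomes an
// EXPR_DEF whose last child is the rest of the line, ending at NEWLINE.
bytes
   : ( b=BYTE_DIGIT t=bytes -> ^(EXPR_DEF ^(BYTES_DEF $b) $t) )
   | ( d=DEFINE_VARIABLE t=bytes -> ^(EXPR_DEF ^(DEFINE_VAR_DEF $d) $t) )
   | NEWLINE ;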

fragment LETTER :  'a' .. 'z' | 'A' .. 'Z' ;
fragment DIGIT : '0'.. '9' ;
DEFINE_VARIABLE : 'VAR_' (LETTER|'_') (LETTER | DIGIT | '_')*;

fragment HEXIT : '0'..'9' | 'A'..'F' | 'a'..'f' ;
BYTE_DIGIT : HEXIT HEXIT ;

// Hide spaces, tabs and backslash + CRLF line continuations from the parser
WS : (' '|'\t'|'\\\r\n')+ {$channel = HIDDEN;} ;

// a new line
NEWLINE : '\r'? '\n' ;