[antlr-interest] Basic predicate question

John B. Brodie jbb at acm.org
Thu Jul 1 13:22:45 PDT 2010


Greetings!
On Thu, 2010-07-01 at 14:03 -0400, Zeafla, Larry wrote:
> I am new to Antlr, which I am trying to use to parse simple existing
> messages.  The message structure is exceptionally simple and
> straightforward.  Message fields include integer and floating-point
> numbers, single letter codes, and field separator characters.  Each
> individual message type has a narrowly defined structure, needs no look
> ahead, and typically has at most 2 possible tokens for any location in
> the message.
> 
Welcome!

Respectfully, in my opinion, using ANTLR for this task seems to be
overkill. Why not just read each message into a String, use the split()
method on the comma to get the fields, and then analyze the array
returned by split(",")? (Or maybe use regular expressions?)
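
A minimal sketch of the split() idea, assuming the field layout seen in
the three sample messages (TEST, an integer, a float followed directly by
A or B, then two hex digits). The class name MessageFields, the parse()
helper and its space-separated return format are made up for
illustration; they are not part of any existing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MessageFields {
    // Assumed layout (from the sample input only): TEST,<int>,<float><A|B>,<hex><hex>
    private static final Pattern FLOAT_AND_CODE =
            Pattern.compile("(\\d+(?:\\.\\d*)?)([AB])");
    private static final Pattern HEX_BYTE = Pattern.compile("[0-9A-Fa-f]{2}");

    /** Parses one message and returns "count value code hexValue"; throws on malformed input. */
    static String parse(String message) {
        String[] fields = message.trim().split(",");
        if (fields.length != 4 || !fields[0].equals("TEST")) {
            throw new IllegalArgumentException("not a TEST message: " + message);
        }
        int count = Integer.parseInt(fields[1]);

        // The float and the single-letter code share one comma-separated field.
        Matcher m = FLOAT_AND_CODE.matcher(fields[2]);
        if (!m.matches()) {
            throw new IllegalArgumentException("bad float/code field: " + fields[2]);
        }
        double value = Double.parseDouble(m.group(1));
        char code = m.group(2).charAt(0);

        // The last field is hex purely because of its position, so read it as base 16.
        if (!HEX_BYTE.matcher(fields[3]).matches()) {
            throw new IllegalArgumentException("bad hex field: " + fields[3]);
        }
        int hexValue = Integer.parseInt(fields[3], 16);

        return count + " " + value + " " + code + " " + hexValue;
    }

    public static void main(String[] args) {
        String[] samples = { "TEST,123,5.6A,2D", "TEST,321,4.20A,3B", "TEST,45,5.68B,78" };
        for (String s : samples) {
            System.out.println(s + " -> " + parse(s));
        }
    }
}
```

Because the position in the array tells you which field you are looking
at, "78" in the last slot is read as hex (120) without any ambiguity.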
 

> My problem is that one of the fields is a 2-digit (in ASCII)
> representation of a hex number.  This is known purely from context.  It
> seems there should be a simple technique (probably a predicate), to
> force this behavior.  I just can't seem to find it.
> 
>  
> 
> Here is a short sample grammar to illustrate:
> 
>           grammar sample;
>           prog   :   test+ ;
>           test    :   'TEST' COMMA INT COMMA FLOAT ( 'A' | 'B' )
>                               COMMA HEX_DIGIT  HEX_DIGIT    ;
> 
>           HEX_DIGIT   :  '0'..'9' | 'A'..'F' | 'a'..'f'  ;
>           INT         :  '0'..'9'+ ;
>           FLOAT       :  '0'..'9'+ ('.' '0'..'9'*)? ; 
>           COMMA       :  ',' ;
> 
> The associated test input is:
> 
>           TEST,123,5.6A,2D
>           TEST,321,4.20A,3B
>           TEST,45,5.68B,78
> 
> 
> 
> For this example, the hex digits are the last 2 characters on each line.
> For the first test statement, parsing is successful.  For the second, I
> get a MismatchedTokenException (0!=0) on the B (the last character).
> For the third, I get a MismatchedTokenException(0!=0)  on the 7 (the
> next to last character).  I am definitely confused.

As pointed out in another message in this thread, you have specified
that 'A' and 'B' are keywords in your language and yet you also want
them to be HEX_DIGITs. The lexer tokenizes without knowing what the
parser expects next, so it cannot work out this ambiguity (I believe).
Same problem with '0'..'9': are they a HEX_DIGIT or are they a
single-digit INT?
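
To make the ambiguity concrete, here is a toy tokenizer in plain Java.
It is NOT ANTLR's generated lexer (which predicts rules differently);
it is a simplified model that picks the longest match and breaks ties by
rule order. It also assumes that the literals 'TEST', 'A' and 'B' act
as lexer rules tried before the named ones. The class name ToyLexer and
the method tokenTypes() are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ToyLexer {
    // Rules in priority order; the literals come first (an assumption of this model).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'TEST'", Pattern.compile("TEST"));
        RULES.put("'A'", Pattern.compile("A"));
        RULES.put("'B'", Pattern.compile("B"));
        RULES.put("HEX_DIGIT", Pattern.compile("[0-9A-Fa-f]"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("FLOAT", Pattern.compile("[0-9]+(\\.[0-9]*)?"));
        RULES.put("COMMA", Pattern.compile(","));
    }

    /** Returns the token type chosen at each position, with no knowledge of parser context. */
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer wins, so the earlier rule keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            types.add(bestType);
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        String[] samples = { "TEST,123,5.6A,2D", "TEST,321,4.20A,3B", "TEST,45,5.68B,78" };
        for (String s : samples) {
            System.out.println(s + " -> " + tokenTypes(s));
        }
    }
}
```

In this model the tail of the first message becomes HEX_DIGIT HEX_DIGIT,
the second ends in HEX_DIGIT 'B', and the third ends in a single INT for
"78". That is consistent with the errors reported on the B and on the 7,
though the real lexer may reach the same tokens by a different route.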

If you really, really want to do this task using ANTLR (see the above rant
regarding split() and regexes), I think you will have to do all of the
work in the parser.

Usually, manipulating individual characters in parser rules quickly leads
to parsing ambiguities. But your problem as stated seems to be simple
enough that it will not be a problem (unless you are going to add more
stuff).

Attached please find an alternative version of your sample grammar that
illustrates this approach. It was tested with just your 3 sample inputs.

Hope this helps...
   -jbb

-------------- next part --------------
grammar Sample;

options {
   output = AST;
   ASTLabelType = CommonTree;
}

@members {
   // The three sample messages from the original question.
   private static final String [] x = new String[] {
      "TEST,123,5.6A,2D",
      "TEST,321,4.20A,3B",
      "TEST,45,5.68B,78"
   };

   public static void main(String [] args) {
      for( int i = 0; i < x.length; ++i ) {
         try {
            System.out.println("about to parse:`"+x[i]+"`");
            SampleLexer lexer = new SampleLexer(new ANTLRStringStream(x[i]));
            CommonTokenStream tokens = new CommonTokenStream(lexer);

            SampleParser parser = new SampleParser(tokens);
            SampleParser.start_return p_result = parser.start();

            CommonTree ast = p_result.tree;
            if( ast == null ) {
               System.out.println("resultant tree: is NULL");
            } else {
               System.out.println("resultant tree: " + ast.toStringTree());
            }
            System.out.println();
         } catch(Exception e) {
            e.printStackTrace();
         }
      }
   }
}

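// Every character is lexed as its own token (DIGIT or a single letter), so
// the parser, which knows the context, decides what each character means.
start   :   test+ ;
test    :   T E S T comma integer comma flonum ( A | B )
                   comma hex_digit  hex_digit    ;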

hex_digit   :  DIGIT | A | B | C | D | E | F ;
integer     :  DIGIT+ ;
flonum      :  DIGIT+ ('.' DIGIT*)? ; 
comma       :  ',' ;

DIGIT : '0'..'9';

A : 'A' | 'a' ;
B : 'B' | 'b' ;
C : 'C' | 'c' ;
D : 'D' | 'd' ;
E : 'E' | 'e' ;
F : 'F' | 'f' ;
G : 'G' | 'g' ;
H : 'H' | 'h' ;
I : 'I' | 'i' ;
J : 'J' | 'j' ;
K : 'K' | 'k' ;
L : 'L' | 'l' ;
M : 'M' | 'm' ;
N : 'N' | 'n' ;
O : 'O' | 'o' ;
P : 'P' | 'p' ;
Q : 'Q' | 'q' ;
R : 'R' | 'r' ;
S : 'S' | 's' ;
T : 'T' | 't' ;
U : 'U' | 'u' ;
V : 'V' | 'v' ;
W : 'W' | 'w' ;
X : 'X' | 'x' ;
Y : 'Y' | 'y' ;
Z : 'Z' | 'z' ;

