[antlr-interest] Can't generate the AST I want

Scott Smith ssmith at mainstreamdata.com
Mon Aug 8 16:36:34 PDT 2011


I will take a look at this.  Thanks for sending it.  I appreciate all of your help.

Interestingly (or unfortunately, depending on your perspective), my understanding is that the grammar I'm trying to parse doesn't have precedence.  It does support parentheses through which you can control precedence like you do in most languages.  

I've read one discussion which said (absent parentheses) the last operator wins.  So (a AND b OR c) would be interpreted as (a AND (b OR c)) but (a OR b AND c) would be interpreted as (a OR (b AND c)).  In fact, I just verified on our system that that's exactly how it works.  (a AND b OR c AND d) would be interpreted as (a AND (b OR (c AND d))).  Since this is an open source project (SOLR - lucene.apache.org), I don't have the option of changing how it works though it does seem a bit odd.

My experience is that people use a lot of parentheses with this system.  It's also very common to see long strings of items where the operator is the same (e.g., (a OR b OR c OR d... OR w)) though each of the terms (e.g., "a") could be several things ANDed together, but with parentheses around them to ensure they are treated as a single term.  This is why I want the AST to be (OR a b c d ... w).  For what I need to do (common sub-expression factoring/elimination), I think it's easier to deal with.

Thanks again

Scott

-----Original Message-----
From: John B. Brodie [mailto:jbb at acm.org] 
Sent: Monday, August 08, 2011 4:44 PM
To: Scott Smith
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Can't generate the AST I want

On Mon, 2011-08-08 at 19:41 +0000, Scott Smith wrote:
> I am writing a grammar to generate ASTs.  There are operators called "AND" and "OR" (defined in the lexer) in the language.  I want my parser to produce ASTs that group together all of the terms which are using the same operator.
> 
> Also, there can be cases where there is no operator given.  In that case, depending on the program configuration, it may be "AND" or "OR".  I've called it DFLT_OP for the time being.  Here are some examples:
> 
> (a OR b OR c) -> (OR a b c)
> a b -> (DFLT_OP a b)                                                       // no operator
> a AND (b OR c) AND d ->(AND a (OR b c) d) a AND b OR c AND d ->(AND 
> (OR (AND a b) c) d)  // I think this one is right; I don't need to 
> reorder terms because AND was used twice

i do not think this is correct.
you have (a AND b OR c AND d) as (((a AND b) OR c) AND d) but i think most people would expect it to be ((a AND b) OR (c AND d))

e.g. AND has higher precedence than OR (i think)

> 
> The grammar I've generated (somewhat simplified) is below.  I think everything is pretty vanilla except for the definition of "expression":
> 
> grammar testGr;
> 
> options {
>   language = Java;
>   output = AST;
>   ASTLabelType = CommonTree;
> }
> 
> tokens {
>   DFLT_OP;
> }
> 
> @header {
>   package a.b.c;
> }
> 
> @lexer::header {
>   package a.b.c;
> }
> 
> filter     :    expression EOF ;
> 
> term :    IDENTIFIER
>      |    '('! expression ')'!
>      ;
> 
> // parsing "(a AND b AND c)" (w/o quotes)
> 
> // ATTEMPT 1
> // This gives (AND (AND a b) c).  This is correct, but what I // 
> really want is (AND a b c) //expression
> //   :    term ((AND^ | OR^)? term)*
> //   ;
> 
> // ATTEMPT2
> // This doesn't compile due to "recursive rule invocations".  I'm // 
> also not thrilled the term order changes //expression
> //   :    (id1=term OR)+ id2=term -> ^(OR id2 id1+)
> //   |    (id1=term AND)+ id2=term -> ^(AND id2 id1+)
> //   |    id1=term id2=term+  -> ^(DFLT_OP id1 id2+)
> //   |    term -> term
> //   ;
> 
> // ATTEMPT3
> // This compiles but appears to produce (AND a c).  No "b".

you want to use the `+=` meta operator (not `=`) when collecting a list of elements on the left of the `->` meta operator. e.g. id2+=term not id2=term. using the = on a list just gets the last element of the list as you have observed.

> // expression
> //options {
> //   backtrack=true;
> //}
> //   :    id1=term (OR id2=term)+ -> ^(OR $id1 $id2+)
> //   |    id1=term (AND id2=term)+ -> ^(AND $id1 $id2+)
> //   |    id1=term id2=term+  -> ^(DFLT_OP $id1 $id2+)
> //   |    term -> term
> //   ;
> //
> 
> // This doesn't compile due to "recursive rule invocations".
> expression
>      :    id1=term OR id2=term (OR id3=term)+ -> ^(OR $id1 $id2 $id3+)
>      |    id1=term AND id2=term (AND id3=term)+ -> ^(AND $id1 $id2 $id3+)
>      |    id1=term id2=term+  -> ^(DFLT_OP $id1 $id2+)
>      |    term -> term
>      ;
> 
> AND  :    'AND' | '&&' ;
> OR   :    'OR' | '||' ;
> IDENTIFIER :    LETTER (LETTER | NUMBER)* ;
> fragment LETTER :    LOWER | UPPER ;
> fragment LOWER  :    'a'..'z' ;
> fragment UPPER  :    'A'..'Z' ;
> fragment NUMBER :    '0'..'9' ;
> WS  : (' ' | '\t' | '\n' | '\r') {$channel=HIDDEN; } ;
> 
> Is there a way to tell ANTLR what I want to do?  How should I write the expression?
> 
> Any thoughts about the minor problem of substituting AND or OR instead of the DFLT_OP.  I figured out how to pass a Boolean to my parser which tells it which one of these is the correct one.  Can I write something like:
> 
>                 ({dftlOpAND ? AND : OR} $id1 $id2)

I am not sure that you can do this in constructing a root of a sub-tree.

attached please find a complete grammar that solves all 3 of the problems identified:
        a) precedence of AND over OR
        b) making AND and OR n-ary trees rather than binary
        c) inserting an imaginary AND or OR as root when no connective
           operator is present.

Hope this helps...
   -jbb



More information about the antlr-interest mailing list