[antlr-interest] Parsing (a+b)+(c+d)

Sun Aug 17 11:54:35 PDT 2008

Tom,

There are some important differences between LALR and LL, and you will 
kind of have to bump up against them as you go along -- hearing about 
them doesn't really get it burned into your brain.

The "big thing" with expression parsing in antlr is implementing 
precedence and associativity. (If you don't know what those two words 
mean in this context, try Wikipedia.)

The simplest approach is to build a new rule for each level of 
precedence. In your (a+b)+(c+d) example, you have three different things 
going on:

1. Identifiers
2. The '+' operator
3. Parenthesized sub-expressions.

In general, an identifier is an atom -- it cannot be further divided for 
parsing, so it is the innermost thing. But a (subexpr) is also an atom 
for purposes of expression evaluation. So you'll see a lot of expression 
parsing examples that look like this:

atom: IDENTIFIER | '(' expression ')' ;

That represents the innermost layer of precedence. You proceed to wrap 
things around it. But one issue you have to stay aware of is that for a 
complex grammar, you aren't REQUIRED to have a certain operator in use. 
That is, just because you support both '+' and '*' doesn't mean that 
every expression has to use them. So your first consideration has to be 
for "passing through" an expression that doesn't use a particular operation.

plus_expr : atom ;

Then you make the operator, and the rest of the expression, optional:

plus_expr : atom ( '+' atom)? ;

But most languages support a+b+c syntax without parens, so you'll have 
to support more than just zero-or-one operators:

plus_expr : atom ('+' atom)* ;    // Note '*' instead of '?' at end

Then tie it up with a nice bow on it, by providing a top-level entry point:

expression: plus_expr ;
plus_expr : atom ('+' atom)* ;    // Note '*' instead of '?' at end
atom: IDENTIFIER | '(' expression ')' ;

One thing this does not address is associativity. If you have a 
left-associative operator (and most 'math' is left-assoc) you can build 
your tree as you go along:

plus_expr
    : ( atom -> atom )
      ( ( '+' atom ->  ^('+' $plus_expr atom) )
     // | ( '-' etc. )
      )*
    ;

This takes a+b+c and yields an AST like (+ (+ a b) c). The $plus_expr 
takes the "present result" of the parsing and inserts it into a larger 
AST expression.

On the other hand, exponentiation is right-associative. a**b**c should 
eval as (** a (** b c)). This is a question of mathematical convenience. 
If the associativity were left, then
a**b**c would be the same as a**(bc), which reduces the expressive power 
of the notation. Mathematicians are all about expressive power.

Since the expression is right-associative, you just have to recognize 
the leftmost bits, grab them, and recurse. BUT you have to make sure you 
have one absolutely distinctive leftmost bit to grab. If your left side 
is ambiguous, you're in hell. Make sure, if possible, that there's an 
explicit operator or token to use as the signal.

expon_expr : atom '**' expon_expr ;

Of course, the operator isn't required:

expon_expr
    : atom ('**' expon_expr  -> ^('**' atom expon_expr)
             | /*nothing*/ -> atom)
    ;

Finally, two things to keep in mind in "expression-land." First, in C 
and similar languages, there are some cases where the syntax changes if 
an operator is not used. In particular, assignment expressions (a = b) 
are lower priority than conditional expressions (a ? b : c). But a 
conditional expression is NOT a legal Lvalue, so an assignment has a 
different syntax than a pass-thru:

assignment
    : lvalue '=' assignment
    | conditional //passthru
    ;

Second, keep in mind that a token is not a magic cookie. Tokens get 
reused, a LOT. Sometimes it isn't completely obvious. C, et al, use 
commas as an operator and to delimit lists. You will have to handle 
these cases separately, and you will have to provide a entry point into 
your expression hierarchy so that a parse context that needs commas for 
list delimiters can skip over the comma-as-operator part of the 
expression chain. (Similarly, parens are both postfix-operators for 
calling functions, and circumfix operators for delimiting subexpressions.)

Good luck,

=Austin

Ana Nelson wrote:
> Tom,
>
> This example grammar might give you an idea:
> http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator
>
> Regards,
> Ana
>
>
> On 17 Aug 2008, at 11:47, Tom Edwards wrote:
>
>> Hi,
>>
>> Sorry if this is a bit of newb question however I am pretty new to
>> language translation and I can't seem to find the answer to this
>> question. I am trying to parse the following simple expression:
>>
>> (a+b)+(c+d)
>>
>> With the following grammar:
>>
>> grammar Vector;
>> options {
>>       output=AST;
>>       ASTLabelType=CommonTree;
>> }
>>
>> prog    :       (stmt
>> {System.out.println($stmt.tree.toStringTree());} )+;
>> stmt    :       expr NEWLINE?;
>>
>> expr    :       IDENTIFIER exprD
>>       |       '(' expr ')';
>> exprD   :       OP (exprD|IDENTIFIER);
>>
>>
>> OP      :       '+';
>>
>> IDENTIFIER
>>       :       ('a'..'z'|'A'..'Z');
>> NEWLINE :        '\r'?'\n';
>>
>> Which is based of an example I have found on the website.
>> The problem is that as Antlr is LL (am I right here?) you cannot have
>> left recursion, which is fine, however I am not sure how you can
>> construct something to parse the original problem. Splitting the
>> problem into expr and exprD is a technique I have borrowed from the
>> "dragon" book, however in this case it does work very well, but not
>> with the parenthesis.
>>
>> Again, sorry if this is a silly question however it is bugging me and
>> I don't have any other form of support.
>>
>> Kind regards and thanks for any help you can offer.
>>
>> Tom
>>
>
>
>
>