[antlr-interest] Expression parsing ideas for ANTLR v4

Sun Feb 21 20:03:41 PST 2010

Hi Sam,

I've followed your posts about expression parsing with much interest - both 
this one to which I'm replying and your earlier ones in Aug 2008 - but only 
now have starting looking into speeding up a parser using these ideas.

The code you posted is really helpful.  I can easily see how it works and 
should have no problem implementing it, especially because my expressions 
only really have left-associative binary operators (I don't need to optimize 
the handling of the unary ops and there are no assignments or ternary 
conditional).

But there's one aspect of both your newer code and the code that you posted 
in 2008 that either I'm not getting, or is incorrect: it really looks to me 
like the trees you're building for left-associative binary operators are 
treating those ops as right-associative, and the ops marked 
right-associative are getting left-associative trees made for them.  Am I 
wrong, or what have I missed?

The code scans the operators from left to right, finding the first 
occurrence (for left-assoc operators) of the lowest-precedence operator, and 
then building a tree with that op as the root and the parts of the 
expression to the left and right of that operator as the left and right 
children of the root.  That's what it looks like it's doing to me, that is.

But that means that for "a - b + c" the "-" will be chosen as the root, 
resulting in this tree:

         '-'
        /   \
       a    '+'
           /   \
          b     c

That's treating those operators as *right-associative*, though, not 
left-associative; it's implicitly parenthesizing the expression as "a - (b + 
c)".  The left-associative interpretation of "a - b + c" is "(a - b) + c", 
of course, which usually evaluates to a different value, and would have this 
tree:

            '+'
           /   \
         '-'    c
        /   \
       a     b

It looks to me like the code should just scan in the opposite direction - it 
should scan from right to left for the lowest-precedence operator, for 
left-associative binary operators.  (Or equivalently, find the last 
lowest-precedence operator, instead of the first, in a left-to-right scan.) 
  And right-associative operators would do the opposite, of course.

But have I missed something?  I've looked at this quite a bit, and I can't 
see how I'm misunderstanding your code, if I am.

Thanks muchly in advance for any enlightenment,

Alan

--------------------------
Alan Rooks
ON Semiconductor
Medical Division
Waterloo, Ontario, Canada
http://www.onsemi.com/
--------------------------

On Jan 19 2010 Sam Harwell wrote:
> Several expression parsers are limited to handling the binary operator
> portion of the expression. In addition to the obvious limitations, it
> poses an additional problem for languages like C++ where the assignment
> operators are split (in precedence) from the rest of the binary
> operators by the ternary operator (?:). My most complicated production
> ANTLR grammar (parses the UnrealScript language) currently uses a
> completely new expression parser that offers a great deal more
> flexibility than the previous approaches I tried. I don't think it's the
> end-all solution for integrating expression parsing into ANTLR for v4,
> but I believe it's a worthwhile example to show what's possible. Here
> are some pros and cons of the implementation:
> 
> Pros:
> 
> * The source code declaring the operator precedence and
>   associativity is very clean (see reference to
>   UnrealScriptParserHelper.cs below)
> 
> * Very fast execution
> 
> * Supports a great deal more operations than simply binary
>   operators
> 
> * Supports operator precedence and associativity in groups
> 
> * Directly supports changing the token type during AST
>   generation - for example if the token '-' is named MINUS, you could
>   produce an AST with AST_SUBTRACT when it appears as a binary operator
>   and AST_NEGATE when it appears as a unary operator.
> 
> Cons:
> 
> * Not currently integrated into the ANTLR language (executes in code)
> 
> * No compile-time detection of ambiguous operator rules
> 
> * Not implemented as fully as is possible
> 
> General idea: Parse every component of an expression into a list - this
> includes all operators and "atoms". The list is then passed to a
> "precedence processor" to produce a tree for that expression.
>  
> Operator categories: This parser was built with the following categories
> in mind, but the grouping operators are not implemented at this point.
> With this as a starting place, it's clear how the list might be expanded
> in the future:
> 
> * Unary operator: either prefix or postfix
> 
> * Binary operator
> 
> * Ternary operator
> 
> * Grouping operator: for example, the ( and ) in (expression)
> 
> * Postfix grouping operator: for example, the ( and ) in
>   methodName(args) or the [ and ] in var[index].
> 
> * Prefix grouping operator: for example, the ( and ) in
>   (TargetType)objectToCast.
> 
> Attached is:
> 
> * UnrealScriptParserHelper.cs: The complete code for declaring a
>   working precedence parser for UnrealScript.
> 
> * Antlr.Runtime.Expressions.zip: The current implementation of
>   this feature.
> 
> I'm very interested in any feedback y'all may have on this.
> 
> Thank you,
> 
> Sam Harwell