[antlr-interest] rules for re-factoring grammars?

Sat Mar 22 08:27:28 PDT 2008

I wrote the following last night and now I'm thinking it dissevers a bit 
more explanation.

I want to build a program that will take in an arbitrary grammar with 
attached actions and then re factor it to remove left recursion and also 
re factor the actions so that the end product finds the same parse tree 
and performs the same action as the original. (Clearly I need to find 
the parse tree in a different form than the first grammar does and then 
build the real tree from what I do find.) I want this program to operate 
on the widest possible range of inputs. This, I hope, will include 
ambiguous and redundant grammars. I also want to be able to handle 
different parsing strategies for handeling the ambiguities; greedy in 
each rule, recursive decent with first match wins, and such. That second 
one (first match) is paramount. One of my specifically targeted inputs 
uses it. The same thing also to some extent applies to the end product. 
I want to be able to generate grammars that can be fed into a number of 
different systems.

The replies I have gotten so far have convinced me that for the cases I 
have looked at so far, the transformation is valid in ANTLR 
(particularly if actions are not involved). I'm not convinced it's valid 
in all cases in my situation (I'll be playing with the constraints on 
that one with pen 'n paper later).

So taking the above in mind, this is what I wrote last night and am to 
lazy to totally re wright:

Gavin Lambert wrote:
> At 13:12 22/03/2008, Benjamin Shropshire wrote:
> >then parse an H out of "FBCG" (assuming the needed actions to
> >rebuild the left recursive tree)
> >
> >old:    FBCG -> h(a(F), i(B), i(C),G);
> >new:   FBCG -> h(a(a(a(F),B),C),G);
> >
> >Am I missing something here?
>
> Well, I'm not sure why you seem to think the "new" parse tree is a 
> product of removing the left recursion.

I'm assuming action code that takes the left most term and the list of 
right side terms and re creates the action that would otherwise be done.

>   The original recursive rule you specified will produce the "new" 
> tree just as easily as the new rule (actually more easily, since 
> that's not really the way the new rule works).  The only difference in 
> fact that I can see between the two trees is whether they're matching 
> greedily or not.  (The "old" tree is non-greedy while the "new" tree 
> is greedy.

that is a relevant difference in my case.

>   And ANTLR always matches greedily, so both rules should have netted 
> you the "new" tree, assuming ANTLR didn't get stuck on the left 
> recursion.)

I'll have to think about that but I might be able to come up with a case 
where it still parses differently (different parse tree).

>
> But yes, generally speaking any change in the grammar can lead to a 
> change in the parse tree, even if it's not changing the input language 
> structure.  This is why a parse tree is usually not an especially 
> useful output (except possibly for initial debugging), and why you 
> should either use direct actions or output an AST instead.  This is 
> because the parse tree directly reflects the grammar structure, which 
> is not necessarily a good match to the language structure.

In my case what I'm trying to do is factor out the left recursion of a 
grammar that has attached action. I want to build a grammar that has the 
same effect as the the original but can be used in an LL parser.

>
> Ignoring those new rules for the moment:
>
> old : FBCD -> a(a(a(F), B), C), a(D);
> new : FBCD -> a(F, B, C), a(D);
>
> That's the *real* difference in the parse tree between the two -- the 
> first requires recursive invocation while the second does not.  They 
> still match the same language though.
>
> >H : A I* G;
> >
> >I : B | C;
>
> If you do add those new rules in, then you've introduced an ambiguity 
> that can only be resolved by looking ahead far enough to see if 
> there's a G or not.  That's got nothing to do with whether there's 
> left recursion or not.

I'm not following why that's relevant. Please explain more (you might 
want to read the rest first though)

> But I'm not really sure why you'd want to introduce I anyway, since "A 
> I*" is exactly covered by "A" -- the matching input sets for the two 
> rules are completely identical.  In other words, it's not just 
> ambiguous, it's actually redundant.

I'm trying to come up with a case where the proposed transformation 
fails. Because I will have no control over the grammars I am fed I need 
to handle all grammars even "stupid" ones without to much analysis.