[antlr-interest] bad generated code?

Fri Sep 30 20:04:53 PDT 2005

I get why they have to be lookahead 1.  In my case, I know that all
the places that need lookahead 2 only look at the "same level" of the
tree.  That is, there's not subtrees at those locations.  In general
though, I see why the problem is a difficult one.  There should at
least be a check so that if the code generator is asked to do
lookahead > 1 for a treeparser it generates an exception or error with
the rule that caused the issue.  In my case I was looking all over my
grammar and eventually looked through the source.

On a tangent: I'm not so convinced that treeparsers are actually
working in two dimensions.  In reality, a treeparser simply does an
pre-order traversal of the AST trying to match tokens, right?  There
are many applications in computer science where a "2-dimensional" tree
is turned into linear sequence of "things" (in this case tokens, both
real and imaginary).  XML, for instance is a tree turned into a linear
sequence of tokens with special tokens to indicate where subtrees
start and end.  If I add two tokens, one designating that a new
subtree is beginning (BEGIN) and one designating that the current
subtree is ending (END) then the following rule:

rule1 : #(A #(B C D) E F);

can be translated into

rule2: A BEGIN B BEGIN C D END E F END;

I could create a token generator that does a pre-order traversal of
the tree spitting back tokens to a standard parser that knows about
BEGIN and AND and it's essentially the same thing as a tree parser. 
In this case it shouldn't be tough to do arbitrary lookahead, right? 
Sorry for the over-analysis.  A fellow grad-student here at Davis is a
tree-nut (HAHA) and I must have been around him a little too much
lately.

-- Chris

On 9/30/05, Monty Zukowski <monty at codetransform.com> wrote:
> Tree parsers have to be k=1 because they are actually two dimensional
> and it gets too weird to allow k>1.  Syntactic predicates should
> work, though.
>
> You might even try this with 2.7.4.  This seems pretty basic to get
> wrong.  I wonder if it is a newly introduced bug.
>
> Monty
>
>
> On Sep 30, 2005, at 6:59 PM, Christian Bird wrote:
>
> > In case anyone cares, I tracked down the problem in the source.  Here
> > is the lookahead code generator from JavaCodeGenerator.java:
> >
> >     private String lookaheadString(int k) {
> >         if (grammar instanceof TreeWalkerGrammar) {
> >             return "_t.getType()";
> >         }
> >         return "LA(" + k + ")";
> >     }
> >
> > Note that if it's a tree walker, k isn't examined at all. My solution
> > (though I'm not sure that it's always NullPointerException safe) is
> > this:
> >
> >     private String lookaheadString(int k) {
> >         if (grammar instanceof TreeWalkerGrammar) {
> >             System.out.println("k is " + Integer.toString(k) );
> >             String retStr = "_t";
> >             while (k > 1) {
> >                 retStr += ".getNextSibling()";
> >                 k--;
> >             }
> >             retStr += ".getType()";
> >             return retStr;
> >         }
> >         return "LA(" + k + ")";
> >     }
> >
> > A quick recompile and that seems to do the trick.  It's possible that
> > the check for getNextSibling could run off the end of the list and
> > generate a NullPointerException, but in my grammar I know that the
> > places where lookahead two is required, there's gauranteed to be a
> > next sibling.
> >
> > Terrence,
> > Any chance that this or something similar and safer (I'm not too
> > familiar with the codebase) could make it into 2.7.6?  I know that
> > it's probably not often that a tree parser needs k > 1, but (at least
> > in my case) it can occur.  Thanks.
> >
> > -- Chris
> >
> > On 9/30/05, Christian Bird <cabird at gmail.com> wrote:
> >
> >> That didn't seem to work either.  I tried using a syntactic
> >> predicate:
> >>
> >> name :
> >>         (ID DOT) => complexName
> >>         | (ID ~DOT) => identifier
> >>         ;
> >>
> >> and adding a rule that changes the followset of name:
> >>
> >> aname :
> >>         name SEMI;
> >>
> >> But the code still has issues:
> >>
> >> boolean synPredMatched98 = false;
> >> if (((_t.getType()==ID) && (_t.getType()==SEMI||_t.getType()
> >> ==ARROW))) {
> >>         AST __t98 = _t;
> >>         synPredMatched98 = true;
> >>         inputState.guessing++;
> >>         try {
> >>                 {
> >>                 AST tmp63_AST_in = (AST)_t;
> >>                 match(_t,ID);
> >>                 _t = _t.getNextSibling();
> >>                 AST tmp64_AST_in = (AST)_t;
> >>                 matchNot(_t,DOT);
> >>                 _t = _t.getNextSibling();
> >>                 }
> >>         }
> >>         catch (RecognitionException pe) {
> >>                 synPredMatched98 = false;
> >>         }
> >>         _t = __t98;
> >>         inputState.guessing--;
> >> }
> >> if ( synPredMatched98 ) {
> >>         identifier(_t);
> >>         _t = _retTree;
> >> }
> >> else {
> >>         throw new NoViableAltException(_t);
> >> }
> >>
> >>
> >> Oh well...
> >>
> >> -- Chris
> >>
> >> On 9/30/05, Monty Zukowski <monty at codetransform.com> wrote:
> >>
> >>> I dunno.  Try putting parenthesis around the two alternatives?
> >>>
> >>> Monty
> >>>
> >>> On Sep 30, 2005, at 5:48 PM, Christian Bird wrote:
> >>>
> >>>
> >>>> Good suggestion, but unfortunately the code generated for name is
> >>>> still the same.  I don't understand antlr could ever generate code
> >>>> that looks like:
> >>>>
> >>>> if ((_t.getType()==A) && (_t.getType()==B)) {}
> >>>>
> >>>> When A is not the same as B.  I'm guessing, however, that a
> >>>> treeparser
> >>>> generator is more complicated to write and probably not as often
> >>>> used/tested by antlr users as a normal parser generator (most
> >>>> people
> >>>> I've talked to here at UC Davis only use it for parsers and lexers,
> >>>> not AST traversals).
> >>>>
> >>>> Any other ideas?  I appreciate your taking a look at it.
> >>>>
> >>>> -- Chris
> >>>>
> >>>> On 9/30/05, Monty Zukowski <monty at codetransform.com> wrote:
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>> On Sep 30, 2005, at 4:57 PM, Christian Bird wrote:
> >>>>>
> >>>>>  zimport :
> >>>>>     #("import"
> >>>>>         (name ARROW complexNameList SEMI |
> >>>>>         "all" identifier SEMI) )
> >>>>>     ;
> >>>>> It does seem like a code gen bug.  I would recommend breaking this
> >>>>> up into
> >>>>> another rule if you can:
> >>>>>
> >>>>> zimport: #("import" importSuffix)
> >>>>> importSuffix: name ARROW complexNameList SEMI
> >>>>>                        | "all" identifier SEMI
> >>>>>                        ;
> >>>>>
> >>>>> See if that still triggers the problem.
> >>>>>
> >>>>> Monty
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Christian Bird
> >>>> cabird at gmail.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Christian Bird
> >> cabird at gmail.com
> >>
> >>
> >
> >
> > --
> > Christian Bird
> > cabird at gmail.com
> >
> >
> >
>
>

--
Christian Bird
cabird at gmail.com