[antlr-interest] Error nodes created upon syntax error

Fri Jan 11 14:45:35 PST 2008

I'll throw in my 2 cents.  Back when IBM was going to conquer the world with PL/I, there was a compiler ("PL/C") for the language out of Cornell which had the "nifty" feature that it did spell checking on keywords.  If you typed "ptu", the compiler would interpret this as "put".  Somehow, this never made it into the mainstream--real programs have too many symbols that might be mangled by a spell-checker.  The real problem is that errors in programs epitomize Murphy's Law--whatever can go wrong will go wrong--and there is no magic solution for error handling.  Most errors are errors of detail, and there are often few or no error patterns that are easily recognized by a digital machine.  Human pattern recognition, on the other hand, is very good--especially if hints as to where the problem lies can be passed along.

The upshot of this is that there are three elements of an error-handling system:
    1.)  Resynchronization, usually via (multiple) token deletion.
    2.)  Recording the error state.
    3.)  Intelligible error reporting.

Resynchronization has to be done at the grammar level--only the developer can identify the points in the grammar that lend themselves to resynchronization.  Recording mechanisms depend somewhat on further processing steps but also on the choice of resynchronization points.  Intelligible error reporting is both an art and highly error-prone:  rarely does the developer understand the user's perspective on errors well enough to give the right hints.  "The Devil is in the details!"
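
To make the three elements concrete, here is a minimal sketch in plain Java.  It does not use any ANTLR classes; Tok, ErrorRecord, and recover are names made up for illustration, and the choice of ';' as the synchronization token is an assumption that a real grammar author would have to make per rule.

```java
import java.util.ArrayList;
import java.util.List;

final class PanicModeSketch {
    /** Hypothetical token: text plus source position. */
    static final class Tok {
        final String text; final int line; final int col;
        Tok(String text, int line, int col) {
            this.text = text; this.line = line; this.col = col;
        }
    }

    /** Element 2: the recorded error state. */
    static final class ErrorRecord {
        final Tok at; final String expected; final List<Tok> skipped;
        ErrorRecord(Tok at, String expected, List<Tok> skipped) {
            this.at = at; this.expected = expected; this.skipped = skipped;
        }

        /** Element 3: a message written for the user, with a location hint. */
        String report() {
            return "line " + at.line + ":" + at.col + ": expected " + expected
                 + " but found '" + at.text + "' (skipped " + skipped.size()
                 + " token(s) while resynchronizing)";
        }
    }

    /**
     * Element 1: resynchronize by deleting tokens up to the next sync token.
     * pos must index an existing token. Returns the index just past the
     * sync token (or the end of input if no sync token is found).
     */
    static int recover(List<Tok> toks, int pos, String expected,
                       String sync, List<ErrorRecord> errors) {
        List<Tok> skipped = new ArrayList<>();
        int i = pos;
        while (i < toks.size() && !toks.get(i).text.equals(sync)) {
            skipped.add(toks.get(i));
            i++;
        }
        errors.add(new ErrorRecord(toks.get(pos), expected, skipped));
        return Math.min(i + 1, toks.size());
    }

    public static void main(String[] args) {
        // Input: "var 1 2;" on line 1, "var x;" on line 2.
        List<Tok> toks = List.of(
            new Tok("var", 1, 0), new Tok("1", 1, 4), new Tok("2", 1, 6),
            new Tok(";", 1, 7), new Tok("var", 2, 0), new Tok("x", 2, 4),
            new Tok(";", 2, 5));
        List<ErrorRecord> errors = new ArrayList<>();
        int next = recover(toks, 1, "ID", ";", errors); // parser wanted ID at index 1
        System.out.println(errors.get(0).report());
        System.out.println("resume at token index " + next);
    }
}
```

The record keeps the skipped tokens so that later processing steps (or a better error message) can still see what was thrown away.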

That said, I suspect that one of the tricks that could simplify ANTLR error handling is to take advantage of hidden "channels"--newlines, comments, and other forms of whitespace are often good hints for synchronization.
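
A rough sketch of what I mean, again in plain Java with invented names (Tok, HIDDEN, resyncAtNewline) and invented channel numbers.  It assumes the lexer keeps newlines in the token list on a hidden channel rather than discarding them.  In ANTLR the parser normally sees only the on-channel tokens, so real recovery code would have to look at the unfiltered token buffer to do something like this.

```java
import java.util.List;

final class HiddenChannelSync {
    // Hypothetical channel numbers, chosen only for this sketch.
    static final int DEFAULT = 0;
    static final int HIDDEN = 1;

    /** Hypothetical token: text plus the channel it was emitted on. */
    static final class Tok {
        final String text; final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    /**
     * Scan forward from pos for a newline on the hidden channel and return
     * the index of the token after it. Returns toks.size() if there is none.
     */
    static int resyncAtNewline(List<Tok> toks, int pos) {
        for (int i = pos; i < toks.size(); i++) {
            Tok t = toks.get(i);
            if (t.channel == HIDDEN && t.text.equals("\n")) {
                return i + 1;
            }
        }
        return toks.size();
    }

    public static void main(String[] args) {
        // Input: "x = = 1" on the first line, "y = 2" on the second.
        List<Tok> toks = List.of(
            new Tok("x", DEFAULT), new Tok("=", DEFAULT), new Tok("=", DEFAULT),
            new Tok("1", DEFAULT), new Tok("\n", HIDDEN),
            new Tok("y", DEFAULT), new Tok("=", DEFAULT), new Tok("2", DEFAULT));
        int next = resyncAtNewline(toks, 2); // error detected at the second '='
        System.out.println("resume at '" + toks.get(next).text + "'");
    }
}
```

Whether a newline is a sensible place to resume depends entirely on the language being parsed; for line-oriented languages it is often a good guess, for free-form ones much less so.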

--Loring

----- Original Message ----
> From: Terence Parr <parrt at antlr.org>
> To: Alessandro <alessnet at gmail.com>
> Cc: antlr-interest Interest <antlr-interest at antlr.org>
> Sent: Friday, January 11, 2008 11:18:42 AM
> Subject: Re: [antlr-interest] Error nodes created upon syntax error
> 
> Hi Alessandro. Thanks for the suggestion.  Yes, I've been thinking
> about this problem and it is even more general.  What do you do about  
> actions that must execute after recovery even though they refer to a  
> token that does not exist?!
> 
> The unfortunate truth comes down to the following: single token
> insertion and deletion recovery within an alternative is really sexy
> for journal papers, but I believe I've convinced myself that it is
> not practical.  Well, at least in the presence of actions.
> 
> The simple solution is to turn this off, relying on a normal "exit
> rule upon syntax error" mechanism but leave the insertion and
> deletion mechanism as an option by overriding methods.
> 
> Ter
> On Jan 6, 2008, at 8:48 AM, Alessandro wrote:
> 
> > Hello,
> > (sorry for my bad english)
> >
> > I can see that there is a problem with token deletion/insertion if you
> > are also building trees.
> > Take this rule, for example:
> >
> > test    :    'var' ID ';'    -> ^('var' ID);
> >
> > If the input is "var ;", the token insertion system detects that the
> > token "ID" is missing, then reports the error, but continues parsing.
> >
> > If you look closely at the generated code, you will see:
> >
> > -----
> > ID2=(Token)input.LT(1); // save ID2
> > match(input,ID,FOLLOW_ID_in_test26);
> > stream_ID.add(ID2); // ID2 holds a bad reference
> > ----
> >
> > ID2 contains a reference to the token ';' and not to the token ID. The
> > "match" procedure doesn't throw any exception because of the "token
> > insertion" system.
> >
> > So the resulting tree will in reality be ^( 'var' ';'), which is
> > totally incorrect. Am I right?
> >
> > If I want to use the "token deletion/insertion" system with tree
> > building, can I modify the "match" procedure in order to modify, for
> > instance, the content of "ID2" (without altering the reference)?
> >
> > I imagined a workaround. ( LA(i) is the token at the index
> > current_pos+i in the stream ).
> >
> > If there is a token insertion, do this in the "match" procedure:
> >
> > 1. Add the "special" imaginary token (matching the missing token) to
> > the stream at position LA(2) (the position is wrong at this point). The
> > stream must allow token insertion.
> > 2. Swap (contents and not references) LA(1) and LA(2). (You have to
> > correct the index information.)
> > 3. ID2 still has a reference to LA(1), but the content of the token is
> > now the "special imaginary ID token".
> >
> >
> > If there is a token deletion, do this in the "match" procedure:
> >
> > 1. Save the content of LA(1) to a temporary variable: temp_var.
> > 2. Copy the content of LA(2) into LA(1).
> > 3. Copy the content of temp_var into LA(2).
> > 4. Swap (references only) LA(1) and LA(2).
> > 5. ID2 has a reference to the **OLD** LA(1), which is now LA(2).
> >
> >
> > In the CommonTreeAdaptor.create procedure:
> >
> > 1. If the token is a "special" imaginary token: return an ERROR node
> > (like Terence's proposal).
> > 2. Else: create a node as usual.
> >
> > What do you think about this (untested) workaround?
> >
> > The best solution, I think, is for the "match" procedure to return the
> > reference to the real matched token.
> >
> > On Dec 2, 2007 8:24 PM, Terence Parr  wrote:
> >> hi,
> >>
> >> Currently syntax errors cause invalid trees and possibly even runtime
> >> exceptions when building ASTs. What we really need I believe is to
> >> have rules that encounter syntax errors return an ERROR node of some
> >> sort that records where the error occurred and, with luck, the tokens
> >> consumed during recovery. I started an improvement request:
> >>
> >> http://www.antlr.org:8888/browse/ANTLR-193
> >>
> >> The basic idea is that ERROR nodes get used in place of ASTs that
> >> would normally be produced by rule invocations.  For example, the
> >> following rule would return a valid AST except for the subtrees
> >> associated with rule refs encountering syntax errors:
> >>
> >> forDecl : 'for' '(' decl ';' expr ';' expr ')' stat -> ... ;
> >>
> >> If there is an error inside decl, the tree would return
> >>
> >> ^('for' ERROR subtree-expr subtree-expr)
> >>
> >> This effectively means that I must turn off the single token
> >> insertion and deletion that occurs automatically within a single
> >> rule.  If a syntax error occurs, the immediately surrounding rule
> >> must terminate and return an error node.
> >>
> >> Does this make sense? I would like to stick this into 3.1 release.
> >>
> >> Ter
> >>
> 
> 
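
For what it's worth, Alessandro's last suggestion in the quoted message (have "match" return the token it really matched) can be sketched in a few lines of plain Java.  Nothing here is ANTLR's API: Tok, match, and the "missing" flag are invented for illustration, the recovery is deliberately naive (it always pretends the expected token was there, without checking what follows), and the tree is just a string.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchReturnsToken {
    static final int VAR = 1, ID = 2, SEMI = 3; // hypothetical token types

    /** Hypothetical token; missing == true marks a synthesized placeholder. */
    static final class Tok {
        final int type; final String text; final boolean missing;
        Tok(int type, String text, boolean missing) {
            this.type = type; this.text = text; this.missing = missing;
        }
    }

    final List<Tok> input;
    final List<String> errors = new ArrayList<>();
    int pos = 0;

    MatchReturnsToken(List<Tok> input) { this.input = input; }

    /**
     * Returns the token actually matched. If the expected token is absent,
     * records an error and returns a placeholder without consuming input,
     * so the caller never ends up holding a reference to the wrong token.
     */
    Tok match(int type, String name) {
        if (pos < input.size() && input.get(pos).type == type) {
            return input.get(pos++);
        }
        errors.add("missing " + name + " at token index " + pos);
        return new Tok(type, "<missing " + name + ">", true);
    }

    /** Hand-written equivalent of: test : 'var' ID ';' -> ^('var' ID) ; */
    String test() {
        Tok var = match(VAR, "'var'");
        Tok id = match(ID, "ID");
        match(SEMI, "';'");
        // A placeholder token becomes an ERROR node instead of a bogus child.
        String child = id.missing ? "ERROR" : id.text;
        return "(" + var.text + " " + child + ")";
    }

    public static void main(String[] args) {
        MatchReturnsToken p = new MatchReturnsToken(List.of(
            new Tok(VAR, "var", false), new Tok(SEMI, ";", false)));
        System.out.println(p.test());
        System.out.println(p.errors);
    }
}
```

Because the rule uses the value returned by match rather than a token saved beforehand, the ';' can no longer leak into the tree in place of the ID.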
