[antlr-interest] C target: unhelpful error messages from the default error handler in trivial cases

Wed Jul 20 20:41:29 PDT 2011

On Jul 20, 2011, at 9:53 PM, Jim Idle wrote:

> The standard error handler can only do its best as a generic handler. When
> your grammar is crap, and you feed it crap, then guess what?

Fair enough (I guess), assuming you can elaborate on why my simple grammar example is "crap"?

> 
> Also, your questions are not "with all due respect" at all; you don't
> understand what is going on, but would rather blame the generic error
> handler than your lack of knowledge (which you will improve if you offer a
> little more respect). The recovery mechanisms are the same for C as Java.

I do not deny that there are many things I do not yet understand about ANTLR, but I did spend a couple of days reading through the source code and stepping through scenarios in gdb. I've read most of the chapters from the ANTLR ref book and most of the LIP book. I have also searched the mailing list archives for tips on C targets. I am not a grammar beginner by a long shot. The point I am trying to make is that I would like to *learn* what is going on with ANTLR specifically, and the initial barrier seems somewhat high.

> However, if you spend some more time reading, then you will know to use
> real tokens and not inline 'int' and 'float'.

Are you referring to using something along the lines of

tokens
{
	INT = 'int';
	FLOAT = 'float';
}
? I have tried that as well prior to posting (this is very similar to what you'd do with JavaCC) and it changes nothing. 

OK, let's nevertheless assume that I am missing something obvious and take someone else's grammar. On page 79 of Terence's book he gives an example nearly identical to mine. If I just add language='C' it becomes:

grammar bookexample;

options
{
    language='C';
}

parse:  type ID (',' ID)* ;
type: 'int' | 'float' ;
ID: 'a'..'z'+ ;

When I feed the string "badtype A" to the parse rule, I get an error message with <invalid> in it, just like the one I posted earlier. What gives?

> You would also have read the
> long article on error recovery techniques in the Wiki and then know why
> you are dropping out of the loop. Read the C code a bit and you will see
> where the missing and invalid come from and would say "ahhhhh".

Could you kindly point to the exact article? I see

http://www.antlr.org/wiki/display/ANTLR3/Error+reporting+and+recovery

but that does not seem helpful enough.

> I love people spouting off about how bad things are, when they have made
> no effort to look in to the details. The type of exception is generated by
> the ANTLR analysis and not the C runtime. Or perhaps you have spent all
> the effort you can?

I hope not, even after your sharp response. The things I've referred to above should convince you that I did make an effort to look into details.

> In short then, I cannot know how you want to report errors, so there are a
> bunch of examples of finding out information. But the type of exception
> depends on how you construct your grammar.

What would be helpful is to know why the default error handler acts so strangely for the simple example I gave and how to fix that. What am I doing wrong to cause it to (a) switch between reporting "memory" or "unknown source" for the source name with minor input changes and (b) report "<invalid>" -- ever?
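
For concreteness, the two behaviours asked about here can be modelled in plain C without the ANTLR runtime. This is only a sketch under stated assumptions: token_names, expected_name, source_label, format_error and TOKEN_EOF_TYPE are hypothetical stand-ins, not the runtime's API, and the table contents are made up apart from index 0 mapping to "<invalid>", which is what the quoted report below describes. It shows one way a custom handler might avoid printing "<invalid>" when no single expected token type was recorded, and how the source label depends only on the stream name and the token type.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a generated token-name table. Only index 0
 * mapping to "<invalid>" is taken from the quoted report; the remaining
 * entries are invented for this sketch. */
static const char *const token_names[] = {
    "<invalid>", "NAME", "'int'", "'float'"
};
#define TOKEN_NAME_COUNT (sizeof token_names / sizeof token_names[0])

/* Hypothetical marker for an end-of-input token type (assumption). */
#define TOKEN_EOF_TYPE (-1)

/* Return a printable name for the expected token type, or NULL when no
 * single expected type was recorded (type 0) or the index is out of
 * range, so the caller can fall back to a generic message. */
static const char *expected_name(unsigned expecting)
{
    if (expecting == 0 || expecting >= TOKEN_NAME_COUNT)
        return NULL;
    return token_names[expecting];
}

/* Pick the source label: use the stream name when present, otherwise
 * distinguish end of input from an unknown source. */
static const char *source_label(const char *stream_name, int token_type)
{
    if (stream_name != NULL)
        return stream_name;
    return token_type == TOKEN_EOF_TYPE ? "-end of input-"
                                        : "-unknown source-";
}

/* Format one diagnostic line into buf; returns the snprintf result. */
static int format_error(char *buf, size_t size, const char *stream_name,
                        int token_type, unsigned line, unsigned expecting)
{
    const char *name = expected_name(expecting);
    if (name == NULL)
        return snprintf(buf, size, "%s(%u): syntax error",
                        source_label(stream_name, token_type), line);
    return snprintf(buf, size, "%s(%u): missing %s",
                    source_label(stream_name, token_type), line, name);
}
```

With this model, an expected type of 0 yields a generic "syntax error" line rather than "Missing <invalid>", and a NULL stream name alone is enough to turn the label into "-unknown source-". Whether the real runtime leaves those fields unset in the cases above is exactly the open question.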

> Jim
> 
> 
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Vlad
>> Sent: Wednesday, July 20, 2011 6:50 PM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] C target: unhelpful error messages from the default error handler in trivial cases
>> 
>> Greetings,
>> 
>> Like apparently many new ANTLR users, I've borrowed the implementation
>> from the default displayRecognitionError() to implement my own version.
>> Somewhat unfortunately, this version generates unhelpful/random errors
>> in rather trivial cases. Here is a full example:
>> 
>> grammar testerrors;
>> 
>> options
>> {
>>    language='C';
>> }
>> 
>> NAME    :   ( 'a'..'z' | 'A'..'Z' | '0'..'9' )+ ;
>> WS      :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
>> 
>> parse:
>>    decl ( options { greedy = true; }: ',' decl )* ','? EOF
>>    ;
>> 
>> decl:
>>    NAME ':' type
>>    ;
>> 
>> type:
>>    'int' | 'float'
>>    ;
>> 
>> Feeding "A : badtype" into parse() results in:
>> 
>> -memory-(1)  : error 10 : Unexpected token, at offset 3
>>    near [Index: 0 (Start: 0-Stop: 0) ='<missing <invalid>>', type<0> Line: 1 LinePos:3]
>>     : Missing <invalid>
>> 
>> What puzzles me is where the <invalid> comes from. It would seem easy
>> to compute that either an 'int' or a 'float' token was expected. In the
>> stock error handler this comes from tokenNames[ex->expecting] evaluated
>> for ex->expecting being 0. What change to the default implementation is
>> necessary to make this work correctly?
>> 
>> Similarly, attempting to parse "A :" results in:
>> 
>> -unknown source-(1)  : error 10 : Unexpected token, at offset -1
>>    near [Index: 0 (Start: 0-Stop: 0) ='<missing <invalid>>', type<0> Line: 1 LinePos:1]
>>     : Missing <invalid>
>> 
>> Note how the source became "unknown" and the offset became -1. In the
>> default handler this is determined by "streamName" as follows:
>> 
>> if (ex->streamName == NULL)
>> {
>>     if (((pANTLR3_COMMON_TOKEN)(ex->token))->type == ANTLR3_TOKEN_EOF) {
>>         ANTLR3_FPRINTF(stderr, "-end of input-(");
>>     } else {
>>         ANTLR3_FPRINTF(stderr, "-unknown source-(");
>>     }
>> } else {
>>     ftext = ex->streamName->to8(ex->streamName);
>>     ANTLR3_FPRINTF(stderr, "%s(", ftext->chars);
>> }
>> 
>> and it is frankly unexpected that a slightly different match error type
>> should have this impact, since it does not affect the branches taken
>> here at all (that happens later in the function). Anyone trying to take
>> this function as a blueprint for their own handler would conclude that
>> ex->streamName is NULL in one case but not the other and that it is set
>> somewhere *outside* of displayRecognitionError(): the problem of fixing
>> the default implementation begins to feel like it might snowball into
>> patching the runtime itself.
>> 
>> As the last example, trying to parse "A B" results in:
>> 
>> -memory-(1)  : error 1 : Unexpected token, at offset 1
>>    near [Index: 2 (Start: 15787098-Stop: 15787098) ='B', type<4> Line: 1 LinePos:1]
>>     : syntax error...
>> 
>> The start/stop indices are bogus, i.e. some uninitialized variables --
>> on repeated parses they change randomly.
>> 
>> My second question follows. Good error handling is a big selling point
>> of ANTLR, but with all due respect it hardly seems so for the C target.
>> Is there documentation available for all the context relevant to handling
>> the main mismatch error conditions? I have scanned everything in the
>> available examples download and there are no examples of customizing
>> the error handler that I can find. Alternatively, could someone share a
>> workable version of their own that might be a good learning example?
>> 
>> Thank you,
>> Vlad
>> 
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address