[antlr-interest] ANTLR 3.0.1: invalid character column in a mismatch character error message.

Thu Aug 14 15:46:35 PDT 2008

Hi!

On Aug 14, 2008, at 11:43 PM, Gavin Lambert wrote:

> At 13:05 14/08/2008, Kay Röpke wrote:
> >I don't get tired maintaining that handling all this should
> >be in user code, not in the default generated code.
> >It's not hard to add, not everyone needs it, and in the face of
> >syntax errors speed is not the primary concern anymore - so
> >it's not a huge problem going back and doing the extra bit of
> >computation to format an error message.
>
> You're right, it's not hard to do -- provided you're still in the  
> lexer.  Once you get out of the lexer though it could be trickier,  
> depending on where you're getting your character stream from (it  
> might no longer be accessible).

But then you need to get the tokens' text from somewhere, too, so I  
suppose it would copy it into the token proper. Generating an error  
message would then scan back in the token stream to expand tabs (again  
assuming you kept the whitespace tokens, which you need to anyway for  
this use case) and print out a hopefully relevant portion of it.

> But my main point in trying to get it as a standard implementation  
> is that the default error messages are currently suffering from it,  
> and implementing it "properly" should help to avoid some newbie- 
> confusion.  Why force everyone to go look up how to add this sort of  
> functionality to their code when almost *everyone* is going to need  
> it at least once?

I see that point and I concur in a way. The more I think about it, the  
more I believe we are arguing the same thing, see below.

> >Another thing: How often do your recognizers communicate with
> >other tools via text messages? (i.e. via printing line/col info
> >so that the other tool has to parse your output)
> >Mine most often are compiled into the application that needs a
> >parser, and thus I'm primarily interested in the actual character
> >index into the text buffer, not the column it is displayed in.
> >Might just be me, but I guess not many people are writing command
> >line tools that get invoked and need to communicate line/col info
> >that way.
>
> Perhaps this is our fundamental point of difference then.  I've  
> written a couple of grammars that work the way you describe, but by  
> far the most grammars I've written are used as little standalone  
> mini-compilers, in order to turn a DSL into code compilable by some  
> other host language (generally one of C/C++/C#).  These tend to get  
> integrated into the build script as just another compilation step,  
> so the only stuff visible to the outside are the generated code  
> files themselves and whatever error messages get printed to the  
> console.  Hence why I really want those messages to be *right* :)

I totally see your point. However, what would you intend to report as  
a column number in that case? Assume 8 char wide tabs? (I'm asking  
this out of interest, not spite.)
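
To make the question concrete, here is a minimal Python sketch of how a display column could be derived from a character offset under an assumed tab width. The function name `visual_column` and the default width of 8 are assumptions for illustration only, not anything ANTLR provides.

```python
def visual_column(line_text, char_position_in_line, tab_width=8):
    """Return the 0-based display column for a 0-based character offset.

    tab_width is an assumption; the reader's editor may use another value.
    """
    column = 0
    for ch in line_text[:char_position_in_line]:
        if ch == "\t":
            # A tab advances to the next tab stop.
            column += tab_width - (column % tab_width)
        else:
            column += 1
    return column
```

The same character offset maps to different columns depending on the tab width chosen, which is exactly the ambiguity being asked about.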

> >P.S.: Did anyone else notice that in the time we've discussing
> >this, everyone of us could've written the code and supplied to
> >interested parties? ;)
>
> Yes :)  And I'd be happy to do just that, if I could be assured that  
> it would end up as a standard part of the runtime.  I *really* don't  
> think it should be a separate addon.

Ok, I can't give you assurance of course, but I'd like to improve at  
least ANTLR's Tool messages (probably not changing  
charPositionInLine to mean column, but presenting it a bit differently  
so it's clear). The main goal would be to print the offending place  
with a bit of context in a nice manner.

Presumably we could conjure up an implementation that would make sense  
to optionally be used as a recognizer superClass, then if someone  
wants it they can just pop that into their grammar. That adds no space/ 
time cost to generated code at all, while still providing the feature  
"out of the box", meaning it's extremely easy to enable.
I'm not sure how many people actually use that option vs. delegating  
stuff out via a pointer or two. I tend to use superClass and  
implement all sorts of goodies in there (like error reporting hooks  
and all the members for the recognizer - I don't actually use @members  
{} since it's not editable with my IDE that way). It would be interesting  
to know how other people handle that.

Again, my vote is strongly for keeping charPositionInLine as _the_  
definitive source of horizontal locations, because that's the most  
useful thing for anything other than printing (esp. when communicating  
with other tools, since it's really the logical character offset, not  
meddled with in any way). I believe tools that rely on expanded tabs  
are broken, but I guess that's just part of my sucky life ;)
I'm strongly against adding any memory footprint to CommonToken, as  
that's the one most people use. It might sound silly to argue about  
adding a couple of bytes, but it really does make a difference for  
large input text. This is a question of trading a little time for  
potentially large amounts of memory and I think it's worth the effort.

To spell out the idea once more (in case it wasn't clear enough, can't  
remember): In the face of an error in the input, the time spent  
reporting the error nicely is most likely very small compared to the time  
spent on valid input, which would most likely be passed to other code  
for further processing. Scanning a few tokens back (until you hit the  
beginning of a line) and expanding tabs as you go along is fast enough  
to be ok for the case where you can't do anything else with the input  
anyway. It breaks down in the degenerate case where you have only one  
immensely long line, but that's easy to deal with, too, because  
chances are that you don't want to print the entire line anyway. Just  
cap it at a couple of tokens and all should be well.
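
As a rough sketch of that scan, assuming whitespace tokens are kept in the stream and each token carries its own text: the `Token` type, the function name `line_prefix`, and the `max_tokens` cap below are hypothetical stand-ins written in Python, not ANTLR runtime API.

```python
from collections import namedtuple

# Hypothetical stand-in for a token; only its text is needed here.
Token = namedtuple("Token", ["text"])

def line_prefix(tokens, error_index, max_tokens=20, tab_width=8):
    """Rebuild the text that precedes tokens[error_index] on its line.

    Walks backwards until a token containing a newline is found or
    max_tokens pieces were collected, then expands tabs.
    """
    pieces = []
    i = error_index - 1
    while i >= 0 and len(pieces) < max_tokens:
        text = tokens[i].text
        if "\n" in text:
            # Line start found: keep only what follows the last newline.
            pieces.append(text.rsplit("\n", 1)[1])
            break
        pieces.append(text)
        i -= 1
    pieces.reverse()
    # If the cap was hit, tab stops are relative to the truncated text,
    # so the expansion is only approximate in that case.
    return "".join(pieces).expandtabs(tab_width)
```

The cap covers the degenerate single-long-line case: the loop stops after `max_tokens` pieces even if no line start was seen.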

On terminology, in case I wasn't clear about it: I say that this  
implementation is in "user code" because it is not generated by ANTLR;  
it does not add any complexity to the generated code, neither time-  
nor spacewise, as opposed to implementing columns in CommonToken (which  
would require more memory and at least one more parameter to specify,  
namely the tab width). It would also confuse people that have a  
different tab width setting from what ANTLR (or the grammar author)  
uses, so I don't buy the "it's easier to read"-part yet.

The reason I wouldn't like it to be the default in BaseRecognizer (or  
whatever overrides displayRecognitionError) is that we don't know the  
token type to look for. This is another thing that needs to be  
configured by the developer.
OTOH I suppose we could be mean and expand \t in _every_ token we  
encounter, by getting its text.
Hmm, I have to think about that, but it's probably the best way of  
handling it. Then my only objection to adding it to BaseRecognizer would  
be the "confusion" part when printing the column, but that should be  
easy to solve by just not printing the column :)
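
A minimal sketch of such a report in Python, showing the offending line with a marker under the error position and no column number. The function name `format_error`, the message layout, and the tab width of 8 are assumptions for illustration, not existing ANTLR behaviour.

```python
def format_error(line_number, line_text, char_position_in_line, message,
                 tab_width=8):
    """Format an error with the source line and a caret under the error.

    char_position_in_line is the raw 0-based character offset; tabs are
    expanded for display only, and no column number is printed.
    """
    shown = line_text.expandtabs(tab_width)
    # Expanding only the prefix yields the caret's display offset.
    prefix = line_text[:char_position_in_line].expandtabs(tab_width)
    return "line %d: %s\n%s\n%s^" % (
        line_number, message, shown, " " * len(prefix))
```

Because the caret is positioned against the same expanded line that is printed, the reader's own tab setting no longer matters for locating the error.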

cheers,
-k
-- 
Kay Röpke
http://classdump.org/