[antlr-interest] ANTLR 3.0.1: invalid character column in a mismatch character error message.

Kay Röpke kroepke at classdump.org
Wed Aug 13 14:38:59 PDT 2008


Hi!

On Aug 13, 2008, at 10:48 PM, Gavin Lambert wrote:

> [Ok, I'm mostly responding to Kay here, but I had to do it  
> indirectly since I didn't get the original message.]

It was accidentally sent off-list.

> At 08:22 14/08/2008, Foust wrote:
> >>Kay Röpke wrote:
> >> I'm just saying that adding a column and the tab-width handling
> >> doesn't make that much sense, because it's not something you
> >> generally need. If you do need it, it's almost trivial to add.
>
> You need it to produce any kind of useful error message when the  
> input file contains tabs.  I guess you could work around this by  
> pre-converting all tabs to spaces before passing it to ANTLR, but that's  
> effectively a whole 'nother lexing step, which seems like a waste.   
> And the error message would *still* be misleading, since it reports  
> the zero-based character offset as if it were a one-based column  
> number.

The trick is to not expand tabs at all, but work with \t == 1 char as  
long as possible, I think. When the time comes to output it, then yes,  
you should simply take an arbitrary tabwidth and print that. Then you  
know the exact column (and most tools punt here, and simply replace \t  
with n spaces).
As an aside: neither gcc nor javac gives you the column number ;)
The 0 vs 1 debate is moot, because in your error reporting functions  
you can simply +1 that, so I don't buy that argument.
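
To make that concrete, here is a minimal sketch of turning a zero-based
character index into a one-based display column at reporting time. The
class name `DisplayColumn`, the method `of`, and the tab-stop rule (a tab
advances to the next multiple of the chosen width) are assumptions made
for illustration; none of this is part of ANTLR.

```java
final class DisplayColumn {
    private DisplayColumn() {}

    /**
     * Returns the one-based display column of the character at the
     * zero-based index charPos within line. A tab advances to the next
     * multiple of tabWidth (an assumed, arbitrary width); every other
     * character is counted as one column.
     */
    static int of(String line, int charPos, int tabWidth) {
        int col = 0; // zero-based display column reached so far
        int end = Math.max(0, Math.min(charPos, line.length()));
        for (int i = 0; i < end; i++) {
            if (line.charAt(i) == '\t') {
                col += tabWidth - (col % tabWidth); // jump to the next tab stop
            } else {
                col++;
            }
        }
        return col + 1; // the "+1" only happens here, in the reporting layer
    }
}
```

The stored position stays a plain index with \t == 1 char; only the error
reporting function needs to know about a tab width.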

> >> If I talk about column 1, then yes, I mean the first character.
> >> I'm human after all.
> >> But when I see charPosInLine, I think index (in c-speak).
>
> That's fine, if you're dealing with the object model.  But often  
> you're not -- the token attribute, for example, is simply called  
> '$X.position', which could be read either way.  And the error  
> messages simply dump the charPosInLine *as if it were a column*.   
> _That_ is what I object to, not the zero-based-ness of the  
> charPosInLine (I agree that this makes the most sense).

Yes, I think the presentation can and should be improved. Working on a  
proof-of-concept in a low priority thread :)

> >> Note: I'm not talking about solving the tab problem, but
> >> displaying a short portion of the input (whether charstream
> >> or tokenstream) with an indicator where the offending
> >> char/token was. That should make it easy to find the error,
> >> even if we can't provide column-accurate position
> >> info out of the box.
>
> While I think this is an excellent idea... how exactly are you going  
> to position the indicator if you don't know the column position?   
> You can't rely on outputting tabs for positioning because the tabs  
> in the input stream and the tabs on the console/output stream may  
> not have the same width.

Expand the tabs to an arbitrary width and punt if the terminal tries  
to be smart about tabs (which I think none does). It's just like you said.
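
A rough sketch of what I mean, assuming a fixed tab width chosen by the
reporting code. `ErrorMarker`, `expandTabs` and `render` are hypothetical
names for illustration, not ANTLR API. Both the echoed source line and the
marker line go through the same expansion, so they stay aligned regardless
of the terminal's tab stops.

```java
final class ErrorMarker {
    private ErrorMarker() {}

    /** Replaces each tab with spaces up to the next multiple of tabWidth. */
    static String expandTabs(String text, int tabWidth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\t') {
                int pad = tabWidth - (out.length() % tabWidth);
                for (int k = 0; k < pad; k++) {
                    out.append(' ');
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Returns the tab-expanded line followed by a second line holding a
     * '^' under the character at the zero-based index charPos.
     */
    static String render(String line, int charPos, int tabWidth) {
        int end = Math.max(0, Math.min(charPos, line.length()));
        // Width of everything before the offending character, after expansion.
        int caretCol = expandTabs(line.substring(0, end), tabWidth).length();
        StringBuilder marker = new StringBuilder();
        for (int k = 0; k < caretCol; k++) {
            marker.append(' ');
        }
        marker.append('^');
        return expandTabs(line, tabWidth) + "\n" + marker;
    }
}
```

For instance, `ErrorMarker.render("\tint x", 1, 4)` yields the line with
four leading spaces and a caret under the `i`.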

> And I *still* haven't heard a convincing argument for why column  
> tracking can't be implemented correctly out of the box, at least for  
> input sources that use constant-spacing tabs (which is probably at  
> least 90% of cases).  The extra per-token overhead seems trivial and  
> it'd be much simpler to track the column as it's parsed rather than  
> after the fact.

I agree, the information is there if you assume a constant-spacing  
tab. But I'm arguing that including this feature wouldn't benefit that  
many people, because when you _are_ dealing with the object model of  
text storages, then you don't care about the display column: the UI  
layer handles that for you. What you want to know is the _character_  
position of the (possibly multi-byte) character.

> >Yes. You're right. Cut to the chase and just give the offending
> >input, rather than make the user go search for it.
>
> You still need to give line/column information, so that IDEs can  
> jump straight to the location of the error themselves.  (I'm  
> assuming here that the IDE is separate from ANTLR and can't access  
> its internal structures -- and most IDEs expect errors to have a  
> line:column format.)


As I said, gcc and javac don't even give you the column in error  
messages (though javac does almost exactly what I describe, see below  
for an example). IDEs would deal with a character position indexing  
into a text storage, so they wouldn't actually want the display column  
because that's an added burden.

javac:
classdump:tmp kroepke$ javac Test.java
Test.java:3: ')' expected
	public static void main(String[] foo {}) {
                                              ^
Test.java:6: ';' expected
}
^
2 errors

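Note that the `public` has a tab in front of it. javac assumes that  
tab is 8 chars wide, but only in its marker line; the source line is  
echoed with the raw tab. With the terminal's tab stops set to 4, the  
marker no longer lines up (the tab below has been replaced with spaces,  
since I can't influence your tab width):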

classdump:tmp kroepke$ tabs -4
classdump:tmp kroepke$ javac Test.java
Test.java:3: ')' expected
     public static void main(String[] foo {}) {
                                              ^
Test.java:6: ';' expected
}
^
2 errors

Who's in charge of javac around here!?

cheers,
-k (who's going to track that javac dude down :P)

P.S.: Kudos to the clang people of LLVM, who got it right: they  
correctly expand both tabs...
-- 
Kay Röpke
http://classdump.org/







