[antlr-interest] ANTLR 3.0.1: invalid character column in a mismatch character error message.

Wed Aug 13 13:22:08 PDT 2008

Kay,

>Kay Röpke wrote:
> On Aug 13, 2008, at 12:33 PM, Foust wrote:
> 
> > Kay wrote:
> >> It adds up, as simple as that. The more you store, the greater your
> >> memory footprint is, the more pages it has to touch, the slower it
> >> gets. Especially if you are parsing huge input it makes a noticeable
> >> difference (and in most target languages the footprint of an int is
> >> not 4 or 8 bytes, it's much larger for all those managed languages).
> >
> > We could make that argument for any of Antlr's features, most of which
> > take a lot more processing and memory than just a couple of ints. If
> > speed is the only reason for not providing the intuitive functionality,
> > then there are ways to architect it so that it only gets invoked for
> > users who need the functionality (IoC is one, but there are others).
> 
> Which features are there that take a lot of memory based on the input?
> Apart from line information (which most of the time I don't "really"
> need to have stored, I can re-lex on error to produce the necessary
> information because an error will be expensive anyway), all I can see
> in CommonToken (for Java now) is necessary housekeeping information.
> E.g. the start/stop indices are used to grab the actual text from the
> input stream, rather than copying strings around.
> 
> I'm just saying that adding a column and the tab-width handling
> doesn't make that much sense, because it's not something you generally
> need. If you do need it, it's almost trivial to add.
> 
> >
> > Even in managed languages, an integer field only takes 4 or 8 bytes.
> > (There is no need to allocate an entirely separate object for each
> > character position.)
> 
> That's certainly not the case for reference counted (and weakly typed)
> languages like Perl. I believe Python uses the same approach as Perl
> (opaque structs that store multiple representations of their value
> internally, converting when necessary).
> At least Perl uses way more than 4 bytes per int on a 32bit machine.
> 
> BTW, I meant tokens not characters here. I agree it would be silly to
> wrap each char in an object. But if you have a huge number of short
> tokens, this gets noticeable.
> Of course, for short input it hardly makes any sense to even think
> about it...
> 
> >> I think the runtime should be minimal, because it's much easier to add
> >> functionality than to remove it
> >
> > That's a valid point.
> >
> > And what Antlr already provides (an accurate line and character pos)
> > is already good. But most editors display a 1-based column number. So
> > if the intent is to provide the grammar developer accurate feedback in
> > order to quickly locate the problem, then an accurate 1-based column
> > number should be provided.
> 
> The problem is tab handling, otherwise I'd not argue about 0 vs 1.
> Letting it start at 1 kinda implies that we are telling the "column"
> even when we are not. It's not a column and can't be in the presence
> of tabs. Hence my fierce arguments ;)

I see.

> 
> > When using Vi, it is easy to go to a line#/character offset. But Eclipse
> > (without the Vi plugin) doesn't allow you to move the right n characters.
> > It just displays the column number, which varies with the spaces-per-tab
> > setting. Editor plugins, such as AntlrDT probably already take this into
> > account.
> >
> > But most users probably think that column #1 means the first character,
> > not the 2nd.
> 
> If I talk about column 1, then yes, I mean the first character. I'm
> human after all.
> But when I see charPosInLine, I think index (in c-speak).

Yes. Since vertical tabs are no longer used, the Antlr "line" attribute is
unambiguous and 1-based, but the horizontal coordinate, "charPosInLine", is
0-based (for reasons you've described in detail). Maybe it would have been
clearer with a name like "charIndex".
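
To make the index-versus-column distinction concrete, here is a minimal
sketch in Java of what turning the 0-based index into an editor-style
1-based column would involve. It is not part of the Antlr runtime: the
class name, the method name and the tabWidth parameter are all hypothetical,
and it assumes the caller still has the text of the offending line.

```java
// Hypothetical helper, not an Antlr API: maps a 0-based character index
// within a line to a 1-based display column for a given tab width.
class ColumnSketch {
    static int displayColumn(String line, int charIndex, int tabWidth) {
        int col = 0; // 0-based visual column reached so far
        for (int i = 0; i < charIndex && i < line.length(); i++) {
            if (line.charAt(i) == '\t') {
                // A tab advances to the next multiple of tabWidth.
                col += tabWidth - (col % tabWidth);
            } else {
                col++;
            }
        }
        return col + 1; // editors count columns from 1
    }

    public static void main(String[] args) {
        String line = "\tx = 1;";
        // The same 0-based index (1, the 'x') maps to different columns
        // depending on the editor's tab setting.
        System.out.println(displayColumn(line, 1, 4)); // prints 5
        System.out.println(displayColumn(line, 1, 8)); // prints 9
    }
}
```

The answer depends on a setting the runtime cannot know, which is the tab
problem in a nutshell.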

Nevertheless, the question seems to be one of whether it is worthwhile to
handle tabs as a special case, and I hear you voting, "no."

> Actually, I think we need better error messages in the default
> implementation, then the problem goes away.
> Naturally this won't happen in 3.1 (as it's been released! hooray!),
> but maybe we can cook up something for 3.2.
> Note: I'm not talking about solving the tab problem, but displaying a
> short portion of the input (whether charstream or tokenstream) with an
> indicator where the offending char/token was. That should make it easy
> to find the error, even if we can't provide column-accurate position
> info out of the box.
> That above plan applies to other error messages as well:
> 
> warning(200): MySQL51.g:1610:74: Decision can match input such as
> "MINUS" using multiple alternatives: 1, 2
> As a result, alternative(s) 2 were disabled for that input

Yes. You're right. Cut to the chase and just give the offending input,
rather than make the user go search for it.
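
As a rough illustration of the kind of message being suggested, here is a
minimal sketch in Java that prints the offending line with a marker
underneath it. It is only an assumption of how this might look, not
anything in the Antlr runtime: the class and method names are hypothetical,
and it assumes the whole input is available as a string together with a
1-based line and a 0-based character index.

```java
// Hypothetical helper, not an Antlr API: returns the offending line
// followed by a second line with a '^' under the offending character.
class ErrorContextSketch {
    static String showContext(String input, int line, int charIndex) {
        String[] lines = input.split("\r?\n", -1);
        if (line < 1 || line > lines.length) {
            return ""; // no such line, so there is nothing to show
        }
        String text = lines[line - 1];
        StringBuilder marker = new StringBuilder();
        for (int i = 0; i < charIndex && i < text.length(); i++) {
            // Copy tabs as tabs so the marker lines up with the text
            // whatever tab width the terminal or editor uses.
            marker.append(text.charAt(i) == '\t' ? '\t' : ' ');
        }
        marker.append('^');
        return text + "\n" + marker;
    }

    public static void main(String[] args) {
        String input = "a = 1\n\tb = @2\n";
        System.out.println(showContext(input, 2, 5));
    }
}
```

Because the marker line repeats the tabs of the source line, the indicator
stays aligned without the runtime ever computing a column.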

> 
> I'd like to see more verbose information about which alternatives were
> involved because counting them in the input grammar can be really
> tiring (if the alts are 15,34 for example).

Again, excellent suggestion.

> ANTLRWorks helps here, but sometimes I want to see it in the actual
> output. Shouldn't be hard to add in any case.

AntlrWorks has its issues. It's difficult to rely on it, unless it is being
actively supported. (Are bugs being actively addressed in AntlrWorks?)

I agree with you that more descriptive error messages are needed and would
probably solve most issues without resorting to character counting, anyway.

Brent