[antlr-interest] Strategy for mapping output to line numbers from a tree walker

Fri Aug 21 21:04:29 PDT 2009

At 13:06 22/08/2009, Stanislav Sokorac wrote:
>if (VALUE + a > 0) { echo "hi"; }
>
>where 'VALUE' is a macro that's defined in an include file. Your 
>lexer substituted VALUE with the defined value (say '1.0'), and 
>marked the char stream appropriately. Now, your tree walker comes 
>upon 1.0+a, and say your language doesn't allow addition of 
>reals and integers, so you want to mark/underline the expression 
>"VALUE + a" and say "No adding of reals and integers".

Well, the error itself is anchored on the + (since, after all, 
each operand is ok, it's only when you try to add them that 
there's a problem).  You could probably get by with just flagging 
the + itself as the error and not worrying about where the 
operands came from.  Failing that, you can use the location of the 
+ to decide whether you've found the "right" location for the 
operands or not.

>Now what, how do you underline 'VALUE + a'? I.e. how do you 
>figure out the starting and ending character of your expression 
>in 'main.c'? The user doesn't want to see the VALUE definition in 
>another file underlined as there's nothing wrong with the line of 
>code.

Either don't have your lexer do the substitution (which is 
impractical if this is a C-style preprocessor that can have 
complex replacements), or extend the token definition so that the 
resulting "1.0" token still remembers where its call site (the 
VALUE in main.c) was, and use that location for error reporting.
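
As a rough illustration of the second option, here is a minimal sketch in 
Python (not ANTLR's actual token API). The names Location, Token, 
call_site and underline_span are all made up for this example, and it 
assumes the whole expression sits on one line of one file. Each token 
keeps its physical location plus an optional call site, and error 
reporting prefers the call site when there is one:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Location:
    file: str
    line: int
    start: int  # 0-based column of the first character
    stop: int   # 0-based column one past the last character


@dataclass(frozen=True)
class Token:
    # Hypothetical token type, not ANTLR's CommonToken.
    text: str
    location: Location                    # where the text physically lives
    call_site: Optional[Location] = None  # where the macro was used, if any

    def report_location(self) -> Location:
        # Prefer the place the user actually typed (the macro use).
        return self.call_site or self.location


def underline_span(first: Token, last: Token) -> Tuple[str, int, int, int]:
    # Assumes both tokens report locations on the same line of the same file.
    a, b = first.report_location(), last.report_location()
    return (a.file, a.line, a.start, b.stop)


source_line = 'if (VALUE + a > 0) { echo "hi"; }'

# "1.0" physically comes from a (made-up) defs.h, but was used in main.c.
value = Token("1.0", Location("defs.h", 3, 14, 17),
              call_site=Location("main.c", 1, 4, 9))
a = Token("a", Location("main.c", 1, 12, 13))

span = underline_span(value, a)
underlined = source_line[span[2]:span[3]]
```

With this, the range to underline is computed entirely in main.c, and 
the definition in the include file is never touched.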

>A similar problem occurs if you have a list of statements, and 
>the first (or last) few came from an include file.. if you wanted 
>to show the proper range in the original file, you can't 
>determine the location of the 'include' statement by only 
>examining your "list of statements" tree node and the tokens in 
>it.

You could add an extra value to the token definition and lexer 
members (an "include list").  This starts out empty.  When any 
token is generated, the lexer's current include list is attached 
to it.  When the lexer reaches an include statement, it clones its 
current include list, adds a reference to the include statement 
itself to the copy, and passes the copy into the sublexer as that 
sublexer's include list.

That way, every token would carry with it the full chain of 
include files (and line numbers of the include statements) that it 
took to get there, which would make for very useful error 
messages.
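
A minimal sketch of that include-list idea, again in Python rather 
than against the real ANTLR classes. Everything here is an assumption 
made for illustration: the toy "include <file>" syntax, the sources 
dict standing in for the file system, and the names IncludeSite, Token, 
Lexer and describe. The list is an immutable tuple, so it gets copied 
once per include statement and every token from the same file shares 
the same object:

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class IncludeSite:
    file: str
    line: int


@dataclass(frozen=True)
class Token:
    text: str
    file: str
    line: int
    include_chain: Tuple[IncludeSite, ...]  # outermost include first


class Lexer:
    # Toy lexer: whitespace-separated words, "include <file>" on its own line.
    def __init__(self, file: str, sources: Dict[str, List[str]],
                 include_chain: Tuple[IncludeSite, ...] = ()):
        self.file = file
        self.sources = sources
        self.include_chain = include_chain

    def tokens(self) -> Iterator[Token]:
        for line_no, line in enumerate(self.sources[self.file], start=1):
            words = line.split()
            if len(words) == 2 and words[0] == "include":
                # Copy-and-extend happens once per include statement.
                chain = self.include_chain + (IncludeSite(self.file, line_no),)
                yield from Lexer(words[1], self.sources, chain).tokens()
            else:
                for word in words:
                    # Every token shares the lexer's chain; no per-token copy.
                    yield Token(word, self.file, line_no, self.include_chain)


def describe(token: Token) -> str:
    lines = [f"{token.file}:{token.line}: at '{token.text}'"]
    for site in reversed(token.include_chain):
        lines.append(f"  included from {site.file}:{site.line}")
    return "\n".join(lines)


sources = {
    "main.c": ["include a.h", "x"],
    "a.h": ["include b.h", "y"],
    "b.h": ["z w"],
}
toks = list(Lexer("main.c", sources).tokens())
message = describe(toks[0])
```

Here message reads "b.h:1: at 'z'" followed by "included from a.h:1" 
and "included from main.c:1", which is the kind of chain a C compiler 
prints for errors inside nested headers.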

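Obviously this will increase memory usage a bit (depending on how 
many levels of nested includes you have), but it's probably fairly 
minimal.  Just make sure you clone the list only once per include 
statement, and let all tokens from the same file share that one 
list rather than cloning it for each individual token ;)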