[antlr-interest] Bug? Line number off in generated lexer

Fri Aug 7 16:36:23 PDT 2009

Hi folks:

I see in the generated Lexer code that there are comments in each token method indicating the corresponding line:col in the xxx.g source file. 

For methods implementing lexer rules, that line number is accurate. However, for methods that recognize the tokens in the tokens{...} list, the line number is off by quite a bit. (Generated with ANTLR 3.1.3.)

Example:
tokens{
    SemiColon         = ';';   // line 37

Generates
----------------------------------
    // $ANTLR start "SemiColon"
    public final void mSemiColon() throws RecognitionException {
        try {
            int _type = SemiColon;
            int _channel = DEFAULT_TOKEN_CHANNEL;
            // ipsum.g:7:11: ( ';' )      <------------- ERROR
            // ipsum.g:7:13: ';'
            {
            match(';'); if (state.failed) return ;

            }

            state.type = _type;
            state.channel = _channel;
        }
        finally {
        }
    }
    // $ANTLR end "SemiColon"

----------------------------------

In the case above, the reported line number (7) undercounts the actual one (37) by 30. Items later in the tokens{...} list are undercounted by progressively more, to the extent that blank lines or comment lines appear in the list.
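
For anyone who wants to measure the undercount per token rather than eyeball it, here is a minimal Java sketch. It is only an illustration: the class name LineOffsetCheck and its methods are hypothetical, not part of ANTLR, and the regular expressions assume simple one-line `Name = 'literal';` entries, each on its own line after the `tokens{` line. It reads the grammar and the generated lexer, then prints the actual line, the line reported in the first location comment, and the difference.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR.
final class LineOffsetCheck {
    // A tokens{} entry such as:   SemiColon = ';';
    private static final Pattern TOKEN_DEF =
        Pattern.compile("^\\s*([A-Za-z_]\\w*)\\s*=\\s*'.*';");
    // The marker that opens each generated method:  // $ANTLR start "SemiColon"
    private static final Pattern RULE_START =
        Pattern.compile("//\\s*\\$ANTLR start \"(\\w+)\"");
    // A location comment such as:  // ipsum.g:7:11: ( ';' )
    private static final Pattern LOCATION =
        Pattern.compile("//\\s*\\S+\\.g:(\\d+):(\\d+):");

    /** Maps each token name in the tokens{...} block to its 1-based grammar line. */
    static Map<String, Integer> actualLines(String grammar) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String[] lines = grammar.split("\r?\n", -1);
        boolean inTokens = false;
        for (int i = 0; i < lines.length; i++) {
            if (!inTokens) {
                // Assumption: the block opens with "tokens{" at the start of a line.
                inTokens = lines[i].matches("\\s*tokens\\s*\\{.*");
                continue;
            }
            if (lines[i].trim().startsWith("}")) break;   // end of tokens block
            Matcher m = TOKEN_DEF.matcher(lines[i]);
            if (m.find()) result.put(m.group(1), i + 1);
        }
        return result;
    }

    /** Maps each generated rule name to the line in its first location comment. */
    static Map<String, Integer> reportedLines(String lexerSource) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String current = null;
        for (String line : lexerSource.split("\r?\n", -1)) {
            Matcher start = RULE_START.matcher(line);
            if (start.find()) { current = start.group(1); continue; }
            Matcher loc = LOCATION.matcher(line);
            if (current != null && !result.containsKey(current) && loc.find()) {
                result.put(current, Integer.parseInt(loc.group(1)));
            }
        }
        return result;
    }

    // Usage: java LineOffsetCheck <grammar.g> <GeneratedLexer.java>
    public static void main(String[] args) throws IOException {
        String grammar = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String lexer = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        Map<String, Integer> reported = reportedLines(lexer);
        for (Map.Entry<String, Integer> e : actualLines(grammar).entrySet()) {
            Integer r = reported.get(e.getKey());
            if (r != null) {
                System.out.println(e.getKey() + ": actual " + e.getValue()
                    + ", reported " + r + ", undercount " + (e.getValue() - r));
            }
        }
    }
}
```

Running it against the grammar and its generated lexer lists the undercount for each entry in the tokens{...} list, which makes it easier to see how the difference grows from one item to the next.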

However, I don't see a pattern that explains the undercount before the first tokens{...} item. There may be a sizable comment at the beginning of the file, but its size doesn't match the initial undercount, so something more subtle is going on there.

I've attached a sample grammar file that demonstrates the problem.

-- Graham

-------------- next part --------------
/*
ipsum grammar
Graham Wideman
Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Nam a
diam id urna lobortis vehicula non at

dolor. Nunc vulputate sem non odio ullamcorper
sodales. Nulla sed tortor a metus posuere tempor.
Etiam fringilla mi vel nisl pellentesque ac sodales dolor
ullamcorper. Suspendisse placerat sodales
turpis, ut auctor quam vestibulum sed. Maecenas

pulvinar blandit aliquet. Aliquam commodo ornare
pharetra. Maecenas a est ipsum, id porta tellus. In
sodales est vitae lectus adipiscing id sodales
nunc volutpat. Donec eu ipsum neque, at viverra o

rci. Sed a libero neque. Nulla eget nibh velit.
Vivamus enim nibh, pretium eget malesuada eu, vulputate
quis mauris. Praesent id mollis sapien. Pellentesque
ut dui eros, non varius nibh.
*/

grammar ipsum;

options {
    backtrack = true;
    memoize = true;
    k=2;
    output = AST;
    ASTLabelType = CommonTree;
}

tokens{
    //-------- Names and strings for non-alphabetical tokens ---------
    SemiColon         = ';';
    SkooglyBoogle     = '\\xx//';

    Comma             = ',';
    LParen            = '(';

    RParen            = ')';
}

@header{
package com.grahamwideman.ipsum.parser;
}
@lexer::header{
package com.grahamwideman.ipsum.parser;
}

prog : statement*;

statement
    : SkooglyBoogle SemiColon
    ;

//========================== Lexer ===================

//----------------------- Numbers -----------------

fragment
Digits
	: '0'..'9'+
	;

fragment
DNum
	:(('.' Digits)=>('.' Digits)|(Digits '.' Digits?))
	;

Boolean
    : 'true' | 'false' | 'TRUE' | 'FALSE'
    ;

Null
    : 'NULL'
    ;

//------------------ String-related -------------

Identifier
   : ('a'..'z' | 'A'..'Z' | '_')  ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*
   ;

//----------------- white space ------------

WhiteSpace
@init{
    $channel=HIDDEN;
}
	:	(' '| '\t'| '\n'|'\r')*
	;