[antlr-interest] complex lexer (at least to me)

Fri Oct 15 04:19:19 PDT 2010

Hi list,

while doing a parser I ran into the trouble of lexing correctly comments and 
non-comments that look like comments.

Comments starts with a '#' and ends at newline, they should be hidden.
BUT '#!something' is an ID
and ':#header' has its meaning too

I've tried several ways which never worked enough, synpreds, ...

This one eats everything in the last option:

COLUMN_NAMES_END
    : HASH HEADER {System.out.println(" ^~^ LEXER: found HEADER_COMMENT: " + 
$text); };
DBT_UNIT_NAME_START
    : HASH BANG {System.out.println(" ^~^ LEXER: found DBT_UNIT_NAME_START: " + 
$text); };
LINE_COMMENT_OR_ELSE
    : ( HASH BANG )     => DBT_UNIT_NAME_START{ $type = DBT_UNIT_NAME_START; 
System.out.println(" ^~^ LEXER: found HASH BANG: DBT_UNIT_NAME_START: " + 
$text); }
    | ( HASH HEADER )     => COLUMN_NAMES_END     { $type = COLUMN_NAMES_END;    
System.out.println(" ^~^ LEXER: found HASH HEADER: COLUMN_NAMES_END: " + $text); 
}
    | ( HASH (options {greedy=false;} : .)* NEWLINE )   => COMMENT 
{System.out.println(" ^~^ LEXER: LINE_COMMENT Ignoring LINE comment: " + $text); 
}
       ;
protected 
COMMENT    : HASH (options {greedy=false;} : .)* NEWLINE 
    {$channel=HIDDEN; System.out.println(" ^~^ LEXER: COMMENT: Ignoring LINE 
comment: " + $text); }
    ;

So every '#' line ends up caught by COMMENT and I get this unique error message 
on grammar generation:
     [java] error(208): JADATextGrammar.g:98:1: The following token definitions 
can never be matched because prior tokens match the same input: COMMENT

Any ideas??

Stanislas Herman Rusinsky.

P.S.: From the article "What makes a language problem hard?" ( 
http://www.antlr.org/wiki/pages/viewpage.action?pageId=1773 )it looks like I 
meet those:

	* Context sensitive lexer?  You can't decide what vocabulay symbol to match 
unless you know what kind of sentence you are parsing.
	* Is the set of all input fixed? If you have a fixed set of files to convert, 
your job is much easier because the set of language construct combinations is 
fixed. For example, building a general Pascal to Java translator is much harder 
than building a translator for a set of 50 existing Pascal files.
	* Are delimiters non-fixed for things like strings and comments?  That makes it 
tough to build an efficient lexer.
	* Are the source statements really similar; declarations vs expressions in C++?
	* Column sensitive input? E.g., are newlines significant like lines in a log 
file and does the position of an item change its meaning?
	* Does your input have comments as you do in programming languages that can 
occur anywhere in the input and need to go into the output in a sane location?
	* How much semantic information do you need to do the translation? For example, 
do you need to simply know that something is a type name or do you need to know 
that it is, say, an array whose indices are a set like (day,week,month) and 
contains records? Sometimes syntax alone is enough to do translation.
	* ...