[antlr-interest] Non-disjoint tokens

Sun Nov 25 23:44:56 PST 2007

At 00:41 26/11/2007, Steve Bennett wrote:
 >I'm finding that with rules like this:
 >
 >HTML: '<HTML>';
 >LT: '<';
 >
 >The lexer falls into a hole when it hits a sequence of characters
 >with the same first two characters as '<HTML>'. In the debugger,
 >the input pane shows that the <HT have been completely swallowed.

The usual trick with common-prefix literals (or perhaps the 
"other" usual trick, since Austin already posted the semantic 
predicate version) is to compose them into a single rule.  The key 
point is to explicitly give ANTLR the alternatives so that it 
doesn't try to plunge ahead without looking first.

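// HTML is declared as a token type only; it has no lexer rule of its own.
tokens { HTML; }

LT
	:	'<'
	(	/* nothing: a plain '<', so the token type stays LT */
	// Only if the whole of 'HTML>' follows: consume it and retype the token.
	|	('HTML>') => 'HTML>' { $type=HTML; }
	)
	;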

(It'd be nice if you didn't need that syntactic predicate, but 
sadly you do.  But this does work.)
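
To make the lookahead concrete, here is a rough hand-written Java sketch of the decision the combined rule asks the lexer to make. It is not what ANTLR generates, just the same logic spelled out by hand. The class name PrefixLexerSketch, the tokenize method, and the CHAR fallback for everything else are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of the decision made by the combined LT rule.
// Not ANTLR output; all names here are hypothetical.
final class PrefixLexerSketch {
    private static final String HTML_TAIL = "HTML>";

    // Returns token descriptions such as "LT(<)", "HTML(<HTML>)" or "CHAR(x)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '<') {
                pos++; // the shared prefix '<' is always consumed first
                // Look ahead without consuming, as the syntactic predicate does.
                if (input.startsWith(HTML_TAIL, pos)) {
                    pos += HTML_TAIL.length();
                    tokens.add("HTML(<HTML>)"); // counterpart of $type=HTML
                } else {
                    tokens.add("LT(<)"); // the empty alternative
                }
            } else {
                // Stand-in for whatever the rest of the grammar would match.
                tokens.add("CHAR(" + c + ")");
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [HTML(<HTML>), LT(<), CHAR(H), CHAR(T), LT(<)]
        System.out.println(tokenize("<HTML><HT<"));
    }
}
```

Note that "<HT" followed by something other than "ML>" comes out as LT plus ordinary characters; nothing is swallowed, because the check for the full literal happens before any of it is consumed.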

I prefer this sort of approach over using semantic predicates; I 
try to use those as little as possible.  (Mainly because I think 
they're ugly, but also because they're target-specific.)