[antlr-interest] Non-disjoint tokens

Sun Nov 25 03:41:30 PST 2007

Given an input stream like this:
blah <HTML> <NOTHING blah <HTMNOTL> ...

I need to parse <HTML> as one token, and otherwise < as its own token.
<HTMNOTL> would ideally be 3 tokens (<, HTMNOTL, >) but it can be one
token.
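
To make the intended tokenisation concrete, here is a minimal sketch in
Python rather than ANTLR. It is only an illustration of the behaviour I
am after, not a lexer implementation: the token names GT and WORD and
the tokenize function are made up for this example, and plain regex
alternation stands in for the lexer by trying the full tag first and
falling back to a bare '<'.

```python
import re

# Hypothetical token names (GT and WORD are not in my grammar).
# Alternatives are tried left to right, so '<HTML>' wins over '<'
# only when the whole tag is actually present in the input.
TOKEN_RE = re.compile(
    r"(?P<HTML><HTML>)|(?P<LT><)|(?P<GT>>)|(?P<WORD>[^\s<>]+)"
)

def tokenize(text):
    """Return (type, text) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("blah <HTML> <NOTHING blah <HTMNOTL>"))
```

With this, <HTML> comes out as a single HTML token, while <HTMNOTL>
comes out as LT, WORD, GT instead of losing its first few characters.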

I'm finding that with rules like this:

HTML: '<HTML>';
LT: '<';

the lexer falls into a hole when it hits a sequence of characters with
the same first two characters as '<HTML>'. In the debugger, the input
pane shows that the characters <HT have been completely swallowed.

Is there a way to avoid this? A workaround? It seems like a pretty
common thing to want - would this not mean that you couldn't have a
token which matches 'private' and also an identifier 'privort'?

I've tried refactoring along the lines of:

LT: '<';
HTML: LT 'HTML>';
LT_LITERAL: LT;

But this still doesn't seem to work.

Any suggestions? This is a complete showstopper if I can't find a way
around this. It's really crucial that the lexer can recognise this tag
so it can tokenise the following input differently...

Steve