[antlr-interest] Non-disjoint tokens
Steve Bennett
stevagewp at gmail.com
Sun Nov 25 03:41:30 PST 2007
Given an input stream like this:
blah <HTML> <NOTHING blah <HTMNOTL> ...
I need to parse <HTML> as one token, and otherwise < as its own token.
<HTMNOTL> would ideally be 3 tokens (<, HTMNOTL, >) but it can be one
token.
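To make the desired result concrete, here is a minimal sketch in plain Python (not ANTLR) of the tokenisation I'm after. The names TOKEN_RE, tokenize, and the token type labels are made up for illustration. It relies on regex alternatives being tried in order, so the full tag is tried before a bare <.

```python
import re

# Alternatives are tried left to right, so the complete '<HTML>' tag
# is attempted before falling back to a lone '<'.
TOKEN_RE = re.compile(
    r"(?P<HTML><HTML>)"      # the whole tag as a single token
    r"|(?P<LT><)"            # any other '<' is its own token
    r"|(?P<GT>>)"            # '>' is its own token
    r"|(?P<TEXT>[^<>\s]+)"   # a run of ordinary non-space characters
)

def tokenize(text):
    """Return (type, lexeme) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

if __name__ == "__main__":
    for token in tokenize("blah <HTML> <NOTHING blah <HTMNOTL> ..."):
        print(token)
```

With this sketch, <HTMNOTL> comes out as the three tokens <, HTMNOTL, and >, because the HTML alternative fails without consuming any input and the matcher falls back to LT.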
I'm finding that with rules like this:
HTML: '<HTML>';
LT: '<';
The lexer falls into a hole when it hits a sequence of characters with
the same first two characters as '<HTML>'. In the debugger, the input
pane shows that the characters <HT have been completely swallowed.
Is there a way to avoid this? A workaround? It seems like a pretty
common thing to want. Would this not mean that you couldn't have a
token which matches 'private' and still lex 'privort' as an identifier?
I've tried refactoring along the lines of:
LT: '<';
HTML: LT 'HTML>';
LT_LITERAL: LT;
But this still doesn't seem to work.
Any suggestions? This is a complete showstopper if I can't find a way
around this. It's really crucial that the lexer can recognise this tag
so it can tokenise the following input differently...
Steve