[antlr-interest] Non-disjoint tokens
Gavin Lambert
antlr at mirality.co.nz
Sun Nov 25 23:44:56 PST 2007
At 00:41 26/11/2007, Steve Bennett wrote:
>I'm finding that with rules like this:
>
>HTML: '<HTML>';
>LT: '<';
>
>The lexer falls into a hole when it hits a sequence of characters
>with the same first two characters as '<HTML>'. In the debugger,
>the input pane shows that the <HT have been completely swallowed.

The usual trick with common-prefix literals (or perhaps the
"other" usual trick, since Austin already posted the semantic
predicate version) is to compose them into a single rule. The key
point is to explicitly give ANTLR the alternatives so that it
doesn't try to plunge ahead without looking first.

tokens { HTML; }

LT
    : '<'
      ( /* nothing */
      | ('HTML>') => 'HTML>' { $type=HTML; }
      )
    ;

(It'd be nice if you didn't need that syntactic predicate, but
sadly you do. But this does work.)
I prefer this sort of approach over using semantic predicates; I
try to use those as little as possible. (Mainly because I think
they're ugly, but also because they're target-specific.)
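
For anyone who finds the rule easier to follow as ordinary code, here is a rough Python sketch of the decision the combined LT rule makes. It is not what ANTLR generates; the tokenize function and the CHAR catch-all type are made up purely for illustration. The point is only that the suffix is checked in full before any of it is consumed, which is the job the syntactic predicate does in the grammar.

```python
def tokenize(text):
    """Return a list of (type, lexeme) pairs for '<', '<HTML>' and anything else."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == '<':
            # Look at the whole 'HTML>' suffix before consuming any of it.
            if text.startswith('HTML>', i + 1):
                tokens.append(('HTML', '<HTML>'))
                i += len('<HTML>')
            else:
                # Empty alternative: the token is just a plain '<'.
                tokens.append(('LT', '<'))
                i += 1
        else:
            # Hypothetical catch-all so the sketch runs on arbitrary input.
            tokens.append(('CHAR', text[i]))
            i += 1
    return tokens
```

With this, input such as '<HTx' produces an LT token followed by the remaining characters, rather than losing the '<HT' prefix.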