[antlr-interest] Confused about backtracking in lexer rules
Gavin Lambert
antlr at mirality.co.nz
Sun Nov 16 10:44:43 PST 2008
At 01:39 17/11/2008, William Rose wrote:
>What I'm finding is that the lexer starts matching the URL,
>gets to a point where it can't match the character, then
>drops everything it has read so far and starts lexing from
>the next character, losing the initially matched tokens.
>
>I've tried to find out how to stop this, but the best I've
>come up with is options { backtrack = true; }, which didn't
>work. Syntactic or semantic predicates are often mentioned
>as helpful, but I don't see how I can write a predicate to
>help with this decision.
Backtracking is a parser concept -- it's not available in the
lexer.
>URL : LETTER (LETTER | DIGIT | HYPHEN)* COLON
> ~(SPACE | TAB | CR | LF)+
> ;
>
>TEXT : ~(COLON | SLASH | HYPHEN | ASTERISK |
> SPACE | TAB | CR | LF)*
> ;
When you have two top-level rules like this, ANTLR basically looks
at the input stream and uses as little lookahead as it can to
choose between them. In this case, seeing a single letter with a
colon a little later is sufficient to choose URL, and once there
it can't switch back to TEXT. (I'm not even sure if it would look
for the colon -- it might decide that the LETTER by itself is
sufficient justification for choosing URL, since explicitly named
characters trump exclusion sets, and the standard lookahead has
trouble seeing through loops because it can't use fixed
lookahead there.)
One way you can resolve this is to make your URL rule more
specific -- e.g. only consider it a URL if it starts with "http"
or "ftp" or "mailto" or whatever other schemes you're expecting.
Otherwise you'll need to merge these into one rule and use a
syntactic predicate to force complete lookahead (which is
functionally equivalent to backtracking):
fragment URL : LETTER (LETTER | DIGIT | HYPHEN)* COLON
               ~(SPACE | TAB | CR | LF)+
             ;
TEXT : (URL) => URL { $type = URL; }
     | ~(COLON | SLASH | HYPHEN | ASTERISK |
         SPACE | TAB | CR | LF)*
     ;
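For comparison, the first option (restricting URL to known schemes)
might look roughly like the sketch below. The scheme list is only
an example, I haven't run this against your grammar, and it assumes
COLON, SPACE, TAB, CR and LF are the same single-character rules
you already have:

```antlr
// Sketch only: the schemes listed here are illustrative, not exhaustive.
// The fixed literals give the lexer a finite prefix to look ahead on,
// instead of the LETTER (LETTER | DIGIT | HYPHEN)* loop.
URL : ('http' | 'ftp' | 'mailto') COLON
      ~(SPACE | TAB | CR | LF)+
    ;
```

TEXT stays as it was. The idea is that an ordinary word followed by
a colon should no longer be predicted as URL, since it doesn't
start with one of the listed schemes.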