[antlr-interest] Confused about backtracking in lexer rules

Gavin Lambert antlr at mirality.co.nz
Sun Nov 16 10:44:43 PST 2008


At 01:39 17/11/2008, William Rose wrote:
 >What I'm finding is that the lexer starts matching the URL,
 >gets to a point where it can't match the character, then
 >drops everything it has read so far and starts lexing from
 >the next character, losing the initially matched tokens.
 >
 >I've tried to find out how to stop this, but the best I've
 >come up with is options { backtrack = true; }, which didn't
 >work.  Syntactic or semantic predicates are often mentioned
 >as helpful, but I don't see how I can write a predicate to
 >help with this decision.

Backtracking is a parser concept -- it's not available in the 
lexer.

 >URL    :    LETTER (LETTER | DIGIT | HYPHEN)* COLON
 >            ~(SPACE | TAB | CR | LF)+
 >    ;
 >
 >TEXT    :    ~(COLON | SLASH | HYPHEN | ASTERISK |
 >             SPACE | TAB | CR | LF)*
 >    ;

When you have two top-level rules like this, ANTLR basically looks 
at the input stream and uses as little lookahead as it can to 
choose between them.  In this case, seeing a single letter with a 
colon a little later is sufficient to choose URL, and once there 
it can't switch back to TEXT.  (I'm not even sure if it would look 
for the colon -- it might decide that the LETTER by itself is 
sufficient justification for choosing URL, since explicitly named 
characters trump exclusion sets, and the standard lookahead has 
trouble seeing through loops, because it can't use fixed 
lookahead there.)

One way you can resolve this is to make your URL rule more 
specific -- e.g. only consider it a URL if it starts with "http" 
or "ftp" or "mailto" or whatever other schemes you're expecting.

Otherwise you'll need to merge these into one rule and use a 
syntactic predicate to force complete lookahead (which is 
functionally equivalent to backtracking):

fragment URL : LETTER (LETTER | DIGIT | HYPHEN)* COLON
                ~(SPACE | TAB | CR | LF)+
              ;
TEXT : (URL) => URL { $type = URL; }
      | ~(COLON | SLASH | HYPHEN | ASTERISK |
          SPACE | TAB | CR | LF)*
      ;


