[antlr-interest] Looking for reference to how ANTLR performs Lexing....

Thu Sep 10 12:43:02 PDT 2009

At 06:02 11/09/2009, Sylvain, Gregory [USA] wrote:
>I have a bunch of questions about how ANTLR (v3) is lexing its 
>input stream.  I am continually chasing bugs about how ANTLR 
>lexed some text as one token when I was expecting it to lex it as 
>another token.
>
>I've checked out this list and elsewhere on the ANTLR.org site 
>for references to lexical analysis and how ANTLR is doing it, but 
>I have not found much.
>
>Specifically, I am trying to understand the order of evaluation 
>of the lexer rules and how ANTLR combines them in order to 
>complete the analysis of its given input.

The best thing to do is to write a bunch of unit tests for your 
lexer.  Not only is this a good sanity check that you're getting 
the output you want, it's a safety net that will help ensure that 
adding or modifying rules doesn't break existing examples.
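
As a rough illustration of what such tests can look like, here is a 
minimal table-driven sketch in Python.  The tokenize() function and 
its INT/ID/WS rules are hypothetical stand-ins, not ANTLR output; in 
a real project that function would wrap the generated lexer and 
return its tokens in the same (type, text) shape.

```python
import re

# Hypothetical stand-in for a generated lexer (an assumption for this
# sketch): three toy rules combined into one regular expression.
TOKEN_RE = re.compile(
    r"(?P<INT>[0-9]+)|(?P<ID>[A-Za-z_][A-Za-z_0-9]*)|(?P<WS>\s+)"
)


def tokenize(text):
    """Return a list of (token_type, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("no token matches at offset %d" % pos)
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# Each case pairs an input with the exact token sequence expected.
CASES = [
    ("abc", [("ID", "abc")]),
    ("42", [("INT", "42")]),
    ("x1 7", [("ID", "x1"), ("INT", "7")]),
]


def run_cases(lex, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        actual = lex(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures
```

Whenever the lexer surprises you, add the offending input and the 
token sequence you wanted to CASES; an empty list from run_cases() 
then means every recorded example still lexes as intended.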

The first thing to realise is that the lexer runs independently, 
without any context from the parser.  So if you have two lexer 
rules that could potentially match the same input, the fact that 
the parser expects one of them at that point is not going to help 
disambiguate between them.

The general rule of thumb that ANTLR follows when deciding between 
lexer tokens is that longest-match wins; failing that, whichever 
rule is listed first will win.
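
To make that rule of thumb concrete, here is a small Python sketch 
of "longest match wins, then first listed wins".  It only models the 
rule of thumb as stated; it is not how the generated lexer is 
implemented, and the IF/ID/WS rules are made up for the example.

```python
import re

# Hypothetical rules, in the order they would be listed in a grammar.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z]+")),
    ("WS", re.compile(r" +")),
]


def next_token(text, pos):
    """Pick the rule with the longest match at pos; ties go to the
    rule listed first."""
    best = None  # (rule_name, end_offset)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly greater: on a tie the earlier rule is kept.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise ValueError("no rule matches at offset %d" % pos)
    return best


def tokenize(text):
    """Return (rule_name, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        name, end = next_token(text, pos)
        if name != "WS":
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens
```

With these rules, "if" comes out as IF (both IF and ID match two 
characters, and IF is listed first), while "iffy" comes out as a 
single ID because the ID match is longer.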

Best practice is to ensure that your lexer rules are completely 
unambiguous; stick to generalities and don't try to assign 
semantic meaning until at least the parser stage.  It's also a 
good idea to merge rules that have a common left prefix (and "left 
factor" them) to reduce the amount of lookahead required to 
disambiguate between them, and also to give you more control over 
the disambiguation if you need it.  A classic example of this (on 
the wiki) is INT vs. FLOAT vs. RANGE.
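
The wiki example itself is written as ANTLR grammar rules; what 
follows is only a hand-written Python analogue of the idea, under the 
assumption that FLOAT requires a digit after the dot and RANGE is 
"..".  INT and FLOAT share their leading digits, so a single merged 
routine scans the common prefix once and then makes one explicit 
decision, which is what keeps "1..2" from being mistaken for the 
start of a float.

```python
DIGITS = "0123456789"


def scan_number(text, pos):
    """Scan a number starting at text[pos] (which must be a digit).

    Returns (token_type, end_offset).  The shared digit prefix is read
    once; the token type is decided afterwards.
    """
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    # Single decision point: '.' followed by a digit means FLOAT.
    # A '.' followed by anything else (such as a second '.') is left
    # alone, so the digits so far form an INT.
    if end + 1 < len(text) and text[end] == "." and text[end + 1] in DIGITS:
        end += 1
        while end < len(text) and text[end] in DIGITS:
            end += 1
        return ("FLOAT", end)
    return ("INT", end)


def tokenize(text):
    """Return (token_type, text) pairs for INT, FLOAT and RANGE."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == " ":
            pos += 1
            continue
        if ch in DIGITS:
            kind, end = scan_number(text, pos)
        elif text.startswith("..", pos):
            kind, end = "RANGE", pos + 2
        else:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

Here "1.5" lexes as one FLOAT, while "1..2" lexes as INT, RANGE, INT, 
because the merged routine looks one character past the dot before 
committing to FLOAT.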