[antlr-interest] Looking for reference to how ANTLR performs Lexing....
Gavin Lambert
antlr at mirality.co.nz
Thu Sep 10 12:43:02 PDT 2009
At 06:02 11/09/2009, Sylvain, Gregory [USA] wrote:
>I have a bunch of questions about how ANTLR (v3) is lexing its
>input stream. I am continually chasing bugs where ANTLR lexed
>some text as one token when I was expecting it to lex it as
>another token.
>
>I've checked out this list and elsewhere on the ANTLR.org site
>for references to lexical analysis and how ANTLR is doing it, but
>I have not found much.
>
>Specifically, I am trying to understand the order of evaluation
>of the lexer rules and how ANTLR combines them in order to
>complete the analysis of its given input.
The best thing to do is to write a bunch of unit tests for your
lexer. Not only is this a good sanity check that you're getting
the output you want, it's a safety net that will help ensure that
adding or modifying a rule doesn't break existing examples.
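As a rough sketch of what such tests can look like: a table of
input strings paired with the token types you expect, checked in a
loop. Everything named here is hypothetical. In particular,
token_types() is a small regex stand-in so the example runs on its
own; in a real project you would replace its body with a call into
your ANTLR-generated lexer.

```python
import re
import unittest

# Hypothetical stand-in for a generated lexer. Swap this out for a
# call into your real lexer; only the test table below is the point.
TOKEN_RE = re.compile(
    r"(?P<INT>[0-9]+)|(?P<ID>[A-Za-z_]\w*)|(?P<PLUS>\+)|(?P<WS>\s+)"
)

def token_types(text):
    """Return the list of token type names for text, skipping whitespace."""
    types = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("no token matches at position %d" % pos)
        if m.lastgroup != "WS":
            types.append(m.lastgroup)
        pos = m.end()
    return types

# Each case pins down one lexing decision: (input, expected token types).
CASES = [
    ("x + 1", ["ID", "PLUS", "INT"]),
    ("x1", ["ID"]),            # digits inside an identifier stay in the ID
    ("1x", ["INT", "ID"]),     # but a leading digit starts an INT
]

class LexerTest(unittest.TestCase):
    def test_token_types(self):
        for text, expected in CASES:
            with self.subTest(text=text):
                self.assertEqual(token_types(text), expected)
```

Saved as a module, this can be run with "python -m unittest". Every
time the lexer surprises you, add the offending input to CASES so
the surprise cannot come back unnoticed.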
The first thing to realise is that the lexer runs independently,
without any context from the parser. So if you have two lexer
rules that could potentially match the same input, the fact that
the parser only expects one of them at that point is not going to
help disambiguate between them.
The general rule of thumb that ANTLR follows when deciding between
lexer tokens is that longest-match wins; failing that, whichever
rule is listed first will win.
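To make that rule of thumb concrete, here is a toy model of it. It
is only an illustration of "longest match, then first listed" and
not how ANTLR's generated lexers are implemented; the rule names
and patterns are made up for the example.

```python
import re

# Hypothetical rules, in declaration order: (token name, regex).
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
    ("INT", r"[0-9]+"),
    ("WS", r" +"),
]

def tokenize(text):
    """Return (name, lexeme) pairs; longest match wins, ties go to the earliest rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule must match MORE text to win,
            # so on a tie the rule listed first is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        if best_name != "WS":  # whitespace is matched but discarded
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("if iffy 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('INT', '42')]
```

Both IF and ID match the two characters "if", so the tie goes to
IF because it is listed first. For "iffy", ID matches four
characters against IF's two, so the longer match wins regardless
of order.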
Best practice is to ensure that your lexer rules are completely
unambiguous; stick to generalities and don't try to assign
semantic meaning until at least the parser stage. It's also a
good idea to merge rules that have a common left prefix (and "left
factor" them) to reduce the amount of lookahead required to
disambiguate between them, and also to give you more control over
the disambiguation if you need it. A classic example of this (on
the wiki) is INT vs. FLOAT vs. RANGE.
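The idea behind that example can be sketched by hand. The trouble
case is input like "1..10": a lexer that has seen "1." may commit
to FLOAT when INT RANGE INT was intended. The sketch below is my
own hand-written illustration of a merged, left-factored number
rule, not the wiki's grammar; the function names are hypothetical.
It consumes the shared digit prefix once and then looks at the next
two characters to pick the token type.

```python
def lex_number_or_range(text, pos):
    """Return (token_type, lexeme) for the token starting at pos."""
    if text.startswith("..", pos):
        return ("RANGE", "..")
    start = pos
    # Common left prefix of INT and FLOAT: one or more digits.
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        raise ValueError("unexpected character at position %d" % start)
    # Only a single '.' followed by a digit continues into a FLOAT.
    # A following ".." is left alone so it can become a RANGE token.
    if pos + 1 < len(text) and text[pos] == "." and text[pos + 1].isdigit():
        pos += 1
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return ("FLOAT", text[start:pos])
    return ("INT", text[start:pos])

def tokenize_numbers(text):
    """Tokenize a string made up only of INT, FLOAT and RANGE tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        token = lex_number_or_range(text, pos)
        tokens.append(token)
        pos += len(token[1])
    return tokens

print(tokenize_numbers("1..10"))
# [('INT', '1'), ('RANGE', '..'), ('INT', '10')]
print(tokenize_numbers("1.5"))
# [('FLOAT', '1.5')]
```

Because the decision is made in one place, after the shared prefix,
you control exactly how far the lexer looks ahead and what each
outcome produces.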