[antlr-interest] Understanding Lexer rules

Wed Feb 20 06:54:51 PST 2008

On Feb 20, 2008 8:50 AM, Johannes Luber <jaluber at gmx.de> wrote:
> Mark Volkmann wrote:
> > Let's see if I can summarize the rules from this wiki section.
> >
> > If the upcoming characters in the stream match a non-imaginary token
> > defined in a token spec. then that is used. tokens { ... } comes
> > first.
>
> Correct.
>
> > After that, lexer rules are evaluated in the order in which they are
> > specified. The first one that matches the upcoming characters in the
> > stream is used, not the one that matches the greatest number of
> > characters.
>
> No. Rereading the text, I suppose one could be confused about that, as
> it isn't as clear as it could be. Is adding "Longer matches are
> preferred over shorter matches. If one has two tokens KEY='key'; and
> KEYWORD='keyword';, then the input 'keyword' will match KEYWORD, even if
> KEY comes first." enough?

Does the same rule about matching the maximum length apply when the
lexer rules use the cardinality operators (?, * and +)? I recommend
adding something about that and not just showing examples with fixed
literal values.

In case of a tie in the number of characters matched by multiple lexer
rules, does the first one win?
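
To make the two questions concrete, here is a minimal Python sketch of
the behavior being asked about: the longest match wins, and on a tie the
rule listed first wins. It is a generic lex-style "maximal munch" loop,
not a description of what ANTLR's generated lexer actually does. The
rule table, the ID/INT/WS rules and the tokenize() function are all
made up for illustration.

```python
import re

# Hypothetical rule table, listed in "grammar order".
# ID, INT and WS use the + cardinality operator; KEY and KEYWORD are literals.
RULES = [
    ("KEY", re.compile(r"key")),
    ("KEYWORD", re.compile(r"keyword")),
    ("ID", re.compile(r"[a-z]+")),
    ("INT", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"[ \t\n]+")),
]


def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is matched but not emitted
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("key keyword keys 42"))
```

Under this scheme "keyword" becomes KEYWORD even though KEY is listed
first, "key" becomes KEY rather than ID because both match three
characters and KEY comes first, and "keys" becomes ID because ID matches
more characters than KEY.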

> > After that, literals specified in parser rules are considered. This
> > means that parser rules containing literals will not match the input
> > if there is a lexer rule that matches the same input.
> >
> > Does all that sound correct?
> >
> > At the end of the "How to define tokens" section in the wiki it says
> > that "lexer rules will greedily match the maximum of applicable
> > characters". There is an exception to this. When the patterns ".*" or
> > ".+" appear in a lexer rule, they do not match greedily.
>
> I'll add that.
>
> Johannes
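
For anyone unsure what greedy versus non-greedy means for ".*", here is
a small Python sketch using regular expressions as an analogy. It only
illustrates the two matching styles; it does not use or model ANTLR, and
the sample input and variable names are made up. In Python's re module
".*" is greedy and ".*?" is non-greedy.

```python
import re

# Hypothetical input containing two block comments.
source = "/* first */ x = 1; /* second */"

# Greedy: '.*' consumes as much as possible, so the match runs to the
# LAST '*/' in the input.
greedy = re.match(r"/\*.*\*/", source).group(0)

# Non-greedy: '.*?' consumes as little as possible, so the match stops
# at the FIRST '*/'.
non_greedy = re.match(r"/\*.*?\*/", source).group(0)

print(repr(greedy))
print(repr(non_greedy))
```

The non-greedy form is the one that matches a single comment, which is
the behavior described above for ".*" and ".+" in a lexer rule.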

Thanks Johannes!

-- 
R. Mark Volkmann
Object Computing, Inc.