[antlr-interest] question about lexer rules
Gavin Lambert
antlr at mirality.co.nz
Sat Dec 19 17:33:34 PST 2009
At 22:24 19/12/2009, codeman at bytefusion.de wrote:
>A question to lexer rules and its priorities. Is there any
>dependency between order of lexer rule definitions?
[...]
>My understanding of lexer rules is, the best rule will
>match. The best rule is the rule matching the most
>characters. But what about TIME and IDENTIFIER_LOWER? Both
>may match the same input sequence.
Both are true.  In general, the longest match wins.  But where two
rules match the same input equally well, the one listed first
wins.
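As a rough illustration of that rule (not of ANTLR's generated code,
which predicts a rule from lookahead rather than trying every rule),
here is a minimal Python sketch. The IF rule, the patterns, and the
next_token helper are all hypothetical and are not taken from your
grammar.

```python
import re

# Hypothetical rules in declaration order; names and patterns are
# illustrative only and do not come from the grammar under discussion.
RULES = [
    ("IF", re.compile(r"if")),
    ("IDENTIFIER_LOWER", re.compile(r"[a-z]+")),
]


def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the best match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best, so on
        # a tie the earlier-listed rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("if"))    # both rules match two characters
print(next_token("iffy"))  # IDENTIFIER_LOWER matches more characters
```

With "if" both rules match the same two characters, so the rule
listed first is chosen; with "iffy" the identifier rule matches more
characters and wins regardless of order.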
There are also some complications related to how ANTLR generates
the lookahead code: it stops looking ahead as soon as it has seen
enough input to rule out all other rules, which is sometimes early
enough to get it into trouble with certain kinds of input (hence
the trouble with INT vs. FLOAT tokens discussed here repeatedly).
I think in your case it'll be ok, but it's possible that ANTLR
might get into trouble with certain kinds of input -- for example,
"12h53" might be seen as a malformed TIME rather than a TIME
followed by a NUMBER.
There are some problems in that grammar, though.
1. The DIGIT, LOWERCASE, and UPPERCASE rules should almost
certainly be marked as fragment rules, since you don't really want
to get individual DIGIT or LOWERCASE tokens in the parser.
2. The IDENTIFIER_UPPER rule should use + instead of *; using *
means that a valid IDENTIFIER_UPPER can contain zero characters,
which can mean that ANTLR will get into an infinite loop producing
IDENTIFIER_UPPER tokens without consuming any input. In general,
no top-level lexer rule should ever permit zero consumption.
3. You have both a NEWLINE and a WS rule matching the same
characters, one skipped and one not skipped. If newlines are
significant to the parser then you should remove them from the WS
rule; if they're not then you should remove the NEWLINE rule, or
make it a fragment.
4. Your two identifier rules specify that identifiers cannot
contain digits, nor can they be mixed-case. Is this actually what
you wanted?
5. In the TIME rule, you are using + in a very bizarre
way. Remember, it denotes repetition, not concatenation. Are you
really trying to say that "12hhhhhh25mmm" is a valid TIME?
6. You should left-factor the TIME rule, so that all of the
alternatives with a common left prefix are expressed together
(i.e. write the common left prefix once, followed by the optional
alternatives).  This reduces the amount of lookahead ANTLR
requires, improves performance, and helps to reduce problematic
ambiguity cases.
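Points 2, 5, and 6 above can be tried out with ordinary regular
expressions, which use * and + the same way.  The sketch below is in
Python rather than ANTLR, and the TIME patterns are assumptions made
up for illustration; they are not the rule from your grammar.

```python
import re

# Point 2: '*' accepts an empty match, '+' needs at least one character.
star = re.compile(r"[A-Z]*")
plus = re.compile(r"[A-Z]+")
empty_match = star.match("123")   # succeeds, but consumes nothing
no_match = plus.match("123")      # fails, as it should

# Point 5: '+' after a literal repeats that literal.
# Hypothetical TIME pattern written with stray '+' operators.
repeated = re.compile(r"[0-9]+h+[0-9]+m+")
accepts_odd_time = repeated.fullmatch("12hhhhhh25mmm") is not None

# Point 6: left-factoring.  Both hypothetical patterns accept the same
# strings, but the second states the common prefix only once.
unfactored = re.compile(r"[0-9]+h[0-9]+m|[0-9]+h")
factored = re.compile(r"[0-9]+h(?:[0-9]+m)?")

samples = ["12h", "12h25m", "12h53", "h25m", "12"]
same_language = all(
    (unfactored.fullmatch(s) is None) == (factored.fullmatch(s) is None)
    for s in samples
)

print(repr(empty_match.group()), no_match)
print(accepts_odd_time, same_language)
```

A lexer that keeps accepting the empty match from the first pattern
never advances through the input, which is the infinite loop
described in point 2.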