[antlr-interest] Doesn't the lexer try rules in order?
John B. Brodie
jbb at acm.org
Sat May 7 09:12:59 PDT 2011
Greetings!
On Sat, 2011-05-07 at 10:06 -0400, Todd O'Bryan wrote:
> Can anyone explain to me why tabs, spaces, and greater-thans at the
> beginning of lines are ending up in TEXT tokens, rather than in INDENT
> or QUOTE tokens, as I think they should?
>
> fragment SPECIAL_CHARS
> : ('\n' | '[' | ']' | '*' | '/' |'=' | '^' | '_' | '8' | '@' | '#' |
> '$' | '!' | '(' | ')' | '{' | '}' );
>
> INDENT : { getCharPositionInLine() == 0 }?=> (' '|'\t')+;
> QUOTE : { getCharPositionInLine() == 0 }?=> '>';
> TEXT : (~SPECIAL_CHARS)+;
>
> This is in a lexer grammar and I've omitted some other rules that
> shouldn't (I don't think) have any bearing on this question.
Currently, ANTLR lexers match greedily: for each token they consume the
longest possible sequence of acceptable characters, so the order of the
rules does not decide the outcome here.
So I think that when the characters following the '>' also match TEXT
(i.e. they are not one of the SPECIAL_CHARS), the entire sequence,
including the '>', is matched as TEXT. The same applies to the INDENT token.
You can verify this by trying input such as ">$" or " $", each on a line
by itself. I would expect you to get either a QUOTE or an INDENT token,
followed by whatever token matches a '$'. (Note: this input may not parse
correctly, but you should still see the two-token sequence.)
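To illustrate the longest-match behavior described above, here is a
minimal Python sketch of such a tokenizer. It is only a simplified model,
not ANTLR's actual implementation. The regular expressions are my own
approximations of INDENT, QUOTE, and TEXT, and OTHER is a hypothetical
catch-all standing in for the rules omitted from the question.

```python
import re

# Characters excluded from TEXT, copied from the SPECIAL_CHARS fragment.
SPECIAL = "\n[]*/=^_8@#$!(){}"

# (name, pattern, only_at_column_0). These are approximations of the rules
# in the question; OTHER is an assumed catch-all for the omitted rules.
RULES = [
    ("INDENT", re.compile(r"[ \t]+"), True),
    ("QUOTE", re.compile(r">"), True),
    ("TEXT", re.compile("[^" + re.escape(SPECIAL) + "]+"), False),
    ("OTHER", re.compile(r"[\s\S]"), False),
]


def tokenize(text):
    """Return (rule name, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    col = 0
    while pos < len(text):
        best = None
        for name, pattern, line_start_only in RULES:
            # Model the { getCharPositionInLine() == 0 }? predicate.
            if line_start_only and col != 0:
                continue
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        name, end = best  # OTHER matches any character, so best is never None
        lexeme = text[pos:end]
        tokens.append((name, lexeme))
        # Track the column so the line-start rules apply only at column 0.
        if "\n" in lexeme:
            col = len(lexeme) - lexeme.rfind("\n") - 1
        else:
            col += len(lexeme)
        pos = end
    return tokens


if __name__ == "__main__":
    for sample in (">hello", ">$", " $"):
        print(repr(sample), tokenize(sample))
```

In this model ">hello" comes out as a single TEXT token, because TEXT
matches six characters while QUOTE matches only one. For ">$" and " $"
both candidate rules match a single character, so the earlier rule wins
and you get QUOTE or INDENT followed by OTHER.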
Hope this helps...
-jbb