[antlr-interest] token precedence (and an ANTLRworks question)
Gavin Lambert
antlr at mirality.co.nz
Mon Nov 17 11:48:37 PST 2008
At 21:54 17/11/2008, Davyd Madeley wrote:
> TOKEN
> : ~(NEWLINE|','|'>')+
> ;
[...]
> LINE,1500,4,60,60
> **INPUT/NOSICHECK
>
>Into a token stream:
>
> |LINE|,|1500|,|4|,|60|,|60|
> |**|INPUT|/|NOSICHECK|
>
>But instead what I'm ending up with is:
>
> |LINE|,|1500|,|4|,|60|,|60|
> |**INPUT/NOSICHECK|
>
>This suggests to me that it's wrong of me to assume that the first
>rule will be matched first. I can't find much discussion of
>precedence rules in the ANTLR book.
Essentially how it works is that at the top-level ANTLR uses the
least amount of lookahead it can get away with to choose between
the top-level rules. The order of the rules is unimportant
(except when it can't decide between them any other way). Once it
is "inside" a rule, however, it uses only that rule's own
conditions to decide when to stop, ignoring the possibility of
stopping earlier and generating additional tokens. (Which makes
sense when you think about it.)
In this case, on seeing '*', ANTLR enters the TOKEN rule (since no
other rule could possibly match that character), and won't leave
the rule until it hits any of a newline, comma, or '>'. Thus
everything on that second line is matched as a single TOKEN token.
If you want to split the TOKEN up at things that might be
IDENTIFIERs, then you'll need to add CHAR to the list of
terminating characters in the TOKEN rule.
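
To make the difference concrete, here is a rough toy model in Python of the behaviour described above. It is not ANTLR and not the generated lexer: the rule names PUNCT and IDENTIFIER, the assumption that CHAR means ASCII letters, and the tokenize helper are all stand-ins I made up for illustration, since the rest of the grammar was elided. It only shows that the first character picks the rule, and after that the rule's own stop set is the only thing that ends the token.

```python
import string

# Assumption: CHAR in the elided grammar means ASCII letters.
LETTERS = set(string.ascii_letters)


def tokenize(text, token_stops):
    """Split text into (type, lexeme) pairs.

    The first character selects a rule. After that, only the chosen
    rule's own stop condition ends the token.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "\n,>":
            # Newline, comma and '>' are single-character tokens here.
            kind, j = "PUNCT", i + 1
        elif ch in LETTERS:
            # Hypothetical IDENTIFIER rule: a run of letters.
            kind, j = "IDENTIFIER", i
            while j < len(text) and text[j] in LETTERS:
                j += 1
        else:
            # TOKEN rule: keep going until one of its stop characters.
            kind, j = "TOKEN", i + 1
            while j < len(text) and text[j] not in token_stops:
                j += 1
        tokens.append((kind, text[i:j]))
        i = j
    return tokens


ORIGINAL_STOPS = set("\n,>")            # TOKEN as posted
FIXED_STOPS = ORIGINAL_STOPS | LETTERS  # TOKEN with CHAR added

line = "**INPUT/NOSICHECK"
print([lexeme for _, lexeme in tokenize(line, ORIGINAL_STOPS)])
# ['**INPUT/NOSICHECK']
print([lexeme for _, lexeme in tokenize(line, FIXED_STOPS)])
# ['**', 'INPUT', '/', 'NOSICHECK']
```

The first input line (LINE,1500,4,60,60) splits the same way under both stop sets in this model, because the commas already terminate every token there.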
>Also, the ANTLRworks debugger can show you the token stream with
>little red boxes around each token, but I can't seem to work out
>how to find out the token type for that token, is there something
>I'm missing here?
Not that I know of; I've wished it'd do that for quite some
time. ANTLRworks is fairly weak at sorting out problems in the
lexer; I've usually found that it's better to write my own unit
tests for that purpose.