[antlr-interest] token precedence (and an ANTLRworks question)

Mon Nov 17 11:48:37 PST 2008

At 21:54 17/11/2008, Davyd Madeley wrote:
 >        TOKEN
 >        	: ~(NEWLINE|','|'>')+
 >        	;
[...]
 >	LINE,1500,4,60,60
 >	**INPUT/NOSICHECK
 >
 >Into a token stream:
 >
 >        |LINE|,|1500|,|4|,|60|,|60|
 >        |**|INPUT|/|NOSICHECK|
 >
 >But instead what I'm ending up with is:
 >
 >        |LINE|,|1500|,|4|,|60|,|60|
 >        |**INPUT/NOSICHECK|
 >
 >This suggests to me that it's wrong of me to assume that the first
 >rule will be matched first. I can't find much discussion of
 >precedence rules in the ANTLR book.

Essentially, at the top level ANTLR uses the least amount of 
lookahead it can get away with to choose between the lexer rules.  
The order of the rules is unimportant (except when it can't decide 
between them any other way).  Once it is "inside" a rule, however, 
it uses only that rule's own conditions to decide when to stop, 
ignoring the possibility of stopping earlier and generating 
additional tokens.  (Which makes sense when you think about it.)

In this case, on seeing '*', ANTLR enters the TOKEN rule (since no 
other rule could possibly match that character), and won't leave 
the rule until it hits any of a newline, comma, or '>'.  Thus 
everything on that second line is matched as a single TOKEN token.
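
To make that concrete, here is a rough hand-written model of the 
TOKEN rule's loop in Python.  It is only a sketch of the behaviour 
described above, not the code ANTLR generates, and it assumes that 
NEWLINE is a plain '\n' (the original definition wasn't quoted).

```python
# Hand-written model of the TOKEN rule's loop; not ANTLR-generated code.
# Assumption: NEWLINE is just '\n'.
TERMINATORS = {"\n", ",", ">"}  # the characters excluded by ~(NEWLINE|','|'>')


def match_token_rule(text, start):
    """Consume characters from `start` until a terminator; return the end index."""
    end = start
    while end < len(text) and text[end] not in TERMINATORS:
        end += 1
    return end


line = "**INPUT/NOSICHECK\n"
end = match_token_rule(line, 0)
print(repr(line[:end]))  # '**INPUT/NOSICHECK'
```

Nothing inside the loop knows that 'INPUT' could have been matched 
by some other rule, so the whole line comes out as one token.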

If you want to split the TOKEN up at things that might be 
IDENTIFIERs, then you'll need to add CHAR to the list of 
terminating characters in the TOKEN rule.
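
As a rough illustration of the effect, the following Python sketch 
models a lexer in which TOKEN also stops at letters.  It is a 
hand-written approximation, not ANTLR output, and it assumes that 
CHAR means an ASCII letter, that IDENTIFIER is a run of CHARs, and 
that NEWLINE is '\n'; none of those definitions were in the quoted 
grammar, so adjust to match yours.

```python
import string

# Assumptions: CHAR is an ASCII letter, IDENTIFIER is CHAR+, NEWLINE is '\n'.
LETTERS = set(string.ascii_letters)
TERMINATORS = {"\n", ",", ">"} | LETTERS  # TOKEN now also stops at CHAR


def tokenize(text):
    """Split text into token strings using the modified TOKEN rule."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        j = i + 1
        if ch in LETTERS:
            # IDENTIFIER: keep going while we see letters
            while j < len(text) and text[j] in LETTERS:
                j += 1
        elif ch not in TERMINATORS:
            # TOKEN: run until a terminator, which now includes letters
            while j < len(text) and text[j] not in TERMINATORS:
                j += 1
        # otherwise a single-character token (newline, comma or '>')
        tokens.append(text[i:j])
        i = j
    return tokens


print(tokenize("**INPUT/NOSICHECK"))  # ['**', 'INPUT', '/', 'NOSICHECK']
```

With that change the second line splits the way you wanted, and 
the comma-separated first line still comes out the same.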

 >Also, the ANTLRworks debugger can show you the token stream with
 >little red boxes around each token, but I can't seem to work out
 >how to find out the token type for that token, is there something
 >I'm missing here?

Not that I know of; I've wished it'd do that for quite some 
time.  ANTLRworks is fairly weak at sorting out problems in the 
lexer; I've usually found that it's better to write my own unit 
tests for that purpose.