[antlr-interest] Understanding priorities in lexing (newbie)

Wincent Colaiuta win at wincent.com
Thu Jul 12 00:54:59 PDT 2007


El 12/7/2007, a las 7:59, mail.acc at freenet.de escribió:

> I am trying to write a stand alone lexer
> which can cope with arbitrary input without
> reporting a "mismatched token" error.
>
> It should however recognize some combination
> as Tokens.
>
> This is my first approach:
> --------------------------------------------
> lexer grammar LexerJava;
> KEYWORDA : 'int'|'float';
> KEYWORDB : 'public'|'static'|'void';
> COMMENT  : '/*' ( options {greedy=false;} : . )* '*/'
>     	 | '//' ~('\n'|'\r')* '\r'? '\n'
> 	 ;
> // fallback rule
> ELSE	 : . ;
> --------------------------------------------
>
> On an input like the following it reports
> several errors:
> --------------------------------------------
> 01: public class Test {
> 02:     private int varclassTmp = 3;
> 03:     [...]
> 04:     /* Comment */
> 05:     public static void main(String[] av) {
> 06:          float i=0;
> 07:          float[] sum; // comment
> 08:          int tmp;
> 09:          [...]
> 10:          float internationalization = 4.;
> 11:          /* int float */
> 12:     }
> 13: }
> 14: /* Comment */
> --------------------------------------------
> line 1:17 mismatched character ' ' expecting 'a'
> line 5:24 mismatched character '(' expecting 't'
> line 5:30 mismatched character 'g' expecting 't'
>
> In some sense I am able to relate these errors,
> because every time a KEYWORD seems to match
> (Test->static; main->int; Strin->int) but I
> cannot figure out why rule ELSE doesn't match
> in these cases.

As Ter has already stated, you need a filtering lexer for this. Its  
purpose is to try the lexer rules in order, using backtracking to  
catch failures; if no rule matches, the input is advanced one  
character (that character is discarded) and ANTLR starts trying all  
over again. In your case the ELSE rule means that every character  
will match and nothing will ever be discarded.
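
Concretely, the filtering version of your grammar drops the ELSE rule  
and sets the filter option instead. Something like this (a sketch,  
assuming ANTLR 3, which supports filter=true in lexer grammars; I  
haven't run this exact grammar):
--------------------------------------------
lexer grammar LexerJava;
options { filter = true; }  // try rules in order; on total failure,
                            // skip one character and start over
KEYWORDA : 'int' | 'float' ;
KEYWORDB : 'public' | 'static' | 'void' ;
COMMENT  : '/*' ( options {greedy=false;} : . )* '*/'
         | '//' ~('\n'|'\r')* '\r'? '\n'
         ;
// no ELSE rule: characters that match no rule are silently discarded
--------------------------------------------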

This won't work in a non-filtering lexer because ANTLR is based on  
prediction. This may seem counter-intuitive at first, but you have  
to accept as a basic premise that ANTLR is all about speed, and that  
means no backtracking in the event of an error (unless you  
explicitly turn it on): ANTLR tries to predict which rule will match  
as quickly as possible, and once a prediction is made it races ahead  
and matches as fast as possible.

For example, the first error occurs when ANTLR sees the "st" of  
"Test"; at that point ANTLR predicts that rule KEYWORDB should match  
and that the next character should be the "a" of "static", and of  
course it isn't. Note that this doesn't happen earlier in the line,  
where ANTLR sees the first "s" of "class": seeing "s" alone is not  
enough to make ANTLR predict "static"; it will only predict that  
after seeing "st".

Without a filtering lexer you'd have to specify unambiguously how to  
recognise all the constructs in the language so that everything  
could be tokenized, but that isn't very efficient when you're only  
interested in a subset of the possible syntax and all the other  
tokens would just be thrown away.
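
If you did want to go the non-filtering route, you'd have to add  
rules so that every possible input matches some token; for instance  
an identifier rule that subsumes words like "Test" and "String", so  
the lexer never gets as far as a bad keyword prediction. A sketch  
(untested, rule names are my own invention):
--------------------------------------------
lexer grammar LexerJavaFull;
KEYWORDA : 'int' | 'float' ;
KEYWORDB : 'public' | 'static' | 'void' ;
COMMENT  : '/*' ( options {greedy=false;} : . )* '*/'
         | '//' ~('\n'|'\r')* '\r'? '\n'
         ;
// matches "Test", "String", etc.; keywords still win for exact
// matches like "int" because they are listed first
IDENT    : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
WS       : (' '|'\t'|'\r'|'\n')+ ;
OTHER    : . ;  // anything else, one character at a time
--------------------------------------------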

To fully understand how all this works, take a simple grammar like  
the one you posted and look very, very carefully at the generated  
code. You will see how the generated methods are divided into two  
sections: a prediction part ("if" statements that use lookahead) and  
a recognition part (code that matches according to the prediction,  
inside a "switch" statement). I recommend reading Ter's ANTLR book;  
combine that with writing real grammars (and hitting problems like  
this one) and you will eventually "get" ANTLR.

Cheers,
Wincent
