[antlr-interest] Understanding priorities in lexing (newbie)
Wincent Colaiuta
win at wincent.com
Thu Jul 12 00:54:59 PDT 2007
El 12/7/2007, a las 7:59, mail.acc at freenet.de escribió:
> I am trying to write a stand alone lexer
> which can cope with arbitrary input without
> reporting a "missmatched token"-error.
>
> It should however recognize some combination
> as Tokens.
>
> This is my first approach:
> --------------------------------------------
> lexer grammar LexerJava;
> KEYWORDA : 'int'|'float';
> KEYWORDB : 'public'|'static'|'void';
> COMMENT : '/*' ( options {greedy=false;} : . )* '*/'
> | '//' ~('n'|'r')* 'r'? 'n'
> ;
> // fallback rule
> ELSE :.;
> --------------------------------------------
>
> On an input like the following it reports
> several errors:
> --------------------------------------------
> 01: public class Test {
> 02: private int varclassTmp = 3;
> 03: [...]
> 04: /* Comment */
> 05: public static void main(String[] av) {
> 06: float i=0;
> 07: float[] sum; // comment
> 08: int tmp;
> 09: [...]
> 10: float internationalization = 4.;
> 11: /* int float */
> 12: }
> 13: }
> 14: /* Comment */
> --------------------------------------------
> line 1:17 mismatched character ' ' expecting 'a'
> line 5:24 mismatched character '(' expecting 't'
> line 5:30 mismatched character 'g' expecting 't'
>
> In some sense I am able to relate these errors,
> because every time a KEYWORD seem to match
> (Test->static; main->int; Strin->int) but I
> can not figure out why rule ELSE doesen't match
> in these cases.
As Ter has already stated, you need a filtering lexer for this. It's
purpose is to try the lexer rules in order, using backtracking to
catch failures, and if no rule matches then the input is advanced one
character (discarded) and ANTLR starts trying all over again. In your
case your ELSE rule means that all characters will match and nothing
will be discarded.
This won't work in a non-filtering lexer because ANTLR is based on
prediction. This may seem counter-intuitive at first but you just
have to accept that as a basic premise ANTLR is all about speed and
that means no backtracking in the event of an error (unless you
explicitly turn it on); ANTLR tries to predict which rule will match
as quickly as possible and once a predication is made race ahead and
match as fast as possible.
For example, the first error occurs when ANTLR sees the "st" of
"Test"; at that point ANTLR predicts that rule KEYWORDB should match
and that the next character should be the "a" of "static", and of
course it isn't. But note that this doesn't occur earlier on in the
line where ANTLR sees the first "s" of "class". That is, seeing "s"
is not enough to make ANTLR predict "static"; it will only predict
that if it sees "st".
Without a filtering lexer you'd have to non-ambiguously specify how
to recognise all the constructs in the language so that everything
could be tokenized, but that isn't very efficient because you're only
interested in a subset of the possible syntax and all the other
tokens would just be thrown away.
To fully understand how all this works you should take a simple
grammar like the one you post and look very, very carefully at the
generated code. You will see how the generated methods are divided
into two sections: a prediction part ("if" statements that use
lookeahead) and a recognition part (code for matching according to
the prediction inside a "switch" statement). I recommend reading
Ter's ANTLR book; combine that with writing real grammars (and
hitting problems like this one) and you will eventually "get" ANTLR.
Cheers,
Wincent
More information about the antlr-interest
mailing list