[antlr-interest] I can't understand lexer behaivour. Lexers are LL(*)?

Mon Oct 12 16:02:05 PDT 2009

On Mon, Oct 12, 2009 at 4:19 PM, Joan Pujol <joanpujol at gmail.com> wrote:
> Hi,
>
> I've a problem with the lexer. Given that grammar:
>
>
> TEXT    :       (~'$' | '$' ~'{')+;
> OTHER   :        .;
> program :       (TEXT|OTHER)+;
>
> If I try with input     'Hola ${ ' it does'nt works. And I can't understand why.
> If I look at generated code of rule TEXT it doesn't use a lookahead of
>  2. In fact when it finds a '$' it enters in the subrule '$' ~'{
> without using LA(2) to see that the next caracter is a { and that then
> it can't enter the subrule.
> In fact as it only compares LA(1)='$' it enters the subrule but then
> when it founds the{ it fails with a lexical error.
>
> And I can't understand why the lexer isn't capable to generate code to
> compare with LA(1)='$' and LA(2)!='{' to enter the subrule.
>
> Is that the correct behaivour? Why de lexer can aproximate the K=2?
> Someone can explain why?
>
> A lot of thanks in advance,
>

There are far too many threads that refer to the FAQ entry, but not
enough actual links (I've always found it hard to find while searching
the FAQ, maybe I'm just blind or something).

See here:
http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs

This is a common problem.  Terrance gave some discussion in this thread:

http://www.antlr.org/pipermail/antlr-interest/2007-October/024273.html

It'd be great if the lexer would emit errors/warnings noting that
there is input that can end in a no mans land and the lexer will blow
up.  I don't think it will do that as of 3.1.3 (I haven't played with
newer versions).

The crux of the problem is (with my limited understanding):  Once the
lexer/DFA decides: "A longer match *MIGHT* exist", there is no more
backing up, or use a rule that matched the longest prefix so far.  So
if a rule looks like it might match longer input, the rules that match
the shorter input are thrown away.  A trivial case would be something
like this:

(From memory, I haven't actually put this into ANTLRWorks, but I know
I've seen something similar to this setup have problems where
lookahead was the only solution):
ONE_BRACE: '{';
THREE_BRACE: '{{{';

If you feed it the input of '{{', it will blow up upon inside the
lexer, despite the fact that the would would match ONE_BRACE,
ONE_BRACE.  Once the lexer/DFA decides, there have been enough braces
that it could match a longer rule then ONE_BRACE any more, the lexer
will never emit "ONE_BRACE"...  So the THREE_BRACE is all that is
left, once the "match" is called, it gives some type of
RecognitionException.  It might work around that if backtracking is
enabled, I'm not sure.  I know I've had problems like this where
enabling backtracking resolved some of them, but not others.  It's
annoying, but ANTLR is useful enough I just get over it.

I'm assuming that the it is because the '$' ~'{' is longer and could
match, so the lexer/DFA has decided that the OTHER rule is no longer
an option.  You might try manually using the lookahead or a predicate
like the following:

// Syntactic Predicate I believe
TEXT    :       (~'$' |  ('$' ~'{') => ('$' ~'{'))+;

// or

// Semantic Predicate implementing simple look ahead I believe
TEXT    :       (~'$' | { input.LA(1) == '$' && input.LA(2) != '{' } |  '$')+;

The first one is slower, but I believe more portable, the second one
is faster, but the lexer will have to be adjusted for each language
target (the first one would work in the Java and C targets, the second
one would only work in the Java target as I understand it).

Cheers,
   Kirby

>
>
>
> --
> Joan Jesús Pujol Espinar
> http://www.joanpujol.cat
> http://lujop.deviantart.com
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>