[antlr-interest] I can't understand lexer behaivour. Lexers are LL(*)?

Mon Oct 12 16:42:00 PDT 2009

Thanks Kirby,

But if that is correct the lexer is LL(1), isn't it?
I think that for my grammar (~'$' | '$' ~'{')+ only LL(2) is needed to
see (without any backtracking) which is the correct rule, isn't it?

I've read the old thread you said but still I don't understand what is
the exact logic of the lexer. And why it needs the syntactic
predicates in my case.
In the old thread someone points that lexer rules can't see other
rules, but in my case the lexer only needs to see that rule to view
that when it encounters ${ it has to end the token because there are
no alternatives to continue the token.

Cheers,

On Tue, Oct 13, 2009 at 1:02 AM, Kirby Bohling <kirby.bohling at gmail.com> wrote:
> On Mon, Oct 12, 2009 at 4:19 PM, Joan Pujol <joanpujol at gmail.com> wrote:
>> Hi,
>>
>> I've a problem with the lexer. Given that grammar:
>>
>>
>> TEXT    :       (~'$' | '$' ~'{')+;
>> OTHER   :        .;
>> program :       (TEXT|OTHER)+;
>>
>> If I try with input     'Hola ${ ' it does'nt works. And I can't understand why.
>> If I look at generated code of rule TEXT it doesn't use a lookahead of
>>  2. In fact when it finds a '$' it enters in the subrule '$' ~'{
>> without using LA(2) to see that the next caracter is a { and that then
>> it can't enter the subrule.
>> In fact as it only compares LA(1)='$' it enters the subrule but then
>> when it founds the{ it fails with a lexical error.
>>
>> And I can't understand why the lexer isn't capable to generate code to
>> compare with LA(1)='$' and LA(2)!='{' to enter the subrule.
>>
>> Is that the correct behaivour? Why de lexer can aproximate the K=2?
>> Someone can explain why?
>>
>> A lot of thanks in advance,
>>
>
>
> There are far too many threads that refer to the FAQ entry, but not
> enough actual links (I've always found it hard to find while searching
> the FAQ, maybe I'm just blind or something).
>
> See here:
> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
>
> This is a common problem.  Terrance gave some discussion in this thread:
>
> http://www.antlr.org/pipermail/antlr-interest/2007-October/024273.html
>
> It'd be great if the lexer would emit errors/warnings noting that
> there is input that can end in a no mans land and the lexer will blow
> up.  I don't think it will do that as of 3.1.3 (I haven't played with
> newer versions).
>
> The crux of the problem is (with my limited understanding):  Once the
> lexer/DFA decides: "A longer match *MIGHT* exist", there is no more
> backing up, or use a rule that matched the longest prefix so far.  So
> if a rule looks like it might match longer input, the rules that match
> the shorter input are thrown away.  A trivial case would be something
> like this:
>
> (From memory, I haven't actually put this into ANTLRWorks, but I know
> I've seen something similar to this setup have problems where
> lookahead was the only solution):
> ONE_BRACE: '{';
> THREE_BRACE: '{{{';
>
>
> If you feed it the input of '{{', it will blow up upon inside the
> lexer, despite the fact that the would would match ONE_BRACE,
> ONE_BRACE.  Once the lexer/DFA decides, there have been enough braces
> that it could match a longer rule then ONE_BRACE any more, the lexer
> will never emit "ONE_BRACE"...  So the THREE_BRACE is all that is
> left, once the "match" is called, it gives some type of
> RecognitionException.  It might work around that if backtracking is
> enabled, I'm not sure.  I know I've had problems like this where
> enabling backtracking resolved some of them, but not others.  It's
> annoying, but ANTLR is useful enough I just get over it.
>
> I'm assuming that the it is because the '$' ~'{' is longer and could
> match, so the lexer/DFA has decided that the OTHER rule is no longer
> an option.  You might try manually using the lookahead or a predicate
> like the following:
>
> // Syntactic Predicate I believe
> TEXT    :       (~'$' |  ('$' ~'{') => ('$' ~'{'))+;
>
> // or
>
> // Semantic Predicate implementing simple look ahead I believe
> TEXT    :       (~'$' | { input.LA(1) == '$' && input.LA(2) != '{' } |  '$')+;
>
> The first one is slower, but I believe more portable, the second one
> is faster, but the lexer will have to be adjusted for each language
> target (the first one would work in the Java and C targets, the second
> one would only work in the Java target as I understand it).
>
> Cheers,
>   Kirby
>
>
>>
>>
>>
>> --
>> Joan Jesús Pujol Espinar
>> http://www.joanpujol.cat
>> http://lujop.deviantart.com
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>

-- 
Joan Jesús Pujol Espinar
http://www.joanpujol.cat
http://lujop.deviantart.com