[antlr-interest] simple lexical analysis question

Mon Dec 14 10:46:07 PST 2009

On Mon, Dec 14, 2009 at 12:18 PM, Jean-Claude Durand
<Jean-Claude.Durand at imag.fr> wrote:
> My lexical grammar (I use antlr v3.2):
> lexer grammar Lex;
> options
> { language=Java; }
> WS: ( ' ' | '\t' | '\n' )+ { $channel=HIDDEN; } ;
> FIN : '-FIN-' ;
> Moins : '-' ;
> // Identifiers:
> Idf : ('A'..'Z')+ ;
> I want to enumerate the tokens for the following example (Main.java is in
> the archive):
> VLEG-XLEG-FCINFZU
> And the output is:
> ~/Soft/Antlr/LexJava: java Main test
>  --> [@-1,0:3='VLEG',<7>,1:0]
>  --> [@-1,4:4='-',<6>,1:4]
>  --> [@-1,5:8='XLEG',<7>,1:5]
> line 1:11 mismatched character 'C' expecting 'I'
>  --> [@-1,12:16='INFZU',<7>,1:12]
>  --> [@-1,17:36='                    ',<4>,channel=99,1:17]
> ~/Soft/Antlr/LexJava:
> The lexer is looking for the keyword -FIN-  and not for minus sign followed
> by an identifier (which begins with an F).
>

This is one of the weirdism's (IMHO, the author of the tool, finds
this totally natural, and he's got 20 odd years of parsing
experience), of the lexer is that there is no backtracking.  Hopefully
I've said that correctly, so lets say you have rules like:

TOKEN_1:  'F';
TOKEN_2:  'FOO';
TOKEN_3:  'FOOBAR';
TOKEN_4:  'AR';

Now, if given the input of: 'FOOBFAR', I would expect to get:
TOKEN_2, TOKEN_1, TOKEN_4, however, you'll get an error (I forget which one).

As it moves along parsing the input, it keeps track of which tokens it
could be, and if the next character could be part of any token, all
the tokens it can't be are removed from consideration.

So my understanding of how my example works is as follows:
When the 'F' is consumed rules 1, 2, 3 are eligible, 4 has been removed.
When the 'O' is consumed rules 2, 3 are eligible, rule 1 has been removed.
When the 'O' is consumed rules 2, 3 are eligible.
When the 'B' is consumed rules 3 is eligible, rule 2 is removed (at
which point, you can never return TOKEN_2), which would be my expected
behavior.
When the 'F' is consumed no rules are eligible, throw some type of
ANTLR Exception.

I really wish it'd fall back and give me the last token that did
actually work.  From what I can gather, there is some performance
issue with doing that (I've never understood why).  Instead, you have
to manually implement all of the back tracking yourself.  If you
wanted to do something like that, you'd do something like:

// NOTE: This is inefficient, in reality, you'd declare TOKEN_3 in the
tokens declaration, and after matching 'FOO', see if the 'BAR'
matches, if it does, return TOKEN_3, if it doesn't return TOKEN_2 and
don't consume any more input.
TOKEN_1: 'F';
TOKEN_2: 'FOO';
// The lookahead syntax is from memory, it might be wrong...
// I might also have chosen the wrong one between semantic and
syntactic predicates, I defer to more experience folks again...
TOKEN_3: ('FOOBAR')=> 'FOOBAR';
TOKEN_4: 'AR';

The canonical example of this technique:
http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs

That's fairly complex, if you don't understand what it's doing, I'd
recommend picking up the ANTLR book.

So in your case, it sees the '-F', at which point, the only eligible
token is '-FINS-', if the input isn't that (which in your case it
isn't), an exception will be thrown on the first character that does
not match the lexer expectation.

I have no idea what the best approach for resolving this is.  Others
on the list have a much deeper understanding of language design then I
do.  I'll defer to them on the best approach for this problem.  My
e-mail is just trying to explain what I believe is going on (similar
problems I had confused, and frustrated me for a long time when I
started with ANTLR).

Kirby

>
> Thanks a lot for your help.
> Jean-Claude Durand
> LIG, équipe GETALP
> 385, rue de la Bibliothèque
> BP 53
> 38041 Grenoble cedex 9
> Jean-Claude.Durand at imag.fr
> tél: +33 (0)4 76 51 43 81
> fax: +33 (0)4 76 63 56 86
>
>
>
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
>