[antlr-interest] simple lexical analysis question

Mon Dec 14 10:52:48 PST 2009

Crud, there's an error in my example... Sorry about that, it'd blow up
at the 'B'.

One more time:
TOKEN_1:  'F';
TOKEN_2:  'FOO';
TOKEN_3:  'FOOBAR';
TOKEN_4:  'A';
TOKEN_5: 'B';

The input of:
'FOOBAF'

'F' tokens 1, 2, 3 are eligible, 4, 5 is removed.
'O' tokens 2, 3 are eligible, 1 is removed.
'O' tokens 2, 3 are eligible.
'B' token 3 is eligible, 2 is removed.
'A' token 3 is eligible.
'F' token 3 is not eligible, exception thrown.

Rather then generating: TOKEN_2, TOKEN_5, TOKEN_4, TOKEN_1 (which
technically would match), unlike my previous example.

Kirby

On Mon, Dec 14, 2009 at 12:46 PM, Kirby Bohling <kirby.bohling at gmail.com> wrote:
> On Mon, Dec 14, 2009 at 12:18 PM, Jean-Claude Durand
> <Jean-Claude.Durand at imag.fr> wrote:
>> My lexical grammar (I use antlr v3.2):
>> lexer grammar Lex;
>> options
>> { language=Java; }
>> WS: ( ' ' | '\t' | '\n' )+ { $channel=HIDDEN; } ;
>> FIN : '-FIN-' ;
>> Moins : '-' ;
>> // Identifiers:
>> Idf : ('A'..'Z')+ ;
>> I want to enumerate the tokens for the following example (Main.java is in
>> the archive):
>> VLEG-XLEG-FCINFZU
>> And the output is:
>> ~/Soft/Antlr/LexJava: java Main test
>>  --> [@-1,0:3='VLEG',<7>,1:0]
>>  --> [@-1,4:4='-',<6>,1:4]
>>  --> [@-1,5:8='XLEG',<7>,1:5]
>> line 1:11 mismatched character 'C' expecting 'I'
>>  --> [@-1,12:16='INFZU',<7>,1:12]
>>  --> [@-1,17:36='                    ',<4>,channel=99,1:17]
>> ~/Soft/Antlr/LexJava:
>> The lexer is looking for the keyword -FIN-  and not for minus sign followed
>> by an identifier (which begins with an F).
>>
>
>
> This is one of the weirdism's (IMHO, the author of the tool, finds
> this totally natural, and he's got 20 odd years of parsing
> experience), of the lexer is that there is no backtracking.  Hopefully
> I've said that correctly, so lets say you have rules like:
>
> TOKEN_1:  'F';
> TOKEN_2:  'FOO';
> TOKEN_3:  'FOOBAR';
> TOKEN_4:  'AR';
>
> Now, if given the input of: 'FOOBFAR', I would expect to get:
> TOKEN_2, TOKEN_1, TOKEN_4, however, you'll get an error (I forget which one).
>
> As it moves along parsing the input, it keeps track of which tokens it
> could be, and if the next character could be part of any token, all
> the tokens it can't be are removed from consideration.
>
> So my understanding of how my example works is as follows:
> When the 'F' is consumed rules 1, 2, 3 are eligible, 4 has been removed.
> When the 'O' is consumed rules 2, 3 are eligible, rule 1 has been removed.
> When the 'O' is consumed rules 2, 3 are eligible.
> When the 'B' is consumed rules 3 is eligible, rule 2 is removed (at
> which point, you can never return TOKEN_2), which would be my expected
> behavior.
> When the 'F' is consumed no rules are eligible, throw some type of
> ANTLR Exception.
>
> I really wish it'd fall back and give me the last token that did
> actually work.  From what I can gather, there is some performance
> issue with doing that (I've never understood why).  Instead, you have
> to manually implement all of the back tracking yourself.  If you
> wanted to do something like that, you'd do something like:
>
> // NOTE: This is inefficient, in reality, you'd declare TOKEN_3 in the
> tokens declaration, and after matching 'FOO', see if the 'BAR'
> matches, if it does, return TOKEN_3, if it doesn't return TOKEN_2 and
> don't consume any more input.
> TOKEN_1: 'F';
> TOKEN_2: 'FOO';
> // The lookahead syntax is from memory, it might be wrong...
> // I might also have chosen the wrong one between semantic and
> syntactic predicates, I defer to more experience folks again...
> TOKEN_3: ('FOOBAR')=> 'FOOBAR';
> TOKEN_4: 'AR';
>
> The canonical example of this technique:
> http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs
>
> That's fairly complex, if you don't understand what it's doing, I'd
> recommend picking up the ANTLR book.
>
> So in your case, it sees the '-F', at which point, the only eligible
> token is '-FINS-', if the input isn't that (which in your case it
> isn't), an exception will be thrown on the first character that does
> not match the lexer expectation.
>
> I have no idea what the best approach for resolving this is.  Others
> on the list have a much deeper understanding of language design then I
> do.  I'll defer to them on the best approach for this problem.  My
> e-mail is just trying to explain what I believe is going on (similar
> problems I had confused, and frustrated me for a long time when I
> started with ANTLR).
>
> Kirby
>
>>
>> Thanks a lot for your help.
>> Jean-Claude Durand
>> LIG, équipe GETALP
>> 385, rue de la Bibliothèque
>> BP 53
>> 38041 Grenoble cedex 9
>> Jean-Claude.Durand at imag.fr
>> tél: +33 (0)4 76 51 43 81
>> fax: +33 (0)4 76 63 56 86
>>
>>
>>
>>
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe:
>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>>
>