[antlr-interest] simple lexical analysis question

Wed Dec 16 02:39:01 PST 2009

Thanks for your answers, I now understand the strategy of lexers.
The left factoring you propose does not work any better: because of the
letter 'F' at the start of the identifier following the minus sign, the
problem remains the same for the example '-FOO -FIN-'!

~/Soft/Antlr/LexJava: java Main test
line 1:2 mismatched character 'O' expecting 'I'
  --> [@-1,3:3='O',<6>,1:3]
  --> [@-1,4:4='\n',<7>,channel=99,1:4]
  --> [@-1,5:9='-FIN-',<5>,2:0]
  --> [@-1,10:30='                    \n',<7>,channel=99,2:5]
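
For illustration only, the tokenization hoped for here can be sketched as a
small hand-written Java scanner. This is not ANTLR-generated code and not the
Main.java from the archive; the class name LookaheadLexer, the method
tokenize, and the "TYPE:text" output format are hypothetical. The one idea it
shows is checking the whole keyword '-FIN-' before committing to FIN, and
otherwise falling back to a lone minus sign:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; it mirrors the token names of the
// Lex grammar (WS, FIN, Moins, Idf) but is not produced by ANTLR.
public class LookaheadLexer {
    private static final String FIN = "-FIN-";

    /** Returns tokens as "TYPE:text" strings; whitespace is skipped. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n') {
                i++; // WS: dropped here (the grammar hides it on a channel)
            } else if (input.startsWith(FIN, i)) {
                // The full keyword is present, so it is safe to emit FIN.
                tokens.add("FIN:" + FIN);
                i += FIN.length();
            } else if (c == '-') {
                // Only a prefix of '-FIN-' (or none) matched: plain minus.
                tokens.add("Moins:-");
                i++;
            } else if (c >= 'A' && c <= 'Z') {
                int start = i;
                while (i < input.length()
                        && input.charAt(i) >= 'A' && input.charAt(i) <= 'Z') {
                    i++;
                }
                tokens.add("Idf:" + input.substring(start, i));
            } else {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at offset " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("-FOO -FIN-"));
        System.out.println(tokenize("VLEG-XLEG-FCINFZU"));
    }
}
```

With this sketch, '-FOO -FIN-' yields Moins, Idf (FOO), FIN, and
'VLEG-XLEG-FCINFZU' yields Idf, Moins, Idf, Moins, Idf (FCINFZU), because the
scanner never commits to FIN on the strength of '-F' alone.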

Jean-Claude Durand

LIG, GETALP team
385, rue de la Bibliothèque
BP 53
38041 Grenoble cedex 9
France

Jean-Claude.Durand at imag.fr
tel: +33 (0)4 76 51 43 81
fax: +33 (0)4 76 63 56 86

On 14 Dec 2009, at 19:35, John B. Brodie wrote:

> Greetings!
> On Mon, 2009-12-14 at 19:18 +0100, Jean-Claude Durand wrote:
>> My lexical grammar (I use antlr v3.2):
>>
>> lexer grammar Lex;
>> options
>> { language=Java; }
>>
>>
>> WS: ( ' ' | '\t' | '\n' )+ { $channel=HIDDEN; } ;
>>
>>
>> FIN : '-FIN-' ;
>> Moins : '-' ;
>>
>>
>> // Identifiers:
>> Idf : ('A'..'Z')+ ;
>>
>>
>> I want to enumerate the tokens for the following example (Main.java is
>> in the archive):
>>
>>
>> VLEG-XLEG-FCINFZU
>>
>>
>> And the output is:
>>
>>
>> ~/Soft/Antlr/LexJava: java Main test
>> --> [@-1,0:3='VLEG',<7>,1:0]
>> --> [@-1,4:4='-',<6>,1:4]
>> --> [@-1,5:8='XLEG',<7>,1:5]
>> line 1:11 mismatched character 'C' expecting 'I'
>> --> [@-1,12:16='INFZU',<7>,1:12]
>> --> [@-1,17:36='                    ',<4>,channel=99,1:17]
>> ~/Soft/Antlr/LexJava:
>>
>>
>> The lexer is looking for the keyword -FIN- and not for a minus sign
>> followed by an identifier (which begins with an F).
>
> This is a well-known "feature" of ANTLR lexers: once the lexer sees the
> left prefix of a token, it commits itself to only that token and will not
> back up and consider other possibilities.
>
> You need to left-factor your FIN and Moins rules. Something like the
> following (off the top of my head, untested, but gives the general
> idea):
>
> lexer grammar Lex;
> options { language=Java; }
> tokens { FIN; }
>
> WS: ( ' ' | '\t' | '\n' )+ { $channel=HIDDEN; } ;
>
> Moins : '-' ( 'FIN-' { $type = FIN; } )?;
>
> // Identifiers:
> Idf : ('A'..'Z')+ ;
>
>