[antlr-interest] v3 lexer cannot tell keyword from identifier (very strange)

Thu Feb 22 15:14:16 PST 2007

On Thu, 22 Feb 2007, Miguel Ping wrote:

> Doesn't it have to do with precedence? My (maybe stupid) guess is that
> antlr is trying to match int before trying to match int_id.

I tried putting the IDENTIFIER token definition first, and when I do that, 
I get:

       line 1:0 required (...)+ loop did not match anything at input 'int'

So I don't know what's going on. It's like the tokenizer is non-greedy for 
some reason.
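
To illustrate what I mean by greedy, here is a minimal sketch in plain Java
of a longest-match tokenizer. It is not ANTLR-generated code and makes no
claim about how the v3 lexer works internally. The class name
LongestMatchDemo, the tokenize helper, and the regex versions of INT, SEMI,
IDENTIFIER and WS are hypothetical, chosen only for illustration. With
longest-match, 'int_id' should come out as one IDENTIFIER:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Hypothetical regex stand-ins for four of the lexer rules.
    // Insertion order is kept so that ties go to the rule listed first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("INT", Pattern.compile("int"));
        RULES.put("SEMI", Pattern.compile(";"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-z_]+"));
        RULES.put("WS", Pattern.compile("[ \\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            // Try every rule at the current position and keep the longest match.
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strict '>' means an equal-length match never displaces an earlier rule.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                pos++; // no rule matched: skip one character
                continue;
            }
            if (!bestName.equals("WS")) { // whitespace is dropped
                tokens.add(bestName + "(" + input.substring(pos, pos + bestLen) + ")");
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [INT(int), IDENTIFIER(id), SEMI(;), INT(int), IDENTIFIER(int_id), SEMI(;)]
        System.out.println(tokenize("int id;\nint int_id;"));
    }
}
```

The bare word 'int' matches both INT and IDENTIFIER at the same length, so
the earlier rule wins. For 'int_id' the IDENTIFIER match is longer, so rule
order should not matter there.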

As I said, it is very strange.

Martin

> On 2/22/07, Martin d'Anjou <martin.danjou at neterion.com> wrote:
>> Hi,
>> 
>> I have a very strange problem in 3.0b6. Given the input text:
>>
>>      int id;
>>      int int_id;
>> 
>> The error:
>>
>>     line 2:4 mismatched input 'int' expecting IDENTIFIER
>> 
>> It is mistaking "int_id" for "int", treating the underscore as a token
>> separator. The (ridiculous looking) lexer is:
>>
>>     lexer grammar DUMMY_Lexer;
>>     options { filter=true; }
>>
>>     MOD          : 'mod' ;
>>     END          : 'end' ;
>>     DEF          : 'def' ;
>>     INC          : 'inc' ;
>>     PAR          : 'par' ;
>>     INP          : 'inp' ;
>>     OUT          : 'out' ;
>>     INO          : 'ino' ;
>>     INT          : 'int' ;
>>     WER          : 'wer' ;
>>     COMMA        : ',' ;
>>     SEMI         : ';' ;
>>     L_PAREN      : '(' ;
>>     R_PAREN      : ')' ;
>>     ASSIGN       : '=' ;
>>     SHARP        : '#' ;
>>     LSHIFT       : '<<' ;
>>     MULT         : '*' ;
>>     MINUS        : '-' ;
>>     PLUS         : '+' ;
>>     COLON        : ':' ;
>>     LTEQ         : '<=' ;
>>     L_CURLY      : '{' ;
>>     R_CURLY      : '}' ;
>>     OR           : '|' ;
>>     SQUARE       : '[]' ;
>>     QUOTE        : '"' ;
>>     DIGIT        : '0' ;
>>     WS           : ( ' ' | EOL )+ {$channel=HIDDEN;} ;
>>     EOL          : ('\r\n'|'\r'|'\n') ;
>>     LetterC      : 'c' | Nothing ;
>>     Nothing      : 't' ;
>>     SL_COMMENT   : 'a' ;
>>     ML_COMMENT   : '/' ;
>>     BASE         : 'b' ;
>>     BASE_NUM     : DIGIT+ (BASE DIGIT+)? ;
>>
>>     IDENTIFIER   : ('a'..'z'|UNDERSCORE)+ ;
>>
>>     fragment
>>     UNDERSCORE  :  '_' ;
>> 
>> The only token I was able to remove was the QUESTION : '?'; token. When I
>> remove any other token (such as MOD), the error changes to:
>>
>>      line 1:0 required (...)+ loop did not match anything at input 'int'
>> 
>> Which makes it even weirder...
>> 
>> Now the parser is fairly minimal:
>>
>>     parser grammar DUMMY_Parser;
>>     options {
>>       tokenVocab=DUMMY_Lexer;
>>     }
>>
>>     source_text :
>>       int_defs+
>>       ;
>>
>>     int_defs :
>>       INT            { System.out.print("int "); }
>>       id=IDENTIFIER  { System.out.print($id.text); }
>>       SEMI           { System.out.println(";"); }
>>     ;
>> 
>> Help!!! (and thanks!)
>> Martin
>> 
>