[antlr-interest] bug in 3.0b6: identifier/keyword or underscore problem?

Mon Feb 26 10:32:59 PST 2007

On Monday 26 February 2007 15:46:03 Martin d'Anjou wrote:
>>>> lexer grammar DUMMY_Lexer;
>>>> options { filter=true; }
>>>>
>>>> INT          : 'int' ;
>>>> SEMI         : ';' ;
>>>> WS           :  (  ' '| '\t'| '\r' | '\n' )+ {$channel=HIDDEN;} ;
>>>> IDENTIFIER   : ('a'..'z'|'A'..'Z'|'_')+;
>>>
>>>Why are you using the filter option? This option causes ANTLR to try
>>>the tokens one-by-one. It continues at the next token if the current
>>>token does not match. So on the input 'intt' it will match an INT
>>>token first, followed by the IDENTIFIER 't'. When you remove the filter
>>>option, it should match a single IDENTIFIER token.
>>
>> I guess the real reason is I am lazy. I did not want to tokenize
>> everything contained in the input (I could have used the skip feature -
>> but I was too lazy for that too!).
>>
>> I still don't understand why the lexer would break the token at a
>> character identified in a rule the lexer can match, and what it has to
>> do with the filter=true. Perhaps an example would help me get that.
>
> Suppose the input is 'id_int int_id'. With filter, it first tries to
> match 'int' against the input; this fails. SEMI also fails, as does WS.
> Finally, with IDENTIFIER there is a match: the entire 'id_int' is
> matched. Now, it continues at the ' '. Again, it first tries INT and
> SEMI, but only WS succeeds.
>
> Now, it continues with 'int_id'. First, it tries to match INT, which 
> succeeds.

This is where I do not understand the behavior. I see how it can match
INT, but my question is: why isn't it trying for the longest match, which
would be IDENTIFIER? Perhaps this is because some languages do not require
a separator between tokens (BASIC? FORTRAN? J?) and that is what ANTLR
wants to support? If so, wouldn't it make more sense to have a separate
option selecting that behavior rather than lumping it in with filter=true,
which also happens to mean "ignore unspecified tokens"?
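
To make the two behaviors concrete, here is a toy Python model of both
strategies. It is only a sketch and not ANTLR's generated lexer: the RULES
table and the names first_match_tokenize and longest_match_tokenize are
made up for illustration, Python regular expressions stand in for the lexer
rules, and skipping one character when nothing matches is an assumption
about how filter mode ignores unspecified input.

```python
import re

# Hypothetical rule table: same rules, same order as in the grammar above.
RULES = [
    ("INT", re.compile(r"int")),
    ("SEMI", re.compile(r";")),
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]+")),
]


def first_match_tokenize(text):
    """Model of the described filter=true behavior: rules are tried in
    grammar order and the first one that matches wins."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # assumption: unmatched input is skipped one char at a time
    return tokens


def longest_match_tokenize(text):
    """Model of longest-match lexing: every rule is tried and the longest
    match wins; on a tie, the rule listed first is kept."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            pos += 1  # nothing matched here; skip one character
            continue
        tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens
```

On 'id_int int_id' the first-match model yields IDENTIFIER 'id_int', WS,
INT 'int', IDENTIFIER '_id', which is the split being discussed. The
longest-match model yields IDENTIFIER 'id_int', WS, IDENTIFIER 'int_id',
while a bare 'int' still comes out as INT because the tie goes to the rule
listed first.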

Regards,
Martin