[antlr-interest] antlr-interest Digest, Vol 27, Issue 48

Mon Feb 26 07:19:52 PST 2007

On Monday 26 February 2007 15:46:03 Martin d'Anjou wrote:
> >> lexer grammar DUMMY_Lexer;
> >> options { filter=true; }
> >>
> >> INT          : 'int' ;
> >> SEMI         : ';' ;
> >> WS           :  (  ' '| '\t'| '\r' | '\n' )+ {$channel=HIDDEN;} ;
> >> IDENTIFIER   : ('a'..'z'|'A'..'Z'|'_')+;
> >
> >Why are you using the filter option? This option causes ANTLR to try the
> >tokens one-by-one. It continues at the next token if the current token
> >does not match. So on the input 'intt' it will match an INT token first,
> >followed by the IDENTIFIER 't'. When you remove the filter option, it
> >should match a single IDENTIFIER token.
>
> I guess the real reason is I am lazy. I did not want to tokenize
> everything contained in the input (I could have used the skip feature -
> but I was too lazy for that too!).
>
> I still don't understand why the lexer would break the token at a
> character identified in a rule the lexer can match, and what it has to
> do with the filter=true. Perhaps an example would help me get that.

Suppose the input is 'id_int int_id'.
With filter=true, the lexer tries the rules in the order they appear in the
grammar and takes the first one that matches. At the start of 'id_int' it
first tries INT, which fails. SEMI also fails, as does WS. Finally,
IDENTIFIER matches and consumes the entire 'id_int'. Lexing continues at
the ' '. Again INT and SEMI are tried first and fail, then WS succeeds.
Lexing continues with 'int_id'. INT is tried first and succeeds, so the INT
token is returned before IDENTIFIER is ever tried. Lexing then continues
with '_id', which results in an IDENTIFIER.
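
The walk-through above can be reproduced with a small simulation. This is
only a sketch in Python, not ANTLR's generated code: the regular expressions
stand in for the four rules of the grammar, lex_first_match mimics the
rule-by-rule behaviour described for filter=true, and lex_longest_match is
an assumed approximation of what happens without the filter option (take the
rule that consumes the most input). All function names are hypothetical.

```python
import re

# Rules in grammar order, standing in for the DUMMY_Lexer rules in this thread.
RULES = [
    ("INT", re.compile(r"int")),
    ("SEMI", re.compile(r";")),
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_]+")),
]


def lex_first_match(text):
    """Filter-style sketch: try rules in order, accept the first that matches."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched here: skip one character and try again.
            pos += 1
    return tokens


def lex_longest_match(text):
    """Assumed non-filter approximation: the rule consuming the most input wins.

    Ties go to the rule listed first, so a lone 'int' is still an INT.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens


if __name__ == "__main__":
    for token in lex_first_match("id_int int_id"):
        print("first match:  ", token)
    for token in lex_longest_match("id_int int_id"):
        print("longest match:", token)
```

On 'id_int int_id' the first-match version yields IDENTIFIER 'id_int', WS,
INT 'int', IDENTIFIER '_id', while the longest-match version yields
IDENTIFIER 'id_int', WS, IDENTIFIER 'int_id'.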

The filter option is only useful if you want to lex snippets of the input. 
These snippets should have clear delimiters. For more information on the 
filter option:
http://www.antlr.org/wiki/display/ANTLR3/Lexical+filters

Best regards,
Emond