[antlr-interest] Problem with lexer rule for an optional suffix

Sat Nov 14 10:15:10 PST 2009

Be wary of moving lexer rules into the parser! See below...
On Sat, 2009-11-14 at 09:17 -0800, Jim Idle wrote:
> 
> > -----Original Message-----
> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > bounces at antlr.org] On Behalf Of Scott Oakes
> > Sent: Saturday, November 14, 2009 1:08 AM
> > To: antlr-interest at antlr.org
> > Subject: [antlr-interest] Problem with lexer rule for an optional
> > suffix
> > 
> > Hoping for some newbie help on the following lexer.
> > 
> >   fragment DIGIT:      '0'..'9';
> >   fragment LETTER: ('a'..'z'|'A'..'Z');
> > 
> >   ID:  (LETTER | '.')+ ('.' DIGIT+)?
> >        | DIGIT+
> >       ;
> 
> Well this rule is wrong. It allows:
> 
> .....99
> A..44
> 
> But not A.56
> 
> You would need:
> 
> ID : (LETTER+) (('.' LETTER)=>'.' LETTER+)* (('.' DIGIT)=> '.' DIGIT+)? ;
> 
> 
> But you really want to do such things in the parser as you usually want to dissect the identifier. If a part of the id can only be numbers, then you could do it in the lexer, but then any errors will come out from the lexer and be very confusing.
> 
> The general idea is to cover everything in the lexer so it does not issue messages, but leave context out of the lexer. Then in the parser, defer as much error handling as possible to the tree walker. This way you get much better error messages. With your example:
> 
> a.b4.f.5
> 
> Lexer: Unexpected character at '4'
> Parser: Extraneous token '4'
> Walker (Though you can do this one in the parser): 'b4' is not a valid component of multipart identifier
> 
> So:
> 
> ID : LETTER+;
> NUM : DIGIT+;
> id : id_part (DOT^ id_part)*  { actions to check in Java go here if you have no tree walker } ;

Jim makes an excellent argument for hoisting these lexer rules up into
the parser -- better error checking.
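
As a rough illustration of the kind of check Jim means (this is plain
Python rather than an ANTLR action, and the name check_id_parts is
made up for this example), the code below validates each component of
a dotted identifier after it has been split apart, so the message can
name the offending part instead of a single character:

```python
import re

# Mirrors the ID : LETTER+ and NUM : DIGIT+ lexer rules quoted above.
# Assumption: an id_part is either an ID or a NUM.
PART = re.compile(r"[A-Za-z]+|[0-9]+")

def check_id_parts(text):
    """Return a list of error messages for a dotted identifier."""
    errors = []
    for part in text.split("."):
        # A component must be all letters or all digits, not a mix.
        if not PART.fullmatch(part):
            errors.append(
                "'%s' is not a valid component of multipart identifier" % part
            )
    return errors

print(check_id_parts("a.b4.f.5"))
```

In a real grammar the same test would live in the parser action or the
tree walker, where the whole component is available.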

But you need to be aware of interactions with other lexer rules. In
particular, any token you put on the HIDDEN channel will be accepted
between *any* two of your other tokens.

So if you have Jim's 3 rules plus a lexer rule to ignore whitespace,
like:

ID : LETTER+ ;
NUM : DIGIT+ ; 
id : id_part (DOT^ id_part)* ; //also rules for id_part and DOT
WS : (' ' | '\t' | '\n' | '\r')+ { $channel = HIDDEN; } ;

then the input string "foo   .   bar" (note the spaces around the .)
would be accepted: the lexer sends the whitespace to the hidden channel,
so the parser sees only the three-token sequence ID DOT ID.
If this is acceptable in your language, you should definitely follow
Jim's advice...
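
To make the hidden-channel effect concrete, here is a small simulation
in plain Python. It is only a sketch and does not use the ANTLR
runtime; tokenize, visible and is_contiguous are hypothetical helper
names. It mimics the ID, NUM, DOT and WS rules, shows what the parser
would see, and shows one way to detect that whitespace sat between the
parts by comparing character positions:

```python
import re

# Token patterns mirroring the ID, NUM, DOT and WS rules above.
TOKEN_RE = re.compile(
    r"(?P<ID>[A-Za-z]+)|(?P<NUM>[0-9]+)|(?P<DOT>\.)|(?P<WS>[ \t\r\n]+)"
)

def tokenize(text):
    """Return (type, text, start, channel) tuples; WS goes to HIDDEN."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        channel = "HIDDEN" if m.lastgroup == "WS" else "DEFAULT"
        tokens.append((m.lastgroup, m.group(), pos, channel))
        pos = m.end()
    return tokens

def visible(tokens):
    """What the parser sees: only tokens on the default channel."""
    return [t for t in tokens if t[3] == "DEFAULT"]

def is_contiguous(tokens):
    """True if each visible token starts where the previous one ended."""
    vis = visible(tokens)
    return all(a[2] + len(a[1]) == b[2] for a, b in zip(vis, vis[1:]))

toks = tokenize("foo   .   bar")
print([t[0] for t in visible(toks)])       # ['ID', 'DOT', 'ID']
print(is_contiguous(toks))                 # False
print(is_contiguous(tokenize("foo.bar")))  # True
```

If spaces inside an identifier are not acceptable, some position check
along these lines would be needed once the rule lives in the parser.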