[antlr-interest] Lexer bug?

Sun Oct 21 17:26:25 PDT 2007

This isn't a bug.

You need to write your lexer rule so that the correct path is easy to take,
rather than relying on the lexer to work some magic for you ;-). Lexer
rules cannot see other lexer rules - you want your generated lexer to be as
fast as it can be, because that is where most of your recognition time will
(probably) go. It is a lot easier than you think, and here is a small
example that should let you work it out:

SOMETHINGDOTTY
  : DIGIT+
      (
         '.'  // Here might be decimal or range
           (
               '.' DIGIT+  { $type = RANGE; } // It was a range
             | DIGIT+      { $type = DECIMAL; } // Decimal
             | // Flag ill formatted number/range error
           )
        |  { $type = INTEGER; }
      )
   | '.' DIGIT+            { $type = DECIMAL; }
   ;

The token types are declared either as fragment rules or as entries in the
tokens section (but entries in the tokens section will give you an erroneous
warning that they are not defined as types when you use them in the lexer to
set the token type).

Note that the rule above traps things that look like they are typos (unless
you allow 5.) so that you decide what to do with it, rather than having a
lexer that spits dummies.

Think of this more in terms of how you would program nested if statements.
It would be inefficient to say if *c == '.' && *c != '.' else if *c == '.'
... and so on. Just simplify it down to the simplest non-ambiguous path for
a single lexer rule. Then you will get an efficient and easy-to-maintain
lexer :-)
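
To make the nested-if analogy concrete, here is a rough hand-written sketch
in C of the same decision path as the rule above. It is only an
illustration, not what ANTLR generates: the names TokType, digits and
classify are made up for this example, and it assumes the empty alternative
in the rule is treated as an error.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types, named after the ones set in the rule above. */
typedef enum { TOK_INTEGER, TOK_DECIMAL, TOK_RANGE, TOK_ERROR } TokType;

/* Count the run of decimal digits at the start of c (may be zero). */
static size_t digits(const char *c) {
    size_t n = 0;
    while (isdigit((unsigned char)c[n])) n++;
    return n;
}

/* Classify the token starting at c and store its length in *len.
 * Each character is examined once; no path is tried and then abandoned. */
TokType classify(const char *c, size_t *len) {
    size_t n = digits(c);
    size_t m;

    if (n > 0) {                          /* DIGIT+ */
        if (c[n] != '.') {                /* no dot follows: INTEGER */
            *len = n;
            return TOK_INTEGER;
        }
        n++;                              /* consume the first '.' */
        if (c[n] == '.') {                /* second '.': must be a range */
            m = digits(c + n + 1);
            if (m > 0) {
                *len = n + 1 + m;
                return TOK_RANGE;
            }
        } else if ((m = digits(c + n)) > 0) {
            *len = n + m;                 /* '.' DIGIT+: DECIMAL */
            return TOK_DECIMAL;
        }
        *len = n;                         /* "5." or "5..": flag it */
        return TOK_ERROR;
    }
    if (c[0] == '.' && (m = digits(c + 1)) > 0) {
        *len = 1 + m;                     /* '.' DIGIT+ with no leading digits */
        return TOK_DECIMAL;
    }
    *len = 0;
    return TOK_ERROR;
}
```

Given "10..20" this takes the RANGE branch after seeing the second '.', and
given "5." it ends up in the error branch, which is the place where you
decide what to do with input that looks like a typo.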

Hope this helps,

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Clifford Heath
> Sent: Sunday, October 21, 2007 5:12 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer bug?
> 
> Austin Hastings wrote:
> > You're right. I looked at your definition of NUMBER and just assumed you
> > were using the common one. It looks like a bug.
> >
> > In fact, (some time later) I'm looking at the generated code now with
> > new disrespect. The tokenizer is doing a minimal look-ahead and then
> > committing to a token - when it sees '1' in your 10..20 example, it
> > commits to a NUMBER. When it comes to '.' it commits to FRACTION. There
> > doesn't appear to be any consideration that one path might fail and
> > another be chosen.
> 
> I feared this must be what was happening...
> And yet the DIGIT+ path must fail, but no error is reported.
> So there's *two* errors in the generated lexer, one where it
> takes the wrong path, and one where it doesn't report the error
> it sees.
> 
> This second error must be affecting other cases of invalid input as
> well...?
> 
> Clifford Heath.
> 
