[antlr-interest] Troubles lexing a decimal, (from an antlr beginner)

Tue Jul 24 11:05:48 PDT 2007

Igor,

This question was asked and answered just a few days ago:

I think that this question points out that many of us expect ANTLR to
"just work it out" for us. All these problems are best solved with a
thought experiment first: "How would you scan it with the eye?" Then break
the rule into its alternatives yourself and stick in the lookahead you
perform in your mind. It will result in better generated code anyway:

grammar fred;

stat
    : test+
    ;

test
    : (INT DOT ID)
    | FLOAT
    ;

fragment
DIGIT
    : '0'..'9'
    ;

FLOAT
    : INT
      (
          ('.' INT)=> '.' INT
        | { $type = INT; }
      )
    ;

DOT : '.' ;

fragment    // Also ensures a token type INT is present
INT : DIGIT+ ;

ID  : ('A'..'Z' | 'a'..'z')+
    ;
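
For anyone who wants to see the "scan it with the eye" decision outside of
ANTLR, here is a minimal hand-written sketch in Python. It is not what ANTLR
generates; it only mirrors the FLOAT rule's choice: after the digits, commit
to a FLOAT only if the next two characters are '.' and a digit, otherwise emit
an INT and leave the '.' for DOT. The function name scan, the (type, text)
tuples, and the whitespace skipping are assumptions made for the illustration
(the grammar above has no whitespace rule).

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def is_letter(ch: str) -> bool:
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


def scan(text: str) -> List[Token]:
    """Hypothetical scanner mirroring the FLOAT / INT / DOT / ID rules."""
    tokens: List[Token] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\f\r\n":
            # Assumption: whitespace is skipped (not part of the grammar above).
            i += 1
        elif is_digit(ch):
            start = i
            while i < n and is_digit(text[i]):
                i += 1
            # The lookahead you do "with your mind": '.' followed by a digit.
            if i + 1 < n and text[i] == "." and is_digit(text[i + 1]):
                i += 1  # consume '.'
                while i < n and is_digit(text[i]):
                    i += 1
                tokens.append(("FLOAT", text[start:i]))
            else:
                # Same effect as the { $type = INT; } alternative.
                tokens.append(("INT", text[start:i]))
        elif ch == ".":
            tokens.append(("DOT", ch))
            i += 1
        elif is_letter(ch):
            start = i
            while i < n and is_letter(text[i]):
                i += 1
            tokens.append(("ID", text[start:i]))
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens


if __name__ == "__main__":
    # Third line of the sample input from the original question.
    print([kind for kind, _ in scan("object 1.doSomeAction withThis")])
    # prints ['ID', 'INT', 'DOT', 'ID', 'ID']
    print(scan("1.5"))
    # prints [('FLOAT', '1.5')]
```

The point of the sketch is only that the two-character peek happens before
the '.' is consumed, which is the job the ('.' INT)=> predicate does in the
grammar.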

Jim

From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Igor Murashkin
Sent: Tuesday, July 24, 2007 9:45 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Troubles lexing a decimal, (from an antlr
beginner)

Hello,

Well, let me just say it's my first time using ANTLR. I needed a C#
parser generator, so using flex/bison as I have done before was simply
out of the question, and I figured learning an LL(k) parser should be a
nice variation from just using LR(k).

Unfortunately, before I can even get to the parsing, I need to fix my
lexing. Right now it doesn't match decimals properly. Here are the lexing
rules in question:

===============

DOT        : '.' ;
INTEGER    : Digit+ ;
DECIMAL    : Digit+ '.' Digit+ ;
fragment Digit
    : '0'..'9' ;
IDENT      : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

NL  : ( '\r\n'  // DOS/Windows
      | '\r'    // Macintosh
      | '\n'    // Unix
      )
      { $channel=HIDDEN; };

WS  : ( ' '
      | '\t'
      | '\f'
      )
      { $channel=HIDDEN; };

===============

Unfortunately, with simple input such as this it crashes with an
EarlyExitException:

===============
console.flushBuffer
general.holdMsec 1000
object 1.doSomeAction withThis
=============== 
The third line should produce "IDENT INTEGER DOT IDENT IDENT" but
instead it tries to match "1." as a DECIMAL and then once it sees the
"d" it fails and throws an EarlyExitException. 

I am completely unsure what is going on. I tried setting k=2 in options,
figuring that if it looked at the period AND the next character it would
get ('.', 'd'), which clearly does not match the DECIMAL rule. But then I
just got a bunch of warnings in my lexer grammar, so I removed the k=2
line altogether. Looking at the generated code, though, it's always
calling LA(1); maybe there should be a way to get it to call LA(2)?

Probably I am completely misunderstanding how the whole process of
lexing is working too. Looking at the generated code it is generating
some DFAs, which would imply some kind of regular language being at work
here? Or does it still use LL(k) parsing even for lexing? 

I'm going to try to get the book asap too, probably it explains some of
this...
