[antlr-interest] Re: Short circuit of the lexer

xadeck <decoret at graphics.lcs.mit.edu> decoret at graphics.lcs.mit.edu
Wed Jan 22 07:51:48 PST 2003


--- In antlr-interest at yahoogroups.com, "John D. Mitchell"
<johnm-antlr at n...> wrote:

> 
> (B) Are you sure that it's actually the lexing that's taking a long time?
> Have you actually profiled your lexer and parser to determine that, or are
> you just guessing?
> 
> (C) If you're going to muck with nextToken(), you'll really need to make
> sure that you aren't violating the various assumptions that are being made
> about the state that things are in at any point.
> 
> Take care,
> 	John

Hi, I guessed it was the lexing (I had a look at the generated C++
code), and that was confirmed later. I used a trick inspired by the
OpenVRML developers: I first specify the grammar so that I have
something that compiles and parses my files nicely (but slowly ;-)).
Then I override the lexer and rewrite the nextToken() function. Here is
an example of how things are done faster in the rewritten function. To
recognize a float or an int, I first recognize a digit, then read
everything up to the next whitespace, then use the C standard functions
strtol and strtof to try to convert this string, first to an INT (if
that succeeds, I return a token of that type) and then to a FLOAT. If
both fail, these functions report the character at which conversion
stopped (through their end-pointer argument), so I can raise an
exception.
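
A minimal sketch of the conversion step described above, for readers who
want to try it. It is not the code from the rewritten lexer: the names
classifyNumber, NumberToken and TokenType are hypothetical, and the
lexeme is assumed to have been cut at the next whitespace already. One
caveat: strtof also accepts forms the grammar below does not (for
example hexadecimal floats), so a real lexer may need an extra check.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical token kinds; a real ANTLR lexer would use its generated
// token type constants instead.
enum class TokenType { Int, Float };

struct NumberToken {
    TokenType type;
    long intValue;     // meaningful only when type == TokenType::Int
    float floatValue;  // meaningful only when type == TokenType::Float
};

// Classify one whitespace-delimited lexeme as INT or FLOAT.
// Throws std::invalid_argument if neither conversion consumes it all.
NumberToken classifyNumber(const std::string& lexeme) {
    const char* begin = lexeme.c_str();
    const char* stop = begin + lexeme.size();
    char* end = nullptr;

    // Choose the base by hand: base 0 would read "012" as octal.
    std::size_t i =
        (!lexeme.empty() && (lexeme[0] == '+' || lexeme[0] == '-')) ? 1 : 0;
    bool hex = lexeme.size() > i + 1 && lexeme[i] == '0' &&
               (lexeme[i + 1] == 'x' || lexeme[i + 1] == 'X');

    // First attempt: integer. Success means the whole lexeme was used.
    // An out-of-range integer falls through to the float attempt.
    errno = 0;
    long l = std::strtol(begin, &end, hex ? 16 : 10);
    if (end == stop && end != begin && errno != ERANGE) {
        return {TokenType::Int, l, 0.0f};
    }

    // Second attempt: float, with the same "consumed everything" test.
    float f = std::strtof(begin, &end);
    if (end == stop && end != begin) {
        return {TokenType::Float, 0, f};
    }

    // end points at the character where conversion stopped.
    throw std::invalid_argument("bad number '" + lexeme + "' at offset " +
                                std::to_string(end - begin));
}
```

With this sketch, "2" and "0xff" come back as Int, "2.", ".34" and
"2.34E+12" come back as Float, and something like "2.3x" throws.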

This is really faster (about 20x), so I am happy. Still, I guess ANTLR
could do something very quick too, and I probably messed up (or
overcomplicated) the rules of my lexer. But when I try to simplify
them, I get ANTLR warnings about nondeterminism. Below is the
int/float part.

I'll keep on investigating and will watch this mailing list to improve
my understanding.

Thanks for the hints.

Take care,



// . is a token 
// 2.34 +2.34 -2.34 2. .34 2.34E+12 2.34E-12 are valid floats
// 2 0xff are valid ints

// We need the following to disambiguate INT and FLOAT, and DOT
protected
DOT : '.' ;
protected
ROOTFLOAT
    : ('+'|'-')?('0'..'9')+('.'('0'..'9')*)?(('e'|'E')('+'|'-')?('0'..'9')+)?
    ;
protected
NOROOTFLOAT
    : ('+'|'-')?'.'('0'..'9')+(('e'|'E')('+'|'-')?('0'..'9')+)?
    ;
protected
FLOAT
    : (('+'|'-')?'.'('0'..'9')) => f:NOROOTFLOAT
    | f1:ROOTFLOAT
    ;
protected
ABS_INT
    : ('0'..'9')+
    | '0'('x'|'X')('0'..'9'|'a'..'f'|'A'..'F')+        
    ;
protected
INT
    : ('+'|'-')? ABS_INT
    ;
FLOAT_OR_INT_OR_DOT
    : (('+'|'-')?('0'..'9')+('.'|'e'|'E')) => FLOAT { $setType(FLOAT); }
    | (('+'|'-')?'.'('0'..'9')) => FLOAT { $setType(FLOAT); }
    | ('.') => DOT { $setType(DOT); }
    | INT { $setType(INT); }
    ;









 
