[antlr-interest] Re: Short circuit of the lexer
xadeck <decoret at graphics.lcs.mit.edu>
decoret at graphics.lcs.mit.edu
Wed Jan 22 07:51:48 PST 2003
--- In antlr-interest at yahoogroups.com, "John D. Mitchell"
<johnm-antlr at n...> wrote:
>
> (B) Are you sure that it's actually the lexing that's taking a long time?
> Have you actually profiled your lexer and parser to determine that or are
> you just guessing?
>
> (C) If you're going to muck with nextToken(), you'll really need to make
> sure that you aren't violating the various assumptions that are being made
> about the state that things are in at any point.
>
> Take care,
> John
Hi,

I guessed it was the lexing (I had a look at the generated C++ code),
and this was confirmed later. I used a trick inspired by the OpenVRML
developers: I first specify the grammar so that I have something that
compiles and parses my files nicely (but slowly ;-)). Then I override
the lexer and rewrite the nextToken() function.

Here is an example of how things are done faster in the rewritten
function. To recognize a float or an int, I first recognize a digit,
then read everything up to the next whitespace. I then use the C
standard functions strtol and strtof to try to convert this string,
first to an INT (if that succeeds, I return a token of that type) and
then to a FLOAT. If both fail, these functions report the character at
which the conversion stopped, so I can raise an exception.

This is much faster (about 20x), so I am happy. Still, I guess ANTLR
could do something very quick too, and I probably messed up (or
overcomplicated) the rules of my lexer. But when I try to simplify
them, I get ANTLR warnings about nondeterminism. Below is the
int/float part.

I'll keep investigating and will watch this mailing list to improve
my understanding.

Thanks for the hints.
Take care,
// . is a token
// 2.34 +2.34 -2.34 2. .34 2.34E+12 2.34E-12 are valid floats
// 2 0xff are valid ints
// We need the following to disambiguate INT and FLOAT, and DOT
protected
DOT : '.' ;
protected
ROOTFLOAT
:
('+'|'-')?('0'..'9')+('.'('0'..'9')*)?(('e'|'E')('+'|'-')?('0'..'9')+)?
;
protected
NOROOTFLOAT
: ('+'|'-')?'.'('0'..'9')+(('e'|'E')('+'|'-')?('0'..'9')+)?
;
protected
FLOAT
: (('+'|'-')?'.'('0'..'9')) => f:NOROOTFLOAT
| f1:ROOTFLOAT
;
protected
ABS_INT
: ('0'..'9')+
| '0'('x'|'X')('0'..'9'|'a'..'f'|'A'..'F')+
;
protected
INT
: ('+'|'-')? ABS_INT
;
FLOAT_OR_INT_OR_DOT
: (('+'|'-')?('0'..'9')+('.'|'e'|'E')) => FLOAT { $setType(FLOAT); }
| (('+'|'-')?'.'('0'..'9')) => FLOAT { $setType(FLOAT); }
| ('.') => DOT { $setType(DOT); }
| INT { $setType(INT); }
;
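
For illustration, the strtol/strtof trick described above might be
sketched in C++ roughly as follows. This is only a sketch under stated
assumptions, not the actual rewritten nextToken(): the names NumKind,
NumResult, readWord and classifyNumber are hypothetical, range checking
is omitted, and strtof accepts a few spellings (hex floats, "inf",
"nan") that the grammar above does not.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical token kinds; a real ANTLR lexer would use the token type
// constants from its generated token-types header instead.
enum class NumKind { Int, Float, Invalid };

struct NumResult {
    NumKind kind;
    std::size_t errorPos;  // offset where conversion stopped (Invalid only)
};

// Return the run of non-whitespace characters starting at pos.
std::string readWord(const std::string& input, std::size_t pos) {
    std::size_t end = pos;
    while (end < input.size() &&
           !std::isspace(static_cast<unsigned char>(input[end]))) {
        ++end;
    }
    return input.substr(pos, end - pos);
}

// Try the word as an INT first, then as a FLOAT.
NumResult classifyNumber(const std::string& word) {
    const char* begin = word.c_str();
    char* end = nullptr;

    // Use base 16 only when an explicit 0x/0X prefix follows the optional
    // sign, so that a word like "12ab" is not accepted as hexadecimal.
    std::size_t i = 0;
    if (i < word.size() && (word[i] == '+' || word[i] == '-')) ++i;
    const bool hex = i + 1 < word.size() && word[i] == '0' &&
                     (word[i + 1] == 'x' || word[i + 1] == 'X');

    std::strtol(begin, &end, hex ? 16 : 10);
    if (end != begin && *end == '\0') return {NumKind::Int, 0};

    std::strtof(begin, &end);
    if (end != begin && *end == '\0') return {NumKind::Float, 0};

    // Both conversions stopped early; end marks where strtof gave up.
    return {NumKind::Invalid, static_cast<std::size_t>(end - begin)};
}
```

An overridden nextToken() could call readWord() once it sees a digit,
sign or '.', pass the result to classifyNumber(), and build an INT or
FLOAT token from the answer, or raise an exception pointing at errorPos.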