[antlr-interest] dfa-based lexers versus top-down antlr lexer
s
mzukowski at yci.com
mzukowski at yci.com
Fri Apr 25 14:37:36 PDT 2003
The GCC translation toolkit doesn't do too much interesting in the lexer.
The hardest thing was interpreting the #line directives, but if I could name
pieces of the token like any perl compatible regex library does then it'd be
easy to write the action after the match.
The rest of the world is just so familiar with regular expressions that it's
hard to explain why they can't use them. There's actually a nice python
library by Ka-Ping Yee that lets you specify regular expressions in a more
English fashion http://lfw.org/python/rxb15.py. I'm sure there are various
other nice regex builders out there.
Monty
-----Original Message-----
From: Terence Parr [mailto:parrt at jguru.com]
Sent: Friday, April 25, 2003 1:21 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] dfa-based lexers versus top-down antlr lexers
Does anyone have an opinion concerning ANTLR's construction of top-down
lexers versus more traditional dfa-based state machine solutions?
I just got really irritated this week building the PostScript
interpreter due to lexer lookahead issues.
PROS
Very readable lexers. ('0'..'9')+ turns into a while look you can
debug/read.
Very powerful...can literally parse (because you can call other rules)
in the lexer.
Consistent with all other antlr-generated recognizers. Lexers,
parsers, tree parsers all use the same analyzer and code generator.
Can execute an action anywhere during the recognition of a token...with
DFA based systems you have to wait until you know the complete token
has been match.
CONS
Some lexer rules are a huge pain to specify in ANTLR because of the
limitations of lookahead.
Lexer rules really confuse people since there are lots of special cases
in the lookahead and stuff.
I wonder if a compromise is possible. Use a DFA-based decision to
determine which rule will match; can do arbitrary lookahead and all of
that automatically to make a correct decision. The problem is speed:
you'd parse twice each token. Perhaps I can simply combine all rules
with common left edges automatically to avoid weirdness. For example:
INT : ('0'..'9')+ ;
FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
could be turned into a combine rule by the system:
INT_FLOAT
: ('0'..'9')+ ('.' ('0'..'9')*)?
| '.' ('0'..'9')+
;
or whatever.
Still, my lookahead computations are really twisted for lexical rules
when you can see past end of a rule--literally any character can follow
a token (if you consider erroneous input).
ANYWAY, who has thoughts on this? I'd like thoughts also from people
with *no* experience using DFA-based tools like lex/flex. Do ANTLR
lexers seem "natural"?
Thanks!
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list