[antlr-interest] dfa-based lexers versus top-down antlr lexers

Wed Apr 30 07:05:03 PDT 2003

I've done a couple projects using ANTLR and have not yet 
been able to use the lexer.  I've had to hand-craft my own 
lexer's because of column positional token's issues.

I think it's a strength of ANTLR that you have multiple 
choices on how to build a lexer. 

Adding a dfa-based lexer option is a great idea.
(I wouldn't change the current lexer, I would add second 
style.)

--- Terence Parr <parrt at jguru.com> wrote:
> Does anyone have an opinion concerning ANTLR's construction of
> top-down 
> lexers versus more traditional dfa-based state machine solutions?
> 
> I just got really irritated this week building the PostScript 
> interpreter due to lexer lookahead issues.
> 
> PROS
> 
> Very readable lexers.  ('0'..'9')+ turns into a while look you can 
> debug/read.
> 
> Very powerful...can literally parse (because you can call other
> rules) 
> in the lexer.
> 
> Consistent with all other antlr-generated recognizers.  Lexers, 
> parsers, tree parsers all use the same analyzer and code generator.
> 
> Can execute an action anywhere during the recognition of a
> token...with 
> DFA based systems you have to wait until you know the complete token 
> has been match.
> 
> CONS
> 
> Some lexer rules are a huge pain to specify in ANTLR because of the 
> limitations of lookahead.
> 
> Lexer rules really confuse people since there are lots of special
> cases 
> in the lookahead and stuff.
> 
> I wonder if a compromise is possible.  Use a DFA-based decision to 
> determine which rule will match; can do arbitrary lookahead and all
> of 
> that automatically to make a correct decision.  The problem is speed:
> 
> you'd parse twice each token.  Perhaps I can simply combine all rules
> 
> with common left edges automatically to avoid weirdness.  For
> example:
> 
> INT : ('0'..'9')+ ;
> 
> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
> 
> could be turned into a combine rule by the system:
> 
> INT_FLOAT
> 	:	('0'..'9')+ ('.' ('0'..'9')*)?
> 	|	'.' ('0'..'9')+
> 	;
> 
> or whatever.
> 
> Still, my lookahead computations are really twisted for lexical rules
> 
> when you can see past end of a rule--literally any character can
> follow 
> a token (if you consider erroneous input).
> 
> ANYWAY, who has thoughts on this?  I'd like thoughts also from people
> 
> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR 
> lexers seem "natural"?
> 
> Thanks!
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> Lecturer in Comp. Sci., University of San Francisco
> 
> 
>  
> 
> Your use of Yahoo! Groups is subject to
> http://docs.yahoo.com/info/terms/ 
> 
> 

__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/