[antlr-interest] dfa-based lexers versus top-down antlr lexers

mzukowski at yci.com
Fri Apr 25 14:37:36 PDT 2003


The GCC translation toolkit doesn't do much that's interesting in the lexer.
The hardest part was interpreting the #line directives, but if I could name
pieces of the token, like any Perl-compatible regex library does, then it'd
be easy to write the action after the match.
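For instance, with named groups in Python's re module (a hypothetical sketch of the #line handling, not the toolkit's actual code), the action can refer to the pieces by name:

```python
import re

# Hypothetical sketch: name the pieces of a #line directive so the
# action after the match can refer to them directly.
LINE_DIRECTIVE = re.compile(r'#line\s+(?P<num>\d+)\s+"(?P<file>[^"]*)"')

m = LINE_DIRECTIVE.match('#line 42 "parser.c"')
line_num = int(m.group("num"))
file_name = m.group("file")
```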

The rest of the world is just so familiar with regular expressions that it's
hard to explain why they can't use them.  There's actually a nice Python
library by Ka-Ping Yee that lets you specify regular expressions in a more
English-like fashion: http://lfw.org/python/rxb15.py. I'm sure there are
various other nice regex builders out there.

Monty

-----Original Message-----
From: Terence Parr [mailto:parrt at jguru.com]
Sent: Friday, April 25, 2003 1:21 PM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] dfa-based lexers versus top-down antlr lexers


Does anyone have an opinion concerning ANTLR's construction of top-down 
lexers versus more traditional DFA-based state-machine solutions?

I just got really irritated this week building the PostScript 
interpreter due to lexer lookahead issues.

PROS

Very readable lexers.  ('0'..'9')+ turns into a while loop you can 
debug/read.
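To illustrate the point, here's a rough hand-written sketch (Python, not actual ANTLR output) of the kind of loop a rule like ('0'..'9')+ becomes in a generated top-down lexer:

```python
# Rough sketch of what a top-down lexer does for ('0'..'9')+:
# a plain loop you can step through in a debugger.
def match_int(text, i):
    start = i
    if not ('0' <= text[i] <= '9'):
        raise SyntaxError("expected digit")
    while i < len(text) and '0' <= text[i] <= '9':
        i += 1
    return text[start:i], i
```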

Very powerful...you can literally parse in the lexer (because you can call 
other rules).

Consistent with all other ANTLR-generated recognizers.  Lexers, 
parsers, and tree parsers all use the same analyzer and code generator.

Can execute an action anywhere during the recognition of a token...with 
DFA-based systems you have to wait until you know the complete token 
has been matched.
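A hypothetical sketch of what that buys you: a top-down lexer can fire an action (here, bumping a line counter) while still inside the token, where a DFA-based lexer could only act after the whole token is accepted.

```python
# Sketch of an action executing *during* token recognition: the line
# counter updates mid-token, before the closing quote is even seen.
def match_string(text, i, state):
    assert text[i] == '"'
    i += 1
    start = i
    while i < len(text) and text[i] != '"':
        if text[i] == '\n':
            state["line"] += 1   # action runs mid-token
        i += 1
    return text[start:i], i + 1
```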

CONS

Some lexer rules are a huge pain to specify in ANTLR because of the 
limitations of lookahead.

Lexer rules really confuse people since there are lots of special cases 
in the lookahead and stuff.

I wonder if a compromise is possible.  Use a DFA-based decision to 
determine which rule will match; it can do arbitrary lookahead and all of 
that automatically to make a correct decision.  The problem is speed: 
you'd scan each token twice.  Perhaps I can simply combine all rules 
with common left edges automatically to avoid weirdness.  For example:

INT : ('0'..'9')+ ;

FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;

could be turned into a combined rule by the system:

INT_FLOAT
	:	('0'..'9')+ ('.' ('0'..'9')*)?
	|	'.' ('0'..'9')+
	;

or whatever.
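The combined-left-edge idea might look something like this by hand (a hypothetical Python sketch, not generated code): one routine matches the shared ('0'..'9')+ prefix and only then decides whether the token is an INT or a FLOAT, so no backtracking over the common edge is needed.

```python
# Sketch of the combined INT_FLOAT rule: consume the common left edge
# first, then classify the token based on what follows.
def match_int_or_float(text, i):
    start = i
    if text[i] == '.':                              # alt 2: '.' ('0'..'9')+
        i += 1
        while i < len(text) and text[i].isdigit():
            i += 1
        return "FLOAT", text[start:i], i
    while i < len(text) and text[i].isdigit():      # common edge ('0'..'9')+
        i += 1
    if i < len(text) and text[i] == '.':            # optional fraction => FLOAT
        i += 1
        while i < len(text) and text[i].isdigit():
            i += 1
        return "FLOAT", text[start:i], i
    return "INT", text[start:i], i                  # no '.' seen => INT
```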

Still, my lookahead computations are really twisted for lexical rules 
when you can see past the end of a rule--literally any character can follow 
a token (if you consider erroneous input).

ANYWAY, who has thoughts on this?  I'd like thoughts also from people 
with *no* experience using DFA-based tools like lex/flex.  Do ANTLR 
lexers seem "natural"?

Thanks!
Ter
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco


 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 
