[antlr-interest] Re: dfa-based lexers versus top-down antlr lexers

Oliver Zeigermann oliver at zeigermann.de
Mon Apr 28 15:39:26 PDT 2003


I see what you mean. Lately a friend of mine asked me how ANTLR 
decides which token rule it will apply to a given piece of input. 
I talked for half an hour but could not really explain it. He was 
expecting something like "the most specific rule is the one that 
matches", while I kept trying to say "no, it is all very 
deterministic". Anyway, "most specific" is close to what you came 
up with using left factoring, or "combining left edges" as you 
called it. "Most specific" seems to be the most natural way to 
interpret token definitions. One thing to notice, though, is that

DIGIT1 : '1';

is also more specific than 

INT : ('0'..'9')+ ;

which, as far as I can see, cannot be handled by left factoring. 
Certain rule-based systems do manage to make this distinction, 
though. How do they do it?

Oliver

--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at j...> wrote:
> Does anyone have an opinion concerning ANTLR's construction of top-down
> lexers versus more traditional DFA-based state machine solutions?
> 
> I just got really irritated this week building the PostScript 
> interpreter due to lexer lookahead issues.
> 
> PROS
> 
> Very readable lexers.  ('0'..'9')+ turns into a while loop you can
> debug/read.
> 
> Very powerful...can literally parse (because you can call other rules)
> in the lexer.
> 
> Consistent with all other antlr-generated recognizers.  Lexers, 
> parsers, tree parsers all use the same analyzer and code generator.
> 
> Can execute an action anywhere during the recognition of a token...with
> DFA-based systems you have to wait until you know the complete token
> has been matched.
> 
> CONS
> 
> Some lexer rules are a huge pain to specify in ANTLR because of the
> limitations of lookahead.
> 
> Lexer rules really confuse people since there are lots of special cases
> in the lookahead and stuff.
> 
> I wonder if a compromise is possible.  Use a DFA-based decision to
> determine which rule will match; it can do arbitrary lookahead and all
> of that automatically to make a correct decision.  The problem is speed:
> you'd parse each token twice.  Perhaps I can simply combine all rules
> with common left edges automatically to avoid weirdness.  For example:
> 
> INT : ('0'..'9')+ ;
> 
> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
> 
> could be turned into a combined rule by the system:
> 
> INT_FLOAT
> 	:	('0'..'9')+ ('.' ('0'..'9')*)?
> 	|	'.' ('0'..'9')+
> 	;
> 
> or whatever.
> 
> Still, my lookahead computations are really twisted for lexical rules
> when you can see past the end of a rule--literally any character can
> follow a token (if you consider erroneous input).
> 
> ANYWAY, who has thoughts on this?  I'd also like thoughts from people
> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR
> lexers seem "natural"?
> 
> Thanks!
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> Lecturer in Comp. Sci., University of San Francisco
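
For the INT/FLOAT example above, the combined rule could even pick 
the token type itself, so there is no cross-rule decision left for 
the lookahead machinery at all. A minimal sketch, again assuming the 
2.x $setType shorthand and made-up rule and class names:

class L extends Lexer;

tokens { INT; FLOAT; }   // types assigned by the actions below

// One rule matches both shapes and sets the token type as it goes,
// instead of asking the lookahead computation to choose between
// separate INT and FLOAT rules.
INT_FLOAT
    :   ('0'..'9')+       { $setType(INT); }
        ( '.' ('0'..'9')* { $setType(FLOAT); } )?
    |   '.' ('0'..'9')+   { $setType(FLOAT); }
    ;

Whether the tool should do that combining automatically, as you 
suggest, or leave it to the grammar writer is exactly the open 
question, I guess.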


 
