[antlr-interest] Re: dfa-based lexers versus top-down antlr lexers
Oliver Zeigermann
oliver at zeigermann.de
Mon Apr 28 15:39:26 PDT 2003
I see what you mean. Lately a friend of mine asked me how ANTLR
decides which token rule to use for a production. This made me talk
for half an hour, but I could not really get it explained. He was
thinking of something like "the most specific rule will be the one
to match", and I was trying to say "no, it is all very
deterministic". Anyway, "most specific" is similar to what you came
up with using left factoring, or, as you called it, "combining left
edges". "Most specific" seems to be the most natural way to
interpret token definitions. One thing to notice there is that
DIGIT1 : '1';
is also more specific than
INT : ('0'..'9')+ ;
which, as it seems to me, cannot be handled with left factoring.
Certain rule-based systems, on the other hand, can very well make
this distinction. How do they do it?!
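For what it's worth, DFA-based tools like lex/flex resolve exactly this DIGIT1-versus-INT overlap with two conventions: the longest match wins, and ties are broken in favor of the rule declared first. A rough Python sketch of that disambiguation (the rule names are taken from above; the table-driven shape is purely illustrative, not how any particular tool is implemented):

```python
import re

# Illustrative rule table: (token name, pattern), in declaration order.
RULES = [
    ("DIGIT1", re.compile(r"1")),
    ("INT",    re.compile(r"[0-9]+")),
]

def next_token(text, pos=0):
    """Maximal munch: try every rule, keep the longest match;
    ties go to the earliest-declared rule."""
    best = None  # (match length, rule name)
    for name, pat in RULES:
        m = pat.match(text, pos)
        # Strict '>' means an equally long later rule never displaces
        # an earlier one -- that is the declaration-order tie-break.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name)
    if best is None:
        raise SyntaxError("no rule matches at position %d" % pos)
    length, name = best
    return name, text[pos:pos + length]

# On "1", both rules match one character, so DIGIT1 wins by order;
# on "12", INT matches more characters, so INT wins by length.
```

So "most specific" in those systems is really "longest, then first-listed" -- no left factoring is involved.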
Oliver
--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at j...>
wrote:
> Does anyone have an opinion concerning ANTLR's construction of
> top-down lexers versus more traditional DFA-based state machine
> solutions?
>
> I just got really irritated this week building the PostScript
> interpreter due to lexer lookahead issues.
>
> PROS
>
> Very readable lexers. ('0'..'9')+ turns into a while loop you can
> debug/read.
>
> Very powerful...you can literally parse in the lexer (because you
> can call other rules).
>
> Consistent with all other antlr-generated recognizers. Lexers,
> parsers, tree parsers all use the same analyzer and code generator.
>
> Can execute an action anywhere during the recognition of a
> token...with DFA-based systems you have to wait until you know the
> complete token has been matched.
>
> CONS
>
> Some lexer rules are a huge pain to specify in ANTLR because of
> the limitations of lookahead.
>
> Lexer rules really confuse people since there are lots of special
> cases in the lookahead and stuff.
>
> I wonder if a compromise is possible. Use a DFA-based decision to
> determine which rule will match; it can do arbitrary lookahead and
> all of that automatically to make a correct decision. The problem
> is speed: you'd parse each token twice. Perhaps I can simply
> combine all rules with common left edges automatically to avoid
> weirdness. For example:
>
> INT : ('0'..'9')+ ;
>
> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
>
> could be turned into a combined rule by the system:
>
> INT_FLOAT
> : ('0'..'9')+ ('.' ('0'..'9')*)?
> | '.' ('0'..'9')+
> ;
>
> or whatever.
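The combined INT_FLOAT rule above could be scanned with a single left-factored decision: consume the common left edge first, and only commit to a token type once the shape of the input is known. A hand-written Python sketch of that idea (the function name and error messages are illustrative, not ANTLR-generated code):

```python
def int_or_float(text, pos=0):
    """Scan one token per the combined rule:
       ('0'..'9')+ ('.' ('0'..'9')*)?  |  '.' ('0'..'9')+
    and decide INT vs FLOAT only at the end."""
    start = pos
    saw_dot = False
    if pos < len(text) and text[pos].isdigit():
        # Common left edge: ('0'..'9')+
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        # Optional fractional part: ('.' ('0'..'9')*)?
        if pos < len(text) and text[pos] == '.':
            saw_dot = True
            pos += 1
            while pos < len(text) and text[pos].isdigit():
                pos += 1
    elif pos < len(text) and text[pos] == '.':
        # Second alternative: '.' ('0'..'9')+
        saw_dot = True
        pos += 1
        if pos >= len(text) or not text[pos].isdigit():
            raise SyntaxError("expected digit after '.'")
        while pos < len(text) and text[pos].isdigit():
            pos += 1
    else:
        raise SyntaxError("expected digit or '.'")
    # The token type is decided after matching, from what was seen.
    return ("FLOAT" if saw_dot else "INT"), text[start:pos]
```

The point being: no lookahead beyond one character is ever needed, because the INT/FLOAT choice has been deferred past the shared prefix.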
>
> Still, my lookahead computations are really twisted for lexical
> rules since you can see past the end of a rule--literally any
> character can follow a token (if you consider erroneous input).
>
> ANYWAY, who has thoughts on this? I'd like thoughts also from
> people with *no* experience using DFA-based tools like lex/flex.
> Do ANTLR lexers seem "natural"?
>
> Thanks!
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> Lecturer in Comp. Sci., University of San Francisco