[antlr-interest] dfa-based lexers versus top-down antlr lexers
Terence Parr
parrt at jguru.com
Tue Apr 29 09:42:05 PDT 2003
On Friday, April 25, 2003, at 04:53 PM, Matthew Ford wrote:
> I am including (again) a revised spec for handling tree generation
> for your consideration.
>
> So I suggest this be relaxed a little to say that no tree generation
> code is output, except that labels in the rule are initialized with
> the appropriate minimal tree.
> For example
>> drop_table_statement!
>> : "drop" "table" t3:table_name t4:drop_behavior
> ;
> results in #t3 containing the tree resulting from the rule
> table_name, and
> statement!
> : INT PLUS i:INT
> ;
> would set up a tree for label i consisting of a single root node
> containing the INT token
I believe it works this way now. At least, I'm using this feature for
my course at the moment. :)
> Section 2: Syntax for manual modification of trees in the Parser
Interesting @ operator. Loring and I are talking about tree rewriting
stuff too while I contemplate future directions. Thanks for the
thoughts.
Ter
> Elements of the current rule can be addressed using the following:
> ## is a shortcut for AST_return, the current result tree.
> #id is a shortcut for the current tree rooted at the location
> originally occupied by the node labelled by id.
> @id is a shortcut for the root node of the tree rooted at the
> location originally occupied by the node labelled by id.
> When these occur on the rhs of = they are replaced by clones of their
> respective nodes or trees. This prevents deadly loops. As an
> optimisation, ## = #(#[token],##) could be done without cloning ##.
> When these occur on the lhs of = they refer to that location in the
> tree. This allows subtree replacements.
>
> e.g.
> statement
> : lhsVar e:EQUALS^ a:addition
> {
> #a = #(@a,#[INT,"5"],#[INT,"6"]);
> // the children of the addition subtree in the result (##) have
> been replaced with 5,6
> // a: now refers to the new subtree; the original subtree has been
> replaced by it.
> @a = #[MINUS];
> // the root of the new subtree a is now MINUS
> ## = #(#[STATEMENT],##);
> // add a node to the top of the result tree. a: and e: still point
> to the same subtrees.
> ## = #[DIV];
> // where do a: and e: point now? They still point to their subtrees,
> which are not released until the rule returns.
> // so the following is valid
> ## = #(@#,#a,#[INT,"3"],#a);
> // @# is the root node of ## which is now just #[DIV]
> // note it is valid to use #a twice as it is cloned.
> }
> ;
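The clone-on-the-rhs semantics proposed above can be sketched in a few lines of Python. The `AST` class and names here are hypothetical, illustrative only, and not ANTLR's actual implementation; the point is that cloning every tree referenced on the right-hand side means an assignment can never make a tree contain itself, and that reusing `#a` twice is safe.

```python
# Minimal sketch of the proposed clone-on-rhs semantics (hypothetical
# AST class, not ANTLR's real one).
import copy

class AST:
    def __init__(self, text, children=None):
        self.text = text
        self.children = list(children or [])

    def clone(self):
        # Deep copy: the new tree shares no nodes with the original,
        # which is what prevents "deadly loops" (cyclic trees).
        return copy.deepcopy(self)

    def __repr__(self):
        if not self.children:
            return self.text
        return "(%s %s)" % (self.text, " ".join(map(repr, self.children)))

# a = the addition subtree, say from "x = 1 + 2"
a = AST("+", [AST("1"), AST("2")])

# "#a = #(@a, #[INT,'5'], #[INT,'6'])": @a on the rhs is cloned first,
# so the tree being assigned to never ends up inside itself.
new_a = AST(a.clone().text, [AST("5"), AST("6")])

# "## = #(@#, #a, #[INT,'3'], #a)": using #a twice is fine because each
# occurrence is a separate clone.
result = AST("/", [new_a.clone(), AST("3"), new_a.clone()])
print(result)   # (/ (+ 5 6) 3 (+ 5 6))
```

Because each rhs occurrence is an independent clone, later mutation of `new_a` leaves `result` untouched, matching the "not released until the rule returns" behavior described above.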
>
>
>
> ----- Original Message -----
> From: "Terence Parr" <parrt at jguru.com>
> To: <antlr-interest at yahoogroups.com>
> Sent: Saturday, April 26, 2003 6:20 AM
> Subject: [antlr-interest] dfa-based lexers versus top-down antlr lexers
>
>
>> Does anyone have an opinion concerning ANTLR's construction of
>> top-down
>> lexers versus more traditional dfa-based state machine solutions?
>>
>> I just got really irritated this week building the PostScript
>> interpreter due to lexer lookahead issues.
>>
>> PROS
>>
>> Very readable lexers. ('0'..'9')+ turns into a while loop you can
>> debug/read.
>>
>> Very powerful...can literally parse (because you can call other rules)
>> in the lexer.
>>
>> Consistent with all other antlr-generated recognizers. Lexers,
>> parsers, tree parsers all use the same analyzer and code generator.
>>
>> Can execute an action anywhere during the recognition of a
>> token...with DFA-based systems you have to wait until you know the
>> complete token has been matched.
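The PROS above can be made concrete with a hand-written sketch of the kind of code a top-down lexer generator emits for INT : ('0'..'9')+ ; (method names like `mINT` and `LA` are illustrative, not ANTLR's actual generated code):

```python
# Sketch of a generated top-down (recursive-descent) lexer rule for
# INT : ('0'..'9')+ ; -- illustrative names, not ANTLR's real output.
class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def LA(self, i=1):
        # i-symbol lookahead; '\0' serves as an EOF sentinel
        j = self.pos + i - 1
        return self.text[j] if j < len(self.text) else '\0'

    def consume(self):
        self.pos += 1

    def mINT(self):
        start = self.pos
        # ('0'..'9')+ becomes a plain while loop you can step through
        if not self.LA().isdigit():
            raise SyntaxError("expected digit at %d" % self.pos)
        while self.LA().isdigit():
            # an action could execute right here, mid-token --
            # something a DFA cannot do before the token is complete
            self.consume()
        return ("INT", self.text[start:self.pos])

print(Lexer("1234+").mINT())   # ('INT', '1234')
```

Because a rule is just a method, it can also call other rule methods, which is what makes it possible to "literally parse" in the lexer.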
>>
>> CONS
>>
>> Some lexer rules are a huge pain to specify in ANTLR because of the
>> limitations of lookahead.
>>
>> Lexer rules really confuse people since there are lots of special
>> cases in the lookahead and stuff.
>>
>> I wonder if a compromise is possible. Use a DFA-based decision to
>> determine which rule will match; it can do arbitrary lookahead and
>> all of that automatically to make a correct decision. The problem is
>> speed: you'd parse each token twice. Perhaps I can simply combine
>> all rules with common left edges automatically to avoid weirdness.
>> For example:
>>
>> INT : ('0'..'9')+ ;
>>
>> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
>>
>> could be turned into a combined rule by the system:
>>
>> INT_FLOAT
>> : ('0'..'9')+ ('.' ('0'..'9')*)?
>> | '.' ('0'..'9')+
>> ;
>>
>> or whatever.
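The combined INT_FLOAT rule above can be sketched as a single hand-written function (illustrative only, not generated code): match the shared left edge of digits once, then decide the token type from what actually followed, so no input is scanned twice.

```python
# Sketch of the combined INT_FLOAT rule: consume the common left edge
# ('0'..'9')+ once, then pick the token type afterward (illustrative).
def int_float(text, pos=0):
    start = pos
    is_float = False
    if pos < len(text) and text[pos] == '.':
        # second alternative: '.' ('0'..'9')+
        is_float = True
        pos += 1
        if pos >= len(text) or not text[pos].isdigit():
            raise SyntaxError("expected digit after '.'")
        while pos < len(text) and text[pos].isdigit():
            pos += 1
    else:
        # first alternative: ('0'..'9')+ ('.' ('0'..'9')*)?
        if pos >= len(text) or not text[pos].isdigit():
            raise SyntaxError("expected digit")
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos < len(text) and text[pos] == '.':
            is_float = True
            pos += 1
            while pos < len(text) and text[pos].isdigit():
                pos += 1
    # the token type is only known here, after the match
    return ("FLOAT" if is_float else "INT", text[start:pos])

print(int_float("3.14"))  # ('FLOAT', '3.14')
print(int_float("42"))    # ('INT', '42')
```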
>>
>> Still, my lookahead computations are really twisted for lexical
>> rules when you can see past the end of a rule--literally any
>> character can follow a token (if you consider erroneous input).
>>
>> ANYWAY, who has thoughts on this? I'd like thoughts also from people
>> with *no* experience using DFA-based tools like lex/flex. Do ANTLR
>> lexers seem "natural"?
>>
>> Thanks!
>> Ter
>> --
>> Co-founder, http://www.jguru.com
>> Creator, ANTLR Parser Generator: http://www.antlr.org
>> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
>> Lecturer in Comp. Sci., University of San Francisco
>>
>>
>>
>>
>> Your use of Yahoo! Groups is subject to
>> http://docs.yahoo.com/info/terms/
>>
>>
>
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco