[antlr-interest] dfa-based lexers versus top-down antlr lexers

Terence Parr parrt at jguru.com
Tue Apr 29 09:42:05 PDT 2003


On Friday, April 25, 2003, at 04:53  PM, Matthew Ford wrote:
> I am including (again) a revised spec for handling tree generation for 
> your
> consideration.
>
> So I suggest this be relaxed a little, to say that no tree generation
> code is output except that labels in the rule are initialized with the
> appropriate minimal tree.
> For example
>> drop_table_statement!
>> : "drop" "table" t3:table_name t4:drop_behavior
> ;
> results in #t3 containing the tree resulting from the rule table_name,
> and
> statement!
> : INT PLUS i:INT
> ;
> would set up a tree for label i consisting of a single root node
> containing the INT token.

I believe it works this way now.  At least, I'm using this feature for 
my course at the moment. :)
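To make the label-initialization idea concrete, here is a minimal sketch in Python (ANTLR itself emits Java; the `Node` class and token tuples here are hypothetical, purely for illustration):

```python
# Sketch of the proposal: even when automatic tree construction is
# suppressed (the rule! suffix), each label is still bound to the
# minimal tree for its element -- for a token, a single root node.
class Node:
    def __init__(self, ttype, text="", children=None):
        self.ttype = ttype
        self.text = text
        self.children = children or []

def parse_statement(tokens):
    # statement! : INT PLUS i:INT ;
    # No result tree is built, but label i gets a one-node tree
    # wrapping its INT token.
    assert tokens[0][0] == "INT" and tokens[1][0] == "PLUS" and tokens[2][0] == "INT"
    labels = {"i": Node(*tokens[2])}   # minimal tree: a single root node
    return labels
```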


> Section 2: Syntax for manual modification of trees in the Parser

Interesting @ operator.  Loring and I are talking about tree rewriting 
stuff too while I contemplate future directions.  Thanks for the 
thoughts.

Ter

> Elements of the current rule can be addressed using the following:
> ## is a shortcut for AST_return, the current result tree.
> #id is a shortcut for the current tree rooted at the location
> originally occupied by the node labelled by id.
> @id is a shortcut for the root node of the tree rooted at the location
> originally occupied by the node labelled by id.
> When these occur on the rhs of = they are replaced by clones of their
> respective nodes or trees. This prevents deadly loops (a tree
> containing itself). As an optimisation, ## = #(#[token],##) could be
> done without cloning ##.
> When these occur on the lhs of = they refer to that location in the
> tree. This allows subtree replacements.
>
> eg
> statement
> :   lhsVar e:EQUALS^ a:addition
>   {
>       #a = #(@a,#[INT,"5"],#[INT,"6"]);
>       // the children of the addition subtree in the result (##) have
>       // been replaced with 5,6.
>       // a: now refers to the new subtree; the original subtree has
>       // been replaced by it.
>       @a = #[MINUS];
>       // the root of the new subtree a is now MINUS
>       ## = #(#[STATEMENT],##);
>       // add a node to the top of the result tree.  a: and e: still
>       // point to the same subtrees.
>       ## = #[DIV];
>       // where do a: and e: point now?  They still point to their
>       // subtrees, which are not released until the rule returns,
>       // so the following is valid:
>       ## = #(@#,#a,#[INT,"3"],#a);
>       // @# is the root node of ##, which is now just #[DIV]
>       // note it is valid to use #a twice as it is cloned
>   }
> ;
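The clone-on-rhs rule above is the interesting part: because every `#id` appearing on the right of `=` is copied, the same label can appear twice without the result tree ever containing itself. A minimal sketch of that semantics, in illustrative Python (the `Node` class and `tree` helper are hypothetical stand-ins, not ANTLR's actual API):

```python
import copy

class Node:
    def __init__(self, ttype, text="", children=None):
        self.ttype = ttype
        self.text = text
        self.children = children or []

def tree(root, *children):
    # Analogue of #(root, kid1, kid2, ...): clone every argument, so
    # reusing a label on the rhs cannot create a "deadly loop" (a tree
    # that contains itself).
    clone = copy.deepcopy(root)
    clone.children = [copy.deepcopy(c) for c in children]
    return clone

# ## = #(#[STATEMENT],##) -- wrap the current result under a new root.
result = Node("PLUS", "+", [Node("INT", "5"), Node("INT", "6")])
result = tree(Node("STATEMENT"), result)
```

Because `tree` clones its arguments, `tree(Node("DIV"), result, result)` yields two independent copies of `result` rather than two references to the same subtree.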
>
>
>
> ----- Original Message -----
> From: "Terence Parr" <parrt at jguru.com>
> To: <antlr-interest at yahoogroups.com>
> Sent: Saturday, April 26, 2003 6:20 AM
> Subject: [antlr-interest] dfa-based lexers versus top-down antlr lexers
>
>
>> Does anyone have an opinion concerning ANTLR's construction of
>> top-down lexers versus more traditional DFA-based state machine
>> solutions?
>>
>> I just got really irritated this week building the PostScript
>> interpreter due to lexer lookahead issues.
>>
>> PROS
>>
>> Very readable lexers.  ('0'..'9')+ turns into a while loop you can
>> debug/read.
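To make the "readable lexers" point concrete: a subrule like `('0'..'9')+` becomes an ordinary loop you can step through in a debugger. Roughly this shape, sketched here in Python for brevity (ANTLR actually emits Java, and the real generated code checks lookahead via its own character-stream API):

```python
# Hand sketch of the loop a top-down lexer generates for ('0'..'9')+ :
# match at least one digit, then keep consuming digits.
def match_int(text, pos):
    start = pos
    if not (pos < len(text) and text[pos].isdigit()):
        raise ValueError("expected digit at position %d" % pos)
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    return text[start:pos], pos
```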
>>
>> Very powerful...can literally parse (because you can call other rules)
>> in the lexer.
>>
>> Consistent with all other antlr-generated recognizers.  Lexers,
>> parsers, tree parsers all use the same analyzer and code generator.
>>
>> Can execute an action anywhere during the recognition of a
>> token...with DFA-based systems you have to wait until the complete
>> token has been matched.
>>
>> CONS
>>
>> Some lexer rules are a huge pain to specify in ANTLR because of the
>> limitations of lookahead.
>>
>> Lexer rules really confuse people since there are lots of special 
>> cases
>> in the lookahead and stuff.
>>
>> I wonder if a compromise is possible.  Use a DFA-based decision to
>> determine which rule will match; it can do arbitrary lookahead and
>> all of that automatically to make a correct decision.  The problem is
>> speed: you'd scan each token twice.  Perhaps I can simply combine all
>> rules with common left edges automatically to avoid weirdness.  For
>> example:
>>
>> INT : ('0'..'9')+ ;
>>
>> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
>>
>> could be turned into a combined rule by the system:
>>
>> INT_FLOAT
>> : ('0'..'9')+ ('.' ('0'..'9')*)?
>> | '.' ('0'..'9')+
>> ;
>>
>> or whatever.
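The combined-rule idea can be sketched as a scanner that matches the merged left edge once and decides the token type from what it actually consumed, avoiding the double scan. An illustrative Python sketch (the function name and token-type strings are made up for this example, not ANTLR output):

```python
# Combined INT_FLOAT rule: one pass over the input, token type decided
# afterwards by whether a '.' was consumed.
def next_number(text, pos):
    start = pos
    saw_dot = False
    if pos < len(text) and text[pos] == '.':
        # alternative 2: '.' ('0'..'9')+
        saw_dot = True
        pos += 1
        if not (pos < len(text) and text[pos].isdigit()):
            raise ValueError("expected digit after '.'")
        while pos < len(text) and text[pos].isdigit():
            pos += 1
    else:
        # alternative 1: ('0'..'9')+ ('.' ('0'..'9')*)?
        if not (pos < len(text) and text[pos].isdigit()):
            raise ValueError("expected digit or '.'")
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos < len(text) and text[pos] == '.':
            saw_dot = True
            pos += 1
            while pos < len(text) and text[pos].isdigit():
                pos += 1
    return ("FLOAT" if saw_dot else "INT"), text[start:pos], pos
```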
>>
>> Still, my lookahead computations are really twisted for lexical rules
>> when you can see past the end of a rule--literally any character can
>> follow a token (if you consider erroneous input).
>>
>> ANYWAY, who has thoughts on this?  I'd like thoughts also from people
>> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR
>> lexers seem "natural"?
>>
>> Thanks!
>> Ter
>> --
>> Co-founder, http://www.jguru.com
>> Creator, ANTLR Parser Generator: http://www.antlr.org
>> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
>> Lecturer in Comp. Sci., University of San Francisco
>>
>>
>>
>>
>> Your use of Yahoo! Groups is subject to 
>> http://docs.yahoo.com/info/terms/
>>
>>
>
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Co-founder, http://www.peerscope.com link sharing, pure-n-simple
Lecturer in Comp. Sci., University of San Francisco

