[antlr-interest] dfa-based lexers versus top-down antlr lexers

Matthew Ford Matthew.Ford at forward.com.au
Fri Apr 25 16:53:37 PDT 2003


Interesting you are looking at the lexer
My first language (a simulation language) had a hand built top down parser
(using procedure calls) and a hand build lexer using a state machine. I
would not do that again.

Now I do all my translations in Antlr.  I cann't say I have actually had
many problems with the lexer.  I just follow the examples and design my
languages to simplify issues.  Sometimes I cheat the use a switch in the
lexer controlled by the parser (just have to watch the lookaheads) to change
the token to a skip in certain parse situations.  I have more problems with
the parser sorting out ambiguities, generating sensible errors and
recovering from them and deciding between expressions and other things
(infinite look ahead is great using Syntactic Predicates).

My real bug bear is the tree generator/parser and the inconsistent way it
generates nodes when there are mixes of !.  In my latest project (an
automatic differentiator) I had to write a lot of the tree parser by hand.

I am including (again) a revised spec for handling tree generation for your
consideration.

matthew


Draft specification for Antlr Tree Generation
By Matthew Ford
Revision 1 25th Sept 2001

Section 1:Control of automatic tree generation in the Parser
By default the Parser will automatically generate AST trees.  This
generation can be
disabled globally by setting buildAST=false.

When buildAST=false ALL code related to AST tree building is removed and the
only ways to build your own tree are :-
i) to update a global tree or
ii) use Antlr's return syntax to pass your own tree back.
(Note: in this mode you should not need to link or load any AST code unless
you
reference it yourself from action, etc)

With buildAST = true (that is the default) you can selectively disable tree
generation
by using the ! syntax.  This can be used on either a rule or token basis.
Example of a rule based use of ! to disable tree generation

addition!
    :   INT PLUS i:INT
;

In this case no tree generation code is generated for this rule.  If you
want to create a
tree by hand for this rule you need to return it as shown below
addition returns [AST return_tree]!
   :   INT PLUS i:INT  { .. code to generate return_tree }
;


So I suggest this be relaxed a little to say that
No tree generation code is output except that labels in the rule are
initialized with the
appropiate minimal tree.
For example
> drop_table_statement!
> : "drop" "table" t3:table_name t4:drop_behavior
;
results in #t3 containing tree resulting from the rule table_name
and
statement!
: INT PLUS i:INT
;
would set up a tree for label i consisting of a single root node containing
the INT
token
This allows the user to control what tree code is added to their code if the
tree
generation is turned off for a rule. If there are not labels then no code.


To suppress a single token use ! after the token.  It will not be added to
the tree, eg.
statement
:  lhsVar EQUALS rhs SEMI!  // SEMI is not added to the tree.
;

Note as far as the rule statement is concerned
statement
:   lhsVar EQUALS addition!  // suppress addition of tree returned from
addition
;
addition:
    :   INT PLUS i:INT
;

Is the same as
statement
:   lhsVar EQUALS addition
;
addition!   // suppress generation tree
    :   INT PLUS i:INT
;

But in the second case no rule in the parser can get a tree from the
addition rule.

and
statement
 :  lhsVar EQUALS addition!
;
addition!
    :   INT PLUS i:INT
;

is redundant but legal.

You would probably actually use something like
statement
{AST addTree;}
:   lhsVar EQUALS^ addTree=addition!
   { ## =   build tree here using ## and addTree }
;

addition returns [AST returnTree]
    :   INT PLUS^ i:INT
{ returnTree = ##}  // pick up the autogenerated tree
;

Note: It makes no sense in this system to allow ! to be applied to
alternative of rules
that is :-
statement
{AST addTree;}
:   lhsVar EQUALS^ addTree=addition!
    { ## =   build tree here using ## and addTree }
|!  printstatement
;

is now illegal

In all other cases (that is when buildAST is true and ! is not used) the
return tree is
always generated and assigned to the global AST_return to be picked up by
the parent
rule.  This AST_return can be modified/overwritten using the syntax
discussed below.

Section 2: Syntax for manual modification of trees in the Parser
Note this is for modification of trees that have been automatically created.
If you set
buildAST=false or use ! on a rule, you are on your own as no tree code is
generated
for you.

Tree nodes are created using
#[TOKEN_TYPE] or #[TOKEN_TYPE,"text"]

Trees are created using
#(root, c1, ..., cn)
where
 root must be a node
 c1,to cn are the 1st to nth children which may be either nodes or
other trees.

Elements of the current rule can be addressed using the following
## is a short cut for AST_return, the current result tree.
#id is a short cut for the current tree rooted at the location originally
occupied by the
node labelled by id
@id is a short cut for the root node of the tree rooted at the location
originally
occupied by the node labelled by id
When these occur on the rhs of = they are replaced by clones of their
respective nodes
or trees. This prevents deadly loops.  As an optimisation ## =
#(#[token],##) could be
done without cloning ##.
When these occur on the lhs of = they refer to that location in the tree.
This allows
subtree replacements.

eg
statement
:   lhsVar e:EQUALS^ a:addition
  {
      #a = #(@a,#[INT,"5"],#[INT,"6"]);
     // the children of the addition subtree in the result (##) have been
replace with 5,6
    // a: now refers to the new subtree, the original subtree is has been
replaced by it.
    @a = #[MINUS]
    // the root of the new subtree a is now MINUS
   ## = #(#[STATEMENT],##);
   // add a node to the top of the result tree.  a: and e: still point to
the same subtrees.
  ## = #[DIV];
  // where do a: and e: point now?  They still point to there subtrees which
are not
released until the rule returns.
 // so the following is valid
  ## = #(@#,#a,#[INT,"3"],#a);
   // @# is the root node of ##  which is now just #[DIV]
   // note it is valid to use #a twice as it is cloned.
}
;



----- Original Message -----
From: "Terence Parr" <parrt at jguru.com>
To: <antlr-interest at yahoogroups.com>
Sent: Saturday, April 26, 2003 6:20 AM
Subject: [antlr-interest] dfa-based lexers versus top-down antlr lexers


> Does anyone have an opinion concerning ANTLR's construction of top-down
> lexers versus more traditional dfa-based state machine solutions?
>
> I just got really irritated this week building the PostScript
> interpreter due to lexer lookahead issues.
>
> PROS
>
> Very readable lexers.  ('0'..'9')+ turns into a while look you can
> debug/read.
>
> Very powerful...can literally parse (because you can call other rules)
> in the lexer.
>
> Consistent with all other antlr-generated recognizers.  Lexers,
> parsers, tree parsers all use the same analyzer and code generator.
>
> Can execute an action anywhere during the recognition of a token...with
> DFA based systems you have to wait until you know the complete token
> has been match.
>
> CONS
>
> Some lexer rules are a huge pain to specify in ANTLR because of the
> limitations of lookahead.
>
> Lexer rules really confuse people since there are lots of special cases
> in the lookahead and stuff.
>
> I wonder if a compromise is possible.  Use a DFA-based decision to
> determine which rule will match; can do arbitrary lookahead and all of
> that automatically to make a correct decision.  The problem is speed:
> you'd parse twice each token.  Perhaps I can simply combine all rules
> with common left edges automatically to avoid weirdness.  For example:
>
> INT : ('0'..'9')+ ;
>
> FLOAT : ('0'..'9')+ ('.' ('0'..'9')*)? | '.' ('0'..'9')+ ;
>
> could be turned into a combine rule by the system:
>
> INT_FLOAT
> : ('0'..'9')+ ('.' ('0'..'9')*)?
> | '.' ('0'..'9')+
> ;
>
> or whatever.
>
> Still, my lookahead computations are really twisted for lexical rules
> when you can see past end of a rule--literally any character can follow
> a token (if you consider erroneous input).
>
> ANYWAY, who has thoughts on this?  I'd like thoughts also from people
> with *no* experience using DFA-based tools like lex/flex.  Do ANTLR
> lexers seem "natural"?
>
> Thanks!
> Ter
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> Co-founder, http://www.peerscope.com link sharing, pure-n-simple
> Lecturer in Comp. Sci., University of San Francisco
>
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
>
>


 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 




More information about the antlr-interest mailing list