[antlr-interest] comments on Tom Moog's notes
Terence Parr
parrt at cs.usfca.edu
Sun Mar 14 13:16:29 PST 2004
Hi Tom,
Tom,
Thanks for the very impressive notes. There are linked to from "antlr
2 bashing page" and here is a direct link:
http://www.antlr.org/blog/antlr3/moog.txt
A few comments enclosed.
Terence
> Follow set caching
> ------------------
> There are problems analyzing expressions of the form:
>
> rule: (A1|B1) {A2|B2) ... (A25|B25)
>
> because follow sets are cached only for rules. In this case 2**25
> alternative would need to be computed. However, if each subrule
> had its follow set cached, it would require only 50 evaluations.
The new NFA to DFA conversion will not simulate the NFA to compute a
finite, acyclic DFA (i.e., lookahead tree). It will compute a possibly
cyclic DFA. Now, lookahead computation will not be a function of k!
It will be a function of the number of states. :)
> There are two kinds of semantic predicates needed. One kind is used
> to creat a "virtual token" which combines token information with a
> semantic test. This is the classic C typename test. The other kind
> of semantic predicate is used to control flow and does not depend on
> context. In short, type 1 semantic predicate always depend onOB
> lookahead while type 2 semantic predicate never depend on lookahead.
Can't you just define a predicate that doesn't reference lookahead? Is
that what you mean? Can you give me an example?
> Support for Unicode up to at least 0x10ffff (current xml range).
>
> Tolerance for binary zero would be desirable. After all, flex
> can do it.
I have an example that matches binary stuff :) The UNICODE works up to
0xFFFFE :)
> Consider the following rule:
>
> r1: A ( B | epsilon )* C ;
>
> with input:
>
> A C
>
> In pccts this results in an ambiguity warning because it can't decide
> whether whether to enter the epsilon branch or bypass the entire
> subrule. When there is no action in the epsilon alternative antlr3
> should allow it and handle it in the "natural" way - bypassing
> the subrule
I'm not sure I can agree...you have provided two ways to match the same
input: an ambiguity by definition, right? On the other hand, are you
saying for action processing this is very desirable? For example,
(A | B)*
matches the same thing as
(A | B | {foo} )*
but {foo} is executed as it exits or if nothing is matched?
> | call : ("@" ID)? && <<isVarName(LT(2))>>? fun_call
> |
> | | (ID)? && <<isExtCmdName>>? command
> |
> | | (ID "(")? fun_call
> |
> | | command
> |
> | ;
> |
How does isExtCmdName test the input symbol? Is it a variable? If so,
where is it set?
I'm hoping to "hoist > 0 lookahead distance" with the combined NFA->DFA
conversion + collect predicts. I'll run this past you when I figure it
out ;)
> The volume of memory required for bit sets grows as N**2 because the
> number of terminals tends to grow as N and the number of tests grows
> as N. For very large grammars this is a bit of an annoyance (pun).
> You might consider a technique which would try for greater locality of
> bit sets so that it for non-pathological cases it would be N**1.5 or
> N*log(N). An alternative would be run-length encoding for sparse
> bit sets. The bits sets could be expanded at startup time.
> I believe JFlex uses this.
I think that a simple initial offset like "my first non-zero bit is n"
will lop off a big chunk off the front and still provide equally fast
access at analysis and parse-time with no uncompression. Not sure, but
should help. I'm definitely a bit concerned with the size of the
bitsets needed to build DFAs from a large grammar. Could end up with
30,000 states for a big grammar. I note that there are 4515 words in
the java.g file. Each word will result in perhaps 3 or 4 states.
Note that I stop the DFA conversion the instant an alternative is
uniquely predicted. Even if sets are very big, I won't be computing
that many of them. For degenerate cases like
a : A Q R S | B X Y Z ;
The DFA would be simply three states:
o -A-> o predict alt 1
|
| -B-> o predict alt 2
I won't even consider the Q or X or beyond. Slick, eh?
Now, UNICODE in the lexer could be an issue, but again, I don't build a
DFA for entire lexer grammar, just for the left edges for prediction.
:)
Concerning exceptions for code gen. I agree that try/catch is the
easiest to use as I want to use exceptions for error handling. What
would we do for C? I'd resist the longjmp this time probably ;)
> Set Difference
> --------------
> Is there any way to say: this token matches all the tokens *not* in
> the FIRST set of some other alternative or not in the union of the
> FIRST sets of some other alternatives ?
Do you mean like
( x
| y
| .
)
This means match not x and not y. It removes the lookhead from '.'
from previous alts...this is an "idiom" i guess. Are you suggesting
something more formal?
Thanks again for the thoughts.
--
Professor Comp. Sci., University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list