[antlr-interest] comments on Tom Moog's notes

Sun Mar 14 13:16:29 PST 2004

Hi Tom,

Tom,

Thanks for the very impressive notes.  There are linked to from "antlr 
2 bashing page" and here is a direct link:

http://www.antlr.org/blog/antlr3/moog.txt

A few comments enclosed.

Terence

> Follow set caching
> ------------------
> There are problems analyzing expressions of the form:
>
> rule: (A1|B1) {A2|B2) ... (A25|B25)
>
> because follow sets are cached only for rules.  In this case 2**25
> alternative would need to be computed.  However, if each subrule
> had its follow set cached, it would require only 50 evaluations.

The new NFA to DFA conversion will not simulate the NFA to compute a 
finite, acyclic DFA (i.e., lookahead tree).  It will compute a possibly 
cyclic DFA.  Now, lookahead computation will not be a function of k!  
It will be a function of the number of states. :)

> There are two kinds of semantic predicates needed.  One kind is used
> to creat a "virtual token" which combines token information with a
> semantic test.  This is the classic C typename test.  The other kind
> of semantic predicate is used to control flow and does not depend on
> context.  In short, type 1 semantic predicate always depend onOB
> lookahead while type 2 semantic predicate never depend on lookahead.

Can't you just define a predicate that doesn't reference lookahead?  Is 
that what you mean?  Can you give me an example?

> Support for Unicode up to at least 0x10ffff (current xml range).
>
> Tolerance for binary zero would be desirable.  After all, flex
> can do it.

I have an example that matches binary stuff :)  The UNICODE works up to 
0xFFFFE :)

> Consider the following rule:
>
>     r1: A ( B | epsilon )* C ;
>
> with input:
>
>     A C
>
> In pccts this results in an ambiguity warning because it can't decide
> whether whether to enter the epsilon branch or bypass the entire
> subrule.  When there is no action in the epsilon alternative antlr3
> should allow it and handle it in the "natural" way - bypassing
> the subrule

I'm not sure I can agree...you have provided two ways to match the same 
input: an ambiguity by definition, right?  On the other hand, are you 
saying for action processing this is very desirable?  For example,

(A | B)*

matches the same thing as

(A | B | {foo} )*

but {foo} is executed as it exits or if nothing is matched?

>  |   call       : ("@" ID)? && <<isVarName(LT(2))>>? fun_call          
>     |
>  |              | (ID)?     && <<isExtCmdName>>?     command           
>     |
>  |              | (ID "(")?                          fun_call          
>     |
>  |              |                                    command           
>     |
>  |              ;                                                      
>     |

How does isExtCmdName test the input symbol?  Is it a variable?  If so, 
where is it set?

I'm hoping to "hoist > 0 lookahead distance" with the combined NFA->DFA 
conversion + collect predicts.  I'll run this past you when I figure it 
out ;)

> The volume of memory required for bit sets grows as N**2 because the
> number of terminals tends to grow as N and the number of tests grows
> as N.  For very large grammars this is a bit of an annoyance (pun).
> You might consider a technique which would try for greater locality of
> bit sets so that it for non-pathological cases it would be N**1.5 or
> N*log(N).  An alternative would be run-length encoding for sparse
> bit sets.  The bits sets could be expanded at startup time.
> I believe JFlex uses this.

I think that a simple initial offset like "my first non-zero bit is n" 
will lop off a big chunk off the front and still provide equally fast 
access at analysis and parse-time with no uncompression.  Not sure, but 
should help.  I'm definitely a bit concerned with the size of the 
bitsets needed to build DFAs from a large grammar.  Could end up with 
30,000 states for a big grammar.  I note that there are 4515 words in 
the java.g file.  Each word will result in perhaps 3 or 4 states.

Note that I stop the DFA conversion the instant an alternative is 
uniquely predicted.  Even if sets are very big, I won't be computing 
that many of them.  For degenerate cases like

a : A Q R S | B X Y Z ;

The DFA would be simply three states:

o -A-> o predict alt 1
|
| -B-> o predict alt 2

I won't even consider the Q or X or beyond.  Slick, eh?

Now, UNICODE in the lexer could be an issue, but again, I don't build a 
DFA for entire lexer grammar, just for the left edges for prediction. 
:)

Concerning exceptions for code gen.  I agree that try/catch is the 
easiest to use as I want to use exceptions for error handling.  What 
would we do for C?  I'd resist the longjmp this time probably ;)

> Set Difference
> --------------
> Is there any way to say: this token matches all the tokens *not* in
> the FIRST set of some other alternative or not in the union of the
> FIRST sets of some other alternatives ?

Do you mean like

( x
| y
| .
)

This means match not x and not y.  It removes the lookhead from '.' 
from previous alts...this is an "idiom" i guess.  Are you suggesting 
something more formal?

Thanks again for the thoughts.
--
Professor Comp. Sci., University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/