[antlr-interest] suggestions for 2.7.2

Sat Jan 5 15:35:53 PST 2002

On Thursday, January 3, 2002, at 12:10  AM, Stdiobe wrote:

>
> Hi Ter,
>
> in a message on jguru.com you asked for suggestions for 2.7.2.
> Well .... here are my suggestions (hope you don't mind I use
> yahoo instead of jguru; I prefer email).
>
> * With antlr you specify a lookahead for the complete grammar;
> it is not possible to specify a lookahead per (sub)rule. I believe
> javacc does support this, and I think it would be great if antlr
> would allow different lookaheads on (sub)rule basis.
> (That would eliminate many of the syntactic predicates I have
> to use now).
>
>    e.g.   myrule {options k=4;} :  A B C D | A B C E ;

This is an option I have considered.  It is only an optimization as 
ANTLR automatically determines the min lookahead to use (I think JavaCC 
*forces* you to specify the lookahead).

In this example, can you not set k=4 for the whole grammar?  You'll see 
in the output that k=1 is used whenever possible.
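
To make the "minimum lookahead" idea concrete, here is a minimal Java sketch. It is not ANTLR's actual analysis, and the class and method names are made up for illustration. It assumes each alternative is a fixed token sequence and reports the first depth at which the two differ. For A B C D versus A B C E that depth is 4, while alternatives that start with different tokens need only 1.

```java
// Hypothetical helper, not part of ANTLR: smallest lookahead depth that
// tells two alternatives apart when each one is a fixed token sequence.
class MinLookahead {
    // Returns the 1-based depth of the first differing token, or -1 if the
    // sequences agree over the whole length of the shorter one (this sketch
    // does not consult follow sets, so it cannot decide that case).
    static int minK(String[] alt1, String[] alt2) {
        int n = Math.min(alt1.length, alt2.length);
        for (int i = 0; i < n; i++) {
            if (!alt1[i].equals(alt2[i])) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Under these assumptions only the decision inside myrule pays for depth 4; a decision whose alternatives already differ at the first token still resolves at depth 1.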

> * it used to be possible to have multiple lexers and parsers in
> one grammar file. But that has been disabled (for whatever reason).
> It would be great if this was enabled again, cause now i have to
> put lexers in multiple grammar files (antlr doesn't support states, so
> I need multiple lexers).

Streams/multiple lexers are much more powerful than lexer states.  I 
can't remember why I limited it to one lexer per file.  I remember there 
being a good reason though! ;)

> * as said in previous point, the lexer in antlr doesn't support
> lexical states; the alternative is to use multiple lexers.
> However, now I have to duplicate several lexer rules across
> different lexers; that's not a good thing!

Can you subclass effectively?

> Now, if it were possible to specify semantic predicates with
> the "highest" lexer rules (the ones that match a complete token),
> then I could mimic lexer states using semantic predicates!
> That would be a great improvement, e.g.
>
>     TOKEN_A  :  (mystate == 1)? "a" ;
>     TOKEN_A2 : (mystate == 2)? "a" ;

Well, it's easy enough to just do

TOKEN_A_A2
	:	{mystate == 1}? "a"
	|	{mystate == 2}? "a"
	;

Remember, though, that feedback from the parser to the lexer is often 
dangerous (especially at k>1).
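
As a rough picture of what a single rule with state-guarded alternatives amounts to, here is a small Java sketch. It is a hand-written stand-in, not generated ANTLR code, and the class name, constants, and classify method are all hypothetical.

```java
// Illustrative stand-in for a lexer with a state variable; all names are
// hypothetical and this is not what ANTLR generates.
class StateLexer {
    static final int INVALID = 0;
    static final int TOKEN_A = 1;
    static final int TOKEN_A2 = 2;

    int mystate = 1; // set by whoever drives the lexer, e.g. a parser action

    // One rule, two guarded alternatives that match the same text.
    int classify(String text) {
        if (mystate == 1 && text.equals("a")) return TOKEN_A;
        if (mystate == 2 && text.equals("a")) return TOKEN_A2;
        return INVALID;
    }
}
```

The danger mentioned above shows up here as a timing problem: if the parser has already pulled several tokens of lookahead, those tokens were classified under the old value of mystate before the parser got a chance to change it.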

> * another problem with lexers in antlr is the fact that I have to
> combine conflicting rules into a common rule and use setType
> to return the correct token.
> Couldn't it be possible to specify syntactic predicates with the
> "highest" lexer rules (the ones that match a complete token).
> The first lexer rule would be tried first, and then the next one, etc.
> (similar to multiple sub-rules with syntactic predicates).
> That way I could keep my lexer a lot more readable/maintainable.
>
> e.g.
>     TOKEN_FLOAT: (INT '.')=> INT '.' INT ;
>     TOKEN_INTEGER:  INT ;

See above solution.  BTW, you should left factor for efficiency:

TOKEN_F_I : INT ('.' INT)? ;
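
The efficiency argument for left factoring is that the leading INT is scanned once and the token type is decided afterwards, instead of speculatively matching INT '.' and then starting over. A minimal hand-written Java sketch of that idea follows. The class, constants, and methods are hypothetical, and it classifies a whole string rather than a character stream.

```java
// Hypothetical hand-written equivalent of a left-factored number rule.
class NumberScanner {
    static final int NO_MATCH = 0;
    static final int TOKEN_INTEGER = 1;
    static final int TOKEN_FLOAT = 2;

    // Scans the leading digits once, then picks the type from what follows.
    static int scan(String s) {
        int i = skipDigits(s, 0);
        if (i == 0) return NO_MATCH;               // no leading INT
        if (i == s.length()) return TOKEN_INTEGER; // INT only
        if (s.charAt(i) == '.') {
            int j = skipDigits(s, i + 1);
            // Require at least one digit after the dot and nothing else.
            if (j > i + 1 && j == s.length()) return TOKEN_FLOAT;
        }
        return NO_MATCH;
    }

    // Returns the index of the first non-digit at or after 'from'.
    static int skipDigits(String s, int from) {
        int i = from;
        while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
        return i;
    }
}
```

In a grammar the same decision would typically be made in an action that sets the token type once the optional fraction has or has not been matched.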

> * with grammar inheritance it is possible to redefine a rule, but
> as far as I know it is not possible to extend a rule without having
> to repeat its contents.
> e.g    myRule : super.myRule | someExtraRule ;

Correct.  I decided the syntactic complexity was not worth it.

> * when using grammar inheritance I have to specify all base
> grammars on the command line. It would be nice if I could
> specify a list of search directories instead. (easier to manage)

True.  It would be nice.

> * the java code of the antlr parser generator is mixed with the
> java code for the antlr runtime. It would be GREAT if you
> could move the runtime classes to a separate package so there
> is a clear distinction between runtime antlr classes and parser
> generator classes.

Problem is that ANTLR is written in ANTLR.  A tough proposition to pull 
the stuff apart, but better separation would be great :)

> * although resolving conflicts in antlr is a lot easier than in yacc,
> it can still be difficult to understand why a conflict occurs.
> It would be great if antlr would report the two different derivations
> that conflict (and not only the rule where the conflict occurs).

Agreed.  I would like to see a highlighted grammar in HTML or something.

> * it would be nice if antlr would generate a _nextToken() function
> instead of nextToken(). The default implementation of nextToken() would
> just call _nextToken(), but I would have the possibility of redefining
> the nextToken() function.

super.nextToken() should work for ya.
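
A minimal Java sketch of the subclassing approach, assuming a simplified stand-in for the generated lexer (the real generated class and its Token type differ, and both class names here are hypothetical):

```java
// Stand-in for an ANTLR-generated lexer; tokens are plain strings here.
class GeneratedLexer {
    private final String[] tokens;
    private int pos = 0;

    GeneratedLexer(String[] tokens) { this.tokens = tokens; }

    // Returns the next token text, or "EOF" when the input is exhausted.
    String nextToken() {
        return pos < tokens.length ? tokens[pos++] : "EOF";
    }
}

// Subclass that post-processes tokens without touching the generated code.
class FilteringLexer extends GeneratedLexer {
    FilteringLexer(String[] tokens) { super(tokens); }

    @Override
    String nextToken() {
        String t = super.nextToken();
        while (t.equals("WS")) { // drop whitespace tokens
            t = super.nextToken();
        }
        return t;
    }
}
```

The parser is then handed the subclass instead of the generated lexer, so regenerating the lexer does not overwrite the custom logic.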

> * I use {options greedy=true} a lot; can't you introduce a special
> symbol for this construct? e.g. [a]?  or  [a]* where the brackets
> indicate greedy (or something like that)

Well, in all of my grammars I have not needed that much since it is the 
default behavior (in that case it only shuts off a warning message).  
Are you sure your grammars are as clean as you can make them?

> * i have several parsers, and each parser defines some common
> support methods/functions. Currently these functions are duplicated
> between the parsers. I tried to put them in a common class
> and derive my parsers from this common class, but antlr insists
> that i defined this common class in a .g file; it can't be a plain java
> class (that extends class LLkParser).
> It would be nice if I could derive my parsers from my own
> parser class; not only from an antlr generated parser class.

True, but you'd have to make sure your parser class derived from 
LLkParser or whatever.  Easier to just use delegation, right?  
commonToolsObject.mycommonMethod().
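
A minimal Java sketch of the delegation idea. The two parser classes are stand-ins for generated parsers, and CommonTools and its unquote method are hypothetical examples of a shared helper.

```java
// Plain Java helper shared by several parsers; needs no .g file.
class CommonTools {
    // Example shared routine: strip surrounding double quotes, if present.
    String unquote(String literal) {
        if (literal.length() >= 2 && literal.startsWith("\"") && literal.endsWith("\"")) {
            return literal.substring(1, literal.length() - 1);
        }
        return literal;
    }
}

// Stand-ins for two generated parsers; each holds a reference to the helper
// rather than inheriting from it.
class ParserA {
    private final CommonTools tools;
    ParserA(CommonTools tools) { this.tools = tools; }
    String stringValue(String tokenText) { return tools.unquote(tokenText); }
}

class ParserB {
    private final CommonTools tools;
    ParserB(CommonTools tools) { this.tools = tools; }
    String nameValue(String tokenText) { return tools.unquote(tokenText); }
}
```

In a grammar file, the helper field could be declared in the parser's class-member action block and called from rule actions, so the superclass of the generated parser is left alone.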

> * a nice feature I found in a different tool was the concept of defining
> parser rules for "hidden" tokens. This can be used to parse
> preprocessor statements, like (#include or COPY in cobol) by
> specifying special grammar rules in the parser.
> The concept is simple: hidden tokens are ignored unless there
> are "special" rules that match them.
> (I'm not sure how this could be integrated in antlr, but I liked the
> concept).

The token stream stuff should be a superset of that.  For example, you 
can send comments, preproc stuff, whitespace, and real tokens to the 
parser all on different "channels".  The parser's actions are free to do 
what they want with them including running the stream through a piece of 
the grammar.  This is similar to my javadoc example.
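
To sketch the "channels" idea in plain Java: each token carries a channel tag, and a consumer pulls out only the channel it cares about. This is an illustration under assumed names (ChanToken, ChannelSplitter), not ANTLR's token stream API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token with a channel tag; ANTLR's own Token class differs.
class ChanToken {
    final String channel; // e.g. "default", "comment", "preproc"
    final String text;

    ChanToken(String channel, String text) {
        this.channel = channel;
        this.text = text;
    }
}

class ChannelSplitter {
    // Returns the texts of all tokens on the requested channel, in order.
    static List<String> select(List<ChanToken> tokens, String channel) {
        List<String> out = new ArrayList<>();
        for (ChanToken t : tokens) {
            if (t.channel.equals(channel)) {
                out.add(t.text);
            }
        }
        return out;
    }
}
```

The main parser would see only the "default" tokens, while an action could take the "preproc" or "comment" tokens and run them through a separate piece of grammar.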

>
> * finally, I understand that the way you combine lookahead sets
> can cause incorrect parsers (I experienced that just a week ago ...).
> It would be nice if I could overrule the "merging of lookahead sets"
> for specific rules.

More precisely, my lookahead computation gives you linear-approximate 
LL(k) lookahead, not full LL(k).  The approximation keeps one token set 
per lookahead depth instead of a set of k-token sequences, which differs 
from what your human brain sees as intuitive.  I do this for efficiency 
reasons: O(n x k) vs O(n^k). :)  Typically syntactic predicates will 
take care of any issues you have. :)
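
A small Java sketch of the difference, using one character per token. It is illustrative only (the class and method names are made up, and this is not how ANTLR stores its sets). Suppose an alternative can start with the sequences AB or BA. Full LL(2) remembers both pairs. The linear approximation keeps {A,B} for depth 1 and {A,B} for depth 2, so it also lets AA and BB through.

```java
// Illustrative only: contrasts full LL(k) with linear-approximate lookahead
// for one alternative, with one character standing for one token.
class Lookahead {
    // Full LL(k): the input must start with one of the listed sequences.
    static boolean fullMatch(String[] sequences, String input) {
        for (String s : sequences) {
            if (input.startsWith(s)) return true;
        }
        return false;
    }

    // Linear approximation: at each depth, the input token only has to occur
    // at that depth in some sequence, independently of the other depths.
    static boolean approxMatch(String[] sequences, String input, int k) {
        if (input.length() < k) return false;
        for (int depth = 0; depth < k; depth++) {
            boolean seen = false;
            for (String s : sequences) {
                if (depth < s.length() && s.charAt(depth) == input.charAt(depth)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) return false;
        }
        return true;
    }
}
```

With sequences {"AB", "BA"} and k=2, fullMatch rejects "AA" but approxMatch accepts it. That over-acceptance is the kind of case where a syntactic predicate is needed to separate two alternatives.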

Thanks much for the effort involved in making the suggestions. :)

Best regards,
Ter
--
Chief Scientist & Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
