[antlr-interest] Re: Merging token vocabularies; Ter adds grammar composition

Thu Jun 24 15:59:22 PDT 2004

On Jun 24, 2004, at 12:52 PM, FranklinChen at cmu.edu wrote:
> I think I need one huge token vocabulary in order to avoid overloading
> integers when embedding ASTs, as I mentioned.
>
> Suppose (for illustration only; this is nothing like what I'm actually
> doing) I want to process a file that looks like
>
> ===
> This is a document with C code and Java code:
> <c>
> class Bar {};
> </c>
> Java is here.
> <java>
> public class Foo {};
> </java>
> ===
>
> Presumably, I want to build an AST that looks like
> (DOCUMENT (TEXT "This is a document with C code and Java code:\n")
>   (C (CLASS (NAME "Bar") ...))
>   (TEXT "\nJava is here.\n")
>   (JAVA (CLASS (NAME "Foo") ...)))
>
> and then use a tree walker to do my work.
>
> Assume that I have an independently created C parser and Java parser
> that I want to just use.  I can't just use in my own parser
>
> options {
>   importVocab = CParser;
>   importVocab = JavaParser;
> }
>
> so I have to do something else.  As I mentioned, what I am doing is
> generating a CommonTokenTypes.txt by parsing the CParserTokenTypes.txt
> and JavaParserTokenTypes.txt and merging the vocabularies, and then
> importing Common back into CParser and JavaParser and regenerating.
> This seems to be exactly what you are proposing:  ensuring that a
> Common vocabulary be imported!  Or am I misunderstanding you?

Hi Gang,

[warning: this got long] ;)

Good question / interesting problem.  Since I'm working on the vocab 
stuff for 3.0 at the moment, perhaps I'll think about this for a 
paragraph or two or ten.

The problem you have is that both a C and a Java grammar will export 
the same token, say, IF with different token types.  Moreover, the 
parsers have generated constant defs that say IF=342 and IF=2 or 
something.  Seems like it would be pretty hard to get a single lexer to 
generate two different values.  Ok, so that means one needs to have 
separate lexers for both C and for Java.  No problems as the token 
"spaces" are then totally independent.  One simply needs an overall 
"controller" (could be a higher-level grammar) that invokes the C or 
Java parser.  Problem is the trees of course...

Second choice is as you did: create a merged vocab and have both C and 
Java parsers/lexers import that merged vocab.  Only trouble is that the 
vocab files needed to merge are created after ANTLR has generated the 
parsers so one needs to run ANTLR again on the C and Java parser to get 
their token types in sync.

One could have the Java parser import the C parser's vocab then the 
output vocab of the Java parser would be the combined one.  The C 
parser would have to be regenerated though.  This is a drag.  I wonder 
if we shouldn't change the problem around.  You are really saying that 
you want to build a new grammar composed of other grammars.  That is, 
you want to import the grammar not the vocabulary really.  I plan on 
*not* allowing inheritance ala ANTLR 2.x.  It will become grammar 
composition or I may leave it to an IDE tool living above ANTLR to do 
the composition, but for now let's look at what it might look like:

grammar combined;

import C; // include all rules from C
import Java.expr; // include rule expr and all rules invoked by expr

In this way, the token type issue is resolved because I don't have to 
rely on two different recognizers generated with independent token 
types.  I would literally cut/paste (like I did with inheritance) the 
appropriate rules into the new combined grammar.  Yes, i have to 
cut/paste because rules are not independent.  If I override a rule in 
combined that is invoked by a rule in Java or C grammars then every 
rule could change it's lookahead (this is very unlike method 
inheritance).

Now on to trees.  Problem is solved because ID for both input grammars 
is now the single ID of combined grammar.  Any trees created have the 
unique ID token type value. :)

There are systems such as Visser's SDF that allow grammars to be reused 
when they are placed in so-called "modules".  You can import whatever 
rules you want or whole grammars.  You can set parameters too.  I can 
imagine a generic rule for a list of IDs that could handle 
comma-separated, colon-separated, or whatever lists.  You could define 
the grammar like a template:

grammar commonStuff;
idList<separator> : ID (<separator> ID)* ;

Then in another grammar, you could do this:

grammar whatever;

import commonStuff.idList<COMMA> as idList;

decl : "int" idList SEMI ;

Something like that.  Looks cool, but you know every time I try to 
think of a really great example of re-usage with parameterized rule 
imports, I get stuck.  Seems like just plain "import a bunch of rules" 
is what we want.  The normal usage would be to import an existing 
grammar and then change a few rules (for syntax or action changes) or 
to import rules from a few grammars to make a new one.  The latter case 
is the only time it's more "powerful" than plain single inheritance.

Still I question whether even this is that useful (ignoring tree 
walkers for now).  Are you really ever going to build a new grammar by 
grabbing the Java expr rule set and the C statement set and then 
somehow hope the merged grammar will work?  Perhaps if we made smaller 
grammars that were really just grammar subsets that were useful.  
Hmm...even then, I've never ever been in a situation where I felt I 
could wholesale grab large chunks of another grammar and reuse without 
serious modifications (read that cut/paste and modify).  Now, I start 
most grammars by having to type or cut/paste lexer rules such as ID, 
WS, INT, and so on.  I can see importing lexer rules as they are 
independent of each other, but an IDE could simply let you pick from a 
list of standard lexical rules to get you started.  Seems hardly worth 
complicating the tool for this simple application.

Back to trees with vocab not grammar imports...I'd like to hear from 
Monty to hear more about his use of combined vocabs in his tree walkers 
and his use of inheritance.  I think Monty made a tree supergrammar 
that had all the tokens in it and then each tree walker phase of the 
translator subclassed that grammar to get all the tokens.  That is 
basically the same as doing an importVocab on a combined vocab you 
built.

In summary, i have no good solution ;)  Can anybody add to this 
conversation about importing grammars and such?

Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/