[antlr-interest] Re: Merging token vocabularies; Ter adds grammar composition

Thu Jun 24 17:09:42 PDT 2004

Ter

There are a few tricks that can be applied to ANTLR implementation. 
One is simply to sort token types by name before numbering them; it
might also be desirable to add an aliasing option to go along with
this, and to make the "=<number>" part of the <tokens>.txt entries
optional (number assigned by sequence, as in "C" enums) to support
merges and use of diff'ed output.  Back-translating items such as
LITERAL_if (to "if") might be useful, as well.

Another approach at the application level is to support a renumbering
"visitor" pass that translates between two token mappings.

I've worried at this issue off and on because I think that it may be
practical to use parse trees as an output formatting device.  Parse
tree grammars can be automatically generated for any input grammar; by
transforming another language into that parse tree format, it might be
possible to quickly develop a language translator for paradigmatically
similar languages.  More flexible token numbering support is worth
considering for ANTLR 3.

--Loring

--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> wrote:
> On Jun 24, 2004, at 12:52 PM, FranklinChen at c... wrote:
> > I think I need one huge token vocabulary in order to avoid overloading
> > integers when embedding ASTs, as I mentioned.
> >
> > Suppose (for illustration only; this is nothing like what I'm actually
> > doing) I want to process a file that looks like
> >
> > ===
> > This is a document with C code and Java code:
> > <c>
> > class Bar {};
> > </c>
> > Java is here.
> > <java>
> > public class Foo {};
> > </java>
> > ===
> >
> > Presumably, I want to build an AST that looks like
> > (DOCUMENT (TEXT "This is a document with C code and Java code:\n")
> >   (C (CLASS (NAME "Bar") ...))
> >   (TEXT "\nJava is here.\n")
> >   (JAVA (CLASS (NAME "Foo") ...)))
> >
> > and then use a tree walker to do my work.
> >
> > Assume that I have an independently created C parser and Java parser
> > that I want to just use.  I can't just use in my own parser
> >
> > options {
> >   importVocab = CParser;
> >   importVocab = JavaParser;
> > }
> >
> > so I have to do something else.  As I mentioned, what I am doing is
> > generating a CommonTokenTypes.txt by parsing the CParserTokenTypes.txt
> > and JavaParserTokenTypes.txt and merging the vocabularies, and then
> > importing Common back into CParser and JavaParser and regenerating.
> > This seems to be exactly what you are proposing:  ensuring that a
> > Common vocabulary be imported!  Or am I misunderstanding you?
> 
> Hi Gang,
> 
> [warning: this got long] ;)
> 
> Good question / interesting problem.  Since I'm working on the vocab 
> stuff for 3.0 at the moment, perhaps I'll think about this for a 
> paragraph or two or ten.
> 
> The problem you have is that both a C and a Java grammar will export 
> the same token, say, IF with different token types.  Moreover, the 
> parsers have generated constant defs that say IF=342 and IF=2 or 
> something.  Seems like it would be pretty hard to get a single lexer to 
> generate two different values.  Ok, so that means one needs to have 
> separate lexers for both C and for Java.  No problems as the token 
> "spaces" are then totally independent.  One simply needs an overall 
> "controller" (could be a higher-level grammar) that invokes the C or 
> Java parser.  Problem is the trees of course...
> 
> Second choice is as you did: create a merged vocab and have both C and 
> Java parsers/lexers import that merged vocab.  Only trouble is that the 
> vocab files needed to merge are created after ANTLR has generated the 
> parsers so one needs to run ANTLR again on the C and Java parser to get 
> their token types in sync.
> 
> One could have the Java parser import the C parser's vocab then the 
> output vocab of the Java parser would be the combined one.  The C 
> parser would have to be regenerated though.  This is a drag.  I wonder 
> if we shouldn't change the problem around.  You are really saying that 
> you want to build a new grammar composed of other grammars.  That is, 
> you want to import the grammar not the vocabulary really.  I plan on 
> *not* allowing inheritance ala ANTLR 2.x.  It will become grammar 
> composition or I may leave it to an IDE tool living above ANTLR to do 
> the composition, but for now let's look at what it might look like:
> 
> grammar combined;
> 
> import C; // include all rules from C
> import Java.expr; // include rule expr and all rules invoked by expr
> 
> In this way, the token type issue is resolved because I don't have to 
> rely on two different recognizers generated with independent token 
> types.  I would literally cut/paste (like I did with inheritance) the 
> appropriate rules into the new combined grammar.  Yes, i have to 
> cut/paste because rules are not independent.  If I override a rule in 
> combined that is invoked by a rule in Java or C grammars then every 
> rule could change it's lookahead (this is very unlike method 
> inheritance).
> 
> Now on to trees.  Problem is solved because ID for both input grammars 
> is now the single ID of combined grammar.  Any trees created have the 
> unique ID token type value. :)
> 
> There are systems such as Visser's SDF that allow grammars to be reused 
> when they are placed in so-called "modules".  You can import whatever 
> rules you want or whole grammars.  You can set parameters too.  I can 
> imagine a generic rule for a list of IDs that could handle 
> comma-separated, colon-separated, or whatever lists.  You could define 
> the grammar like a template:
> 
> grammar commonStuff;
> idList<separator> : ID (<separator> ID)* ;
> 
> Then in another grammar, you could do this:
> 
> grammar whatever;
> 
> import commonStuff.idList<COMMA> as idList;
> 
> decl : "int" idList SEMI ;
> 
> Something like that.  Looks cool, but you know every time I try to 
> think of a really great example of re-usage with parameterized rule 
> imports, I get stuck.  Seems like just plain "import a bunch of rules" 
> is what we want.  The normal usage would be to import an existing 
> grammar and then change a few rules (for syntax or action changes) or 
> to import rules from a few grammars to make a new one.  The latter case 
> is the only time it's more "powerful" than plain single inheritance.
> 
> Still I question whether even this is that useful (ignoring tree 
> walkers for now).  Are you really ever going to build a new grammar by 
> grabbing the Java expr rule set and the C statement set and then 
> somehow hope the merged grammar will work?  Perhaps if we made smaller 
> grammars that were really just grammar subsets that were useful.  
> Hmm...even then, I've never ever been in a situation where I felt I 
> could wholesale grab large chunks of another grammar and reuse without 
> serious modifications (read that cut/paste and modify).  Now, I start 
> most grammars by having to type or cut/paste lexer rules such as ID, 
> WS, INT, and so on.  I can see importing lexer rules as they are 
> independent of each other, but an IDE could simply let you pick from a 
> list of standard lexical rules to get you started.  Seems hardly worth 
> complicating the tool for this simple application.
> 
> Back to trees with vocab not grammar imports...I'd like to hear from 
> Monty to hear more about his use of combined vocabs in his tree walkers 
> and his use of inheritance.  I think Monty made a tree supergrammar 
> that had all the tokens in it and then each tree walker phase of the 
> translator subclassed that grammar to get all the tokens.  That is 
> basically the same as doing an importVocab on a combined vocab you 
> built.
> 
> In summary, i have no good solution ;)  Can anybody add to this 
> conversation about importing grammars and such?
> 
> Ter
> --
> CS Professor & Grad Director, University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Cofounder, http://www.jguru.com
> Cofounder, http://www.knowspam.net enjoy email again!
> Cofounder, http://www.peerscope.com pure link sharing

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/