[antlr-interest] Re: Merging token vocabularies; Ter adds grammar composition
lgcraymer
lgc at mail1.jpl.nasa.gov
Thu Jun 24 17:09:42 PDT 2004
Ter
There are a few tricks that can be applied to ANTLR implementation.
One is simply to sort token types by name before numbering them; it
might also be desirable to add an aliasing option to go along with
this, and to make the "=<number>" part of the <tokens>.txt entries
optional (number assigned by sequence, as in "C" enums) to support
merges and use of diff'ed output. Back-translating items such as
LITERAL_if (to "if") might be useful, as well.
Another approach at the application level is to support a renumbering
"visitor" pass that translates between two token mappings.
I've worried at this issue off and on because I think that it may be
practical to use parse trees as an output formatting device. Parse
tree grammars can be automatically generated for any input grammar; by
transforming another language into that parse tree format, it might be
possible to quickly develop a language translator for paradigmatically
similar languages. More flexible token numbering support is worth
considering for ANTLR 3.
--Loring
--- In antlr-interest at yahoogroups.com, Terence Parr <parrt at c...> wrote:
> On Jun 24, 2004, at 12:52 PM, FranklinChen at c... wrote:
> > I think I need one huge token vocabulary in order to avoid overloading
> > integers when embedding ASTs, as I mentioned.
> >
> > Suppose (for illustration only; this is nothing like what I'm actually
> > doing) I want to process a file that looks like
> >
> > ===
> > This is a document with C code and Java code:
> > <c>
> > class Bar {};
> > </c>
> > Java is here.
> > <java>
> > public class Foo {};
> > </java>
> > ===
> >
> > Presumably, I want to build an AST that looks like
> > (DOCUMENT (TEXT "This is a document with C code and Java code:\n")
> > (C (CLASS (NAME "Bar") ...))
> > (TEXT "\nJava is here.\n")
> > (JAVA (CLASS (NAME "Foo") ...)))
> >
> > and then use a tree walker to do my work.
> >
> > Assume that I have an independently created C parser and Java parser
> > that I want to just use. I can't just use in my own parser
> >
> > options {
> > importVocab = CParser;
> > importVocab = JavaParser;
> > }
> >
> > so I have to do something else. As I mentioned, what I am doing is
> > generating a CommonTokenTypes.txt by parsing the CParserTokenTypes.txt
> > and JavaParserTokenTypes.txt and merging the vocabularies, and then
> > importing Common back into CParser and JavaParser and regenerating.
> > This seems to be exactly what you are proposing: ensuring that a
> > Common vocabulary be imported! Or am I misunderstanding you?
>
> Hi Gang,
>
> [warning: this got long] ;)
>
> Good question / interesting problem. Since I'm working on the vocab
> stuff for 3.0 at the moment, perhaps I'll think about this for a
> paragraph or two or ten.
>
> The problem you have is that both a C and a Java grammar will export
> the same token, say, IF with different token types. Moreover, the
> parsers have generated constant defs that say IF=342 and IF=2 or
> something. Seems like it would be pretty hard to get a single lexer to
> generate two different values. Ok, so that means one needs to have
> separate lexers for both C and for Java. No problems as the token
> "spaces" are then totally independent. One simply needs an overall
> "controller" (could be a higher-level grammar) that invokes the C or
> Java parser. Problem is the trees of course...
>
> Second choice is as you did: create a merged vocab and have both C and
> Java parsers/lexers import that merged vocab. Only trouble is that the
> vocab files needed to merge are created after ANTLR has generated the
> parsers so one needs to run ANTLR again on the C and Java parser to get
> their token types in sync.
>
> One could have the Java parser import the C parser's vocab then the
> output vocab of the Java parser would be the combined one. The C
> parser would have to be regenerated though. This is a drag. I wonder
> if we shouldn't change the problem around. You are really saying that
> you want to build a new grammar composed of other grammars. That is,
> you want to import the grammar not the vocabulary really. I plan on
> *not* allowing inheritance ala ANTLR 2.x. It will become grammar
> composition or I may leave it to an IDE tool living above ANTLR to do
> the composition, but for now let's look at what it might look like:
>
> grammar combined;
>
> import C; // include all rules from C
> import Java.expr; // include rule expr and all rules invoked by expr
>
> In this way, the token type issue is resolved because I don't have to
> rely on two different recognizers generated with independent token
> types. I would literally cut/paste (like I did with inheritance) the
> appropriate rules into the new combined grammar. Yes, i have to
> cut/paste because rules are not independent. If I override a rule in
> combined that is invoked by a rule in Java or C grammars then every
> rule could change it's lookahead (this is very unlike method
> inheritance).
>
> Now on to trees. Problem is solved because ID for both input grammars
> is now the single ID of combined grammar. Any trees created have the
> unique ID token type value. :)
>
> There are systems such as Visser's SDF that allow grammars to be reused
> when they are placed in so-called "modules". You can import whatever
> rules you want or whole grammars. You can set parameters too. I can
> imagine a generic rule for a list of IDs that could handle
> comma-separated, colon-separated, or whatever lists. You could define
> the grammar like a template:
>
> grammar commonStuff;
> idList<separator> : ID (<separator> ID)* ;
>
> Then in another grammar, you could do this:
>
> grammar whatever;
>
> import commonStuff.idList<COMMA> as idList;
>
> decl : "int" idList SEMI ;
>
> Something like that. Looks cool, but you know every time I try to
> think of a really great example of re-usage with parameterized rule
> imports, I get stuck. Seems like just plain "import a bunch of rules"
> is what we want. The normal usage would be to import an existing
> grammar and then change a few rules (for syntax or action changes) or
> to import rules from a few grammars to make a new one. The latter case
> is the only time it's more "powerful" than plain single inheritance.
>
> Still I question whether even this is that useful (ignoring tree
> walkers for now). Are you really ever going to build a new grammar by
> grabbing the Java expr rule set and the C statement set and then
> somehow hope the merged grammar will work? Perhaps if we made smaller
> grammars that were really just grammar subsets that were useful.
> Hmm...even then, I've never ever been in a situation where I felt I
> could wholesale grab large chunks of another grammar and reuse without
> serious modifications (read that cut/paste and modify). Now, I start
> most grammars by having to type or cut/paste lexer rules such as ID,
> WS, INT, and so on. I can see importing lexer rules as they are
> independent of each other, but an IDE could simply let you pick from a
> list of standard lexical rules to get you started. Seems hardly worth
> complicating the tool for this simple application.
>
> Back to trees with vocab not grammar imports...I'd like to hear from
> Monty to hear more about his use of combined vocabs in his tree walkers
> and his use of inheritance. I think Monty made a tree supergrammar
> that had all the tokens in it and then each tree walker phase of the
> translator subclassed that grammar to get all the tokens. That is
> basically the same as doing an importVocab on a combined vocab you
> built.
>
> In summary, i have no good solution ;) Can anybody add to this
> conversation about importing grammars and such?
>
> Ter
> --
> CS Professor & Grad Director, University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Cofounder, http://www.jguru.com
> Cofounder, http://www.knowspam.net enjoy email again!
> Cofounder, http://www.peerscope.com pure link sharing
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list