[antlr-interest] Re: Merging token vocabularies; Ter adds grammar composition
Terence Parr
parrt at cs.usfca.edu
Thu Jun 24 15:59:22 PDT 2004
On Jun 24, 2004, at 12:52 PM, FranklinChen at cmu.edu wrote:
> I think I need one huge token vocabulary in order to avoid overloading
> integers when embedding ASTs, as I mentioned.
>
> Suppose (for illustration only; this is nothing like what I'm actually
> doing) I want to process a file that looks like
>
> ===
> This is a document with C code and Java code:
> <c>
> class Bar {};
> </c>
> Java is here.
> <java>
> public class Foo {};
> </java>
> ===
>
> Presumably, I want to build an AST that looks like
> (DOCUMENT (TEXT "This is a document with C code and Java code:\n")
> (C (CLASS (NAME "Bar") ...))
> (TEXT "\nJava is here.\n")
> (JAVA (CLASS (NAME "Foo") ...)))
>
> and then use a tree walker to do my work.
>
> Assume that I have an independently created C parser and Java parser
> that I want to just use. I can't just use in my own parser
>
> options {
> importVocab = CParser;
> importVocab = JavaParser;
> }
>
> so I have to do something else. As I mentioned, what I am doing is
> generating a CommonTokenTypes.txt by parsing the CParserTokenTypes.txt
> and JavaParserTokenTypes.txt and merging the vocabularies, and then
> importing Common back into CParser and JavaParser and regenerating.
> This seems to be exactly what you are proposing: ensuring that a
> Common vocabulary be imported! Or am I misunderstanding you?
Hi Gang,
[warning: this got long] ;)
Good question / interesting problem. Since I'm working on the vocab
stuff for 3.0 at the moment, perhaps I'll think about this for a
paragraph or two or ten.
The problem you have is that both a C and a Java grammar will export
the same token, say, IF with different token types. Moreover, the
parsers have generated constant defs that say IF=342 and IF=2 or
something. Seems like it would be pretty hard to get a single lexer to
generate two different values. Ok, so that means one needs to have
separate lexers for both C and for Java. No problems as the token
"spaces" are then totally independent. One simply needs an overall
"controller" (could be a higher-level grammar) that invokes the C or
Java parser. Problem is the trees of course...
Second choice is as you did: create a merged vocab and have both C and
Java parsers/lexers import that merged vocab. Only trouble is that the
vocab files needed to merge are created after ANTLR has generated the
parsers so one needs to run ANTLR again on the C and Java parser to get
their token types in sync.
One could have the Java parser import the C parser's vocab then the
output vocab of the Java parser would be the combined one. The C
parser would have to be regenerated though. This is a drag. I wonder
if we shouldn't change the problem around. You are really saying that
you want to build a new grammar composed of other grammars. That is,
you want to import the grammar not the vocabulary really. I plan on
*not* allowing inheritance ala ANTLR 2.x. It will become grammar
composition or I may leave it to an IDE tool living above ANTLR to do
the composition, but for now let's look at what it might look like:
grammar combined;
import C; // include all rules from C
import Java.expr; // include rule expr and all rules invoked by expr
In this way, the token type issue is resolved because I don't have to
rely on two different recognizers generated with independent token
types. I would literally cut/paste (like I did with inheritance) the
appropriate rules into the new combined grammar. Yes, i have to
cut/paste because rules are not independent. If I override a rule in
combined that is invoked by a rule in Java or C grammars then every
rule could change it's lookahead (this is very unlike method
inheritance).
Now on to trees. Problem is solved because ID for both input grammars
is now the single ID of combined grammar. Any trees created have the
unique ID token type value. :)
There are systems such as Visser's SDF that allow grammars to be reused
when they are placed in so-called "modules". You can import whatever
rules you want or whole grammars. You can set parameters too. I can
imagine a generic rule for a list of IDs that could handle
comma-separated, colon-separated, or whatever lists. You could define
the grammar like a template:
grammar commonStuff;
idList<separator> : ID (<separator> ID)* ;
Then in another grammar, you could do this:
grammar whatever;
import commonStuff.idList<COMMA> as idList;
decl : "int" idList SEMI ;
Something like that. Looks cool, but you know every time I try to
think of a really great example of re-usage with parameterized rule
imports, I get stuck. Seems like just plain "import a bunch of rules"
is what we want. The normal usage would be to import an existing
grammar and then change a few rules (for syntax or action changes) or
to import rules from a few grammars to make a new one. The latter case
is the only time it's more "powerful" than plain single inheritance.
Still I question whether even this is that useful (ignoring tree
walkers for now). Are you really ever going to build a new grammar by
grabbing the Java expr rule set and the C statement set and then
somehow hope the merged grammar will work? Perhaps if we made smaller
grammars that were really just grammar subsets that were useful.
Hmm...even then, I've never ever been in a situation where I felt I
could wholesale grab large chunks of another grammar and reuse without
serious modifications (read that cut/paste and modify). Now, I start
most grammars by having to type or cut/paste lexer rules such as ID,
WS, INT, and so on. I can see importing lexer rules as they are
independent of each other, but an IDE could simply let you pick from a
list of standard lexical rules to get you started. Seems hardly worth
complicating the tool for this simple application.
Back to trees with vocab not grammar imports...I'd like to hear from
Monty to hear more about his use of combined vocabs in his tree walkers
and his use of inheritance. I think Monty made a tree supergrammar
that had all the tokens in it and then each tree walker phase of the
translator subclassed that grammar to get all the tokens. That is
basically the same as doing an importVocab on a combined vocab you
built.
In summary, i have no good solution ;) Can anybody add to this
conversation about importing grammars and such?
Ter
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com
Cofounder, http://www.knowspam.net enjoy email again!
Cofounder, http://www.peerscope.com pure link sharing
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list