[antlr-interest] Merging token vocabularies

Wed Jun 23 18:36:34 PDT 2004

Frank,

Here goes. I'm fairly new to ANTLR, so I would appreciate feedback from
the regulars if my advice is flawed.

Even though it works, I agree that one huge token vocabulary with all
possible tokens across all languages is a bit of a nightmare. The main
hassle would be managing the tokens to ensure that the same concepts
have the same name and that you don't start duplicating them.

However, depending on what you are trying to achieve, I still believe
that it may be worthwhile importing a file that defines the common
concepts, e.g. IDENTIFIER, LOOP, STATEMENT etc., i.e. things that are
common across all the languages involved.

Anyway, it is possible to translate a tree from one vocab to another.
The "int" types that ANTLR uses are really for internal purposes. You
can recurse through a tree and change the types to those of a different
vocabulary.

This can be done in one of two ways. 

1) Use the TokenNames array accessible from a parser or tree walker. Look
up the int type you want to change in the source array to get its name,
search the target's array for that name to get the target int type, then
update the AST node with it.

2) Do the same as 1, except use reflection against the generated
xxxxTokenTypes.java interfaces instead of using the TokenNames arrays.

Of course, the two token vocabularies must at least overlap for the
types found in the tree.
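
As a rough sketch of the two approaches above (not tested against ANTLR
itself): the Node class below is a hypothetical stand-in for
antlr.collections.AST that borrows only its getType/setType/
getFirstChild/getNextSibling accessors, and VocabRemap, its method names,
and the sample tokens are all made up for illustration. The sample arrays
assume that the first four slots of a generated names array are reserved,
which is worth checking against your own generated code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

final class VocabRemap {

    /** Hypothetical stand-in for antlr.collections.AST (same accessor names). */
    static final class Node {
        int type;
        final String text;
        Node firstChild;
        Node nextSibling;

        Node(int type, String text) { this.type = type; this.text = text; }
        int getType() { return type; }
        void setType(int newType) { type = newType; }
        Node getFirstChild() { return firstChild; }
        Node getNextSibling() { return nextSibling; }
    }

    /** Way 1: invert a token names array into a name -> int type map. */
    static Map<String, Integer> typesFromNames(String[] tokenNames) {
        Map<String, Integer> byName = new HashMap<>();
        for (int type = 0; type < tokenNames.length; type++) {
            if (tokenNames[type] != null) {
                byName.putIfAbsent(tokenNames[type], type);
            }
        }
        return byName;
    }

    /** Way 2: collect the static int constants of a token types interface. */
    static Map<String, Integer> typesFromInterface(Class<?> tokenTypes) {
        Map<String, Integer> byName = new HashMap<>();
        try {
            for (Field f : tokenTypes.getFields()) {
                if (Modifier.isStatic(f.getModifiers()) && f.getType() == int.class) {
                    byName.put(f.getName(), f.getInt(null));
                }
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("cannot read " + tokenTypes.getName(), e);
        }
        return byName;
    }

    /** Rewrites the type of a node, its siblings, and all descendants. */
    static void remap(Node node, String[] sourceNames, Map<String, Integer> target) {
        for (Node n = node; n != null; n = n.getNextSibling()) {
            int oldType = n.getType();
            if (oldType < 0 || oldType >= sourceNames.length) {
                throw new IllegalArgumentException("unknown source type " + oldType);
            }
            Integer newType = target.get(sourceNames[oldType]);
            if (newType == null) {
                // The vocabularies do not overlap for this node.
                throw new IllegalArgumentException(
                        "no target type named " + sourceNames[oldType]);
            }
            n.setType(newType);
            remap(n.getFirstChild(), sourceNames, target);
        }
    }

    /** Made-up target vocabulary, shaped like a generated token types interface. */
    interface TargetTokenTypes {
        int LOOP = 4;
        int IDENTIFIER = 5;
    }

    public static void main(String[] args) {
        // Made-up source vocabulary: IDENTIFIER = 4, LOOP = 5.
        String[] sourceNames =
                {"<0>", "EOF", "<2>", "NULL_TREE_LOOKAHEAD", "IDENTIFIER", "LOOP"};
        Node root = new Node(5, "while");
        root.firstChild = new Node(4, "x");

        remap(root, sourceNames, typesFromInterface(TargetTokenTypes.class));
        System.out.println(root.getType() + " " + root.getFirstChild().getType());
    }
}
```

With real generated classes you would pass the parser's or walker's names
array as sourceNames and build the target map from either the target's
names array or its token types interface. One thing to watch: I would not
assume the two sources spell every name identically (literal/keyword
tokens in particular), so it is probably safest to build both sides from
the same kind of source.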

Regards,
Don.

-----Original Message-----
From: FranklinChen at cmu.edu [mailto:FranklinChen at cmu.edu] 
Sent: Thursday, June 24, 2004 5:28 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] Merging token vocabularies

I have been writing multiple parsers to and from various formats, and
have reached a point at which it is impossible to use pure ANTLR
mechanisms to manage all the different token vocabularies.  For
example, I parse language C to language X, where I do this by building
an AST for C and then using a tree parser from the C AST to an X AST,
then a tree walker for the X AST to serialize it.  I also parse
languages A and B
to C.  Furthermore, most of the parsers actually call other parsers in
order to do the complete job, e.g., a parser for E might get an E
token but then create a D lexer and D parser in order to build an AST
that gets embedded into the E AST.

In short, because of the rampant sharing of token space, the only way
to manage all this seems to be to have all of the lexers, parsers, and
tree parsers importVocab from a single huge CommonTokenTypes, which is
the union of everything that exists.  So scripts have to be used in
order to make sure that CommonTokenTypes is generated properly and
ANTLR rerun until a fixed point is reached.  Does anyone else have an
alternate strategy for handling the problem of token space?

-- 
Franklin
