[antlr-interest] passing stuff from lexer to parser

siemsen at ucar.edu siemsen at ucar.edu
Wed Jan 2 13:54:47 PST 2008


Tom, Gavin,

As you suggest, my design is a bit strange, but it's driven by the  
arrangement of the source files.  I don't control the source  
files.  They are class definitions of the Common Information Model  
(see http://www.dmtf.org/standards/cim/cim_schema_v217/).  Each class  
definition is in a single file, in a language called Managed Object  
Format, or MOF.  I'm translating them into Java files.  The source  
files are arranged one per class, as several hundred separate files in  
about 13 subdirectories.  A single top-level file contains include  
statements that include all the other files, preceded by 2 files that  
define language elements used in every other file.  The include file  
starts like this:

#pragma include ("qualifiers.mof")
#pragma include ("qualifiers_optional.mof")
#pragma include ("Core/CIM_ManagedElement.mof")
#pragma include ("Core/CIM_ManagedSystemElement.mof")
#pragma include ("Core/CIM_SystemStatisticalInformation.mof")
#pragma include ("Database/CIM_CommonDatabaseSettingData.mof")
#pragma include ("Database/CIM_CommonDatabaseStatistics.mof")
#pragma include ("Database/CIM_DatabaseResourceStatistics.mof")

etc.

The top-level file contains nothing but include statements, and none  
of the other files contain include statements, but the first 2  
included files contain code that is needed by each of the other  
files.  The files are included in order such that superclasses are  
defined before subclasses, but that isn't really important for  
translation.
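Since each include path already carries the subdirectory, the name-to-directory map discussed later in this thread can be rebuilt from the paths alone.  A minimal sketch, with class and method names of my own invention:

```java
import java.util.*;

public class IncludeMap {
    // Map a class name like CIM_CommonDatabaseStatistics to the output
    // subdirectory named by its include path ("Database" here; "" for
    // top-level files such as qualifiers.mof).
    static Map<String, String> classToDir(List<String> includePaths) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String path : includePaths) {
            int slash = path.lastIndexOf('/');
            String dir = slash < 0 ? "" : path.substring(0, slash);
            String file = path.substring(slash + 1);
            String cls = file.endsWith(".mof")
                    ? file.substring(0, file.length() - 4) : file;
            map.put(cls, dir);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(classToDir(Arrays.asList(
                "qualifiers.mof",
                "Database/CIM_DatabaseResourceStatistics.mof")));
    }
}
```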

When I wrote the translator, I just implemented the "how to handle  
include files" scheme, before I learned that there were no other  
include statements to be found in any of the other files.  I found  
out that ANTLR has no trouble lexing the entire set of files into a  
single stream of tokens.  It seemed "big", but it works, and I'm not  
sure it's a good idea to change to an approach that would parse the  
files one-at-a-time.  Such an approach would assume that the layout  
of the source files is static.  New versions of the Common  
Information Model appear fairly regularly, and there's no guarantee  
that the layout of the source files won't change.
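For comparison, the per-file approach doesn't actually have to assume a static layout: a driver can read whatever the current top-level file lists and translate each file independently.  A rough sketch (the regex and all names are mine; translateOneFile is a stand-in for a real lexer/parser run):

```java
import java.util.*;
import java.util.regex.*;

public class MofDriver {
    // Matches: #pragma include ("Core/CIM_ManagedElement.mof")
    static final Pattern INCLUDE =
            Pattern.compile("#pragma\\s+include\\s*\\(\\s*\"([^\"]+)\"\\s*\\)");

    // Pull the relative paths out of the top-level file's lines.
    static List<String> includePaths(List<String> topLevelLines) {
        List<String> paths = new ArrayList<>();
        for (String line : topLevelLines) {
            Matcher m = INCLUDE.matcher(line);
            if (m.find()) paths.add(m.group(1));
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> top = Arrays.asList(
                "#pragma include (\"qualifiers.mof\")",
                "#pragma include (\"Database/CIM_CommonDatabaseStatistics.mof\")");
        for (String path : includePaths(top)) {
            System.out.println(path);   // translateOneFile(path) would go here
        }
    }
}
```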

Would it be possible to inject a token into the token stream just  
before I switch to the include file and call reset?  In the  
PragmaInclude lexer rule, can I call "emit" to do it, and make the  
token contain the include file name?  I haven't done anything like  
this before; I just wonder if it's reasonable.
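As I understand it, the usual ANTLR 3 answer is yes: the lexer can emit a manually built token, and nextToken() is overridden to drain a queue of pending tokens before lexing more input (the multiple-token-emission pattern from the ANTLR FAQ).  Here is a minimal sketch of just the queue idea, written without the ANTLR runtime; all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class InjectingLexer {
    public static class Tok {
        public final String type, text;
        public Tok(String type, String text) { this.type = type; this.text = text; }
    }

    private final Deque<Tok> pending = new ArrayDeque<>();

    // Call this from the PragmaInclude rule's action, just before
    // switching the character stream to the included file: the parser
    // will then see an INCLUDE token carrying the file name.
    public void emitInclude(String fileName) {
        pending.add(new Tok("INCLUDE", fileName));
    }

    // The parser pulls tokens through here; queued tokens come first.
    public Tok nextToken() {
        if (!pending.isEmpty()) return pending.poll();
        return lexNext();
    }

    private Tok lexNext() {
        return new Tok("EOF", "");  // stand-in for real lexing
    }
}
```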

-- Pete


On Jan 1, 2008, at 10:34 PM, Thomas Brandon wrote:

> On Jan 2, 2008 3:48 PM,  <siemsen at ucar.edu> wrote:
>> Gavin,
>>
>> Thanks, that makes perfect sense.  It's certainly better than what I
>> was trying to do with a HashMap.  I think I'm thinking about this
>> more clearly now.
>>
>> I understand the idea, but I can't seem to implement it.  I have a
>> "PragmaInclude" lexer rule that reads each include statement and
>> switches the input stream to the new file.  It works.  I'd like to do
>> what you suggest, and access the PragmaInclude token in the parser,
>> so the parser can see the file name.  The odd thing is that the lexer
>> doesn't seem to generate a PragmaInclude token.
>>
>> Attached is the grammar.  In it, the "compilerDirective" parser rule
>> uses the PragmaInclude token.  I couldn't get compilerDirective to
>> "fire" while parsing.  I discovered that I could comment out the
>> compilerDirective rule completely and the translator would still
>> behave the same.  It seems to me that the lexer never creates a
>> PragmaInclude token, even though the PragmaInclude rule definitely
>> executes.
>>
>> What am I missing?
> The call to Lexer.reset() clears the token information set by the
> PragmaInclude rule. In fact, the call to setCharStream calls reset()
> too, so this seems redundant, and removing it won't solve the issue
> (the extra call will additionally seek the new token stream to 0,
> but this shouldn't be needed). Rather than calling setCharStream you
> could update input directly and not call reset, though this is not
> really advisable, as future versions of ANTLR could easily break
> this (I think 3.1 will).
> Your design seems somewhat strange. Can the top level file also
> include normal statements or only includes? Where does the output for
> normal statements go? Can the included files contain includes and if
> so what happens with the output for them?
> It looks like you're processing a list of different input files to be
> separately processed, not a file with includes. In that case I think
> Gavin's suggestion of separately processing each file is better. Then
> your top-level grammar would just handle the include syntax and end up
> with a list linking include file names to ASTs or templates or
> whatever the result of processing each include is.
>
> Tom.
>>
>> -- Pete
>>
>>
>>
>>
>> On Jan 1, 2008, at 3:13 PM, Gavin Lambert wrote:
>>
>>> At 10:02 2/01/2008, siemsen at ucar.edu wrote:
>>>> To handle the include statements, I use the mechanism described in
>>>> the ANTLR Wiki page titled "How do I implement include files?".
>>>> It works great.  It does its magic during the lexer phase.  So all
>>>> the source files are lexed first into one big token stream, then
>>>> the parser starts.
>>>>
>>>> Currently, my translator just emits output to standard out, as one
>>>> text stream.  Now I'm ready to make it put the output into
>>>> directories and files.  The source text is a set of things with
>>>> names like CIM_DatabaseResourceStatistics, so I know what to name
>>>> each output file.  I just need to know what directory to put each
>>>> output file in.
>>>
>>>> During the lexer phase, I store the name-to-directory information
>>>> in a HashMap.  So for example, the HashMap tells me that the
>>>> output file named CIM_DatabaseResourceStatistics.java belongs in
>>>> the output subdirectory named "Database".
>>>>
>>>> I need to pass the HashMap from the lexer to the parser.  Is there
>>>> a good way to do it?  Am I thinking about the problem correctly?
>>>
>>> Probably the easiest way to do this is to pass an INCLUDE token up
>>> to the parser that contains the full filename, and let the parser
>>> reconstruct the HashMap itself.  Or you could use it in a scope
>>> instead, since presumably everything else is logically contained
>>> within one or more INCLUDEs.
>>>
>>
>>
>>
