[antlr-interest] passing stuff from lexer to parser

Sun Jan 6 18:25:11 PST 2008

Gavin,

My comments inline...

On Jan 2, 2008, at 3:59 PM, Gavin Lambert wrote:

> At 10:54 3/01/2008, siemsen at ucar.edu wrote:
>> The top-level file contains nothing but include statements, and  
>> none of the other files contain include statements, but the first  
>> 2 included files contain code that is needed by each of the other  
>> files.  The files are included in order such that superclasses are  
>> defined before subclasses, but that isn't really important for  
>> translation.
>
> What sort of code?  Constants?  Superclasses?  Support classes/ 
> methods?  Any of those could be dealt with as a separate file  
> easily enough.

The 2 included files contain code that defines about 50 "qualifiers"
that can be applied to the classes, methods and fields defined in the
other files, so those 2 files need to be parsed before any of the
others.  The source is organized as a bunch of separately included
files, so, as you and Thomas Brandon have suggested, it seems
reasonable to parse them separately.  I could do so, but then
the translator would rely on the current structure of the source
files.  I don't control that structure, and those who do might change
it, so I don't want to "hard code" it into the translator.  ANTLR has
no trouble reading the source and producing one large token stream,
so I don't see a compelling reason not to let it do so.

My problem is that the suggested way of handling include files with
ANTLR has an annoying feature - the lexer consumes the include
statements without producing any tokens for them.  Other than that,
include processing works fine.  If my parser didn't need to know the
names of the included files, there would be no issue.  I just need a
slightly better mechanism for handling include files - one that
allows the parser to see the file names.  Hence my next question...

>> Would it be possible to inject a token into the token stream just  
>> before I switch to the include file and call reset?  In the  
>> PragmaInclude lexer rule, can I call "emit" to do it, and make the  
>> token contain the include file name?  I haven't done anything like  
>> this before, I just wonder if it's reasonable.
>
> Lexer operation is basically just calling nextToken to retrieve one  
> token at a time.  Calling emit sets the data for that token; not  
> calling it will lead to generating a default token based on all the  
> characters matched by the rule.
>
> I'm not really familiar with the Java runtime, so I'm not sure what  
> the reset call affects.  It might destroy an emit as well (and you  
> probably can't emit afterwards successfully either).  Still, it  
> could be worth a try.
>
> The rule must currently be returning *something*, though, since  
> every top-level lexer rule called must return a token.  Trace it  
> through with a debugger and see what's going on.

I tried adding a call to emit right before the calls to setCharStream  
and reset.  As Thomas Brandon predicted, nothing happened, probably  
because the calls to setCharStream and reset destroy the token(s)  
created by the lexer rule.  I tried putting the call to emit right  
after the call to reset, even though that's not of much value to me -  
I want the parser to know the include file name before it sees tokens  
from the include file.  That generated this:

Exception in thread "main" java.lang.ClassCastException: org.antlr.runtime.ClassicToken
         at cimmof2javaLexer.nextToken(cimmof2javaLexer.java:111)
         at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
         at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
         at cimmof2javaParser.mofSpecification(cimmof2javaParser.java:141)
         at cimmof2java.main(cimmof2java.java:24)

Line 111 in cimmof2javaLexer.java is

     		if (((CommonToken)token).getStartIndex() < 0)

So when the token (a ClassicToken, according to the exception) is
cast to a CommonToken, boom.  I confess that I'm not sure how to
handle this.  If you're still interested, it may help to see a
current version of the grammar, which I've attached.
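
To make the failure concrete, here is a minimal self-contained sketch
of the same kind of cast.  The types below are stand-ins that only
borrow the runtime's class names; they are not the org.antlr.runtime
classes, and the method names are hypothetical.  The assumption is
simply that the two token classes are siblings under a common Token
interface, which is what the exception suggests.

```java
// Stand-in types only: these mimic the shape of the problem, not the ANTLR runtime.
class CastSketch {
    interface Token { String getText(); }

    static class CommonToken implements Token {
        private final String text;
        private final int start;
        CommonToken(String text, int start) { this.text = text; this.start = start; }
        public String getText() { return text; }
        int getStartIndex() { return start; }
    }

    // A sibling of CommonToken, not a subclass of it.
    static class ClassicToken implements Token {
        private final String text;
        ClassicToken(String text) { this.text = text; }
        public String getText() { return text; }
    }

    // Same shape as line 111: throws ClassCastException for a ClassicToken.
    static boolean startIndexUnset(Token t) {
        return ((CommonToken) t).getStartIndex() < 0;
    }

    // Guarded variant: only inspects the start index when the cast is legal.
    static boolean startIndexUnsetGuarded(Token t) {
        return t instanceof CommonToken && ((CommonToken) t).getStartIndex() < 0;
    }
}
```

If that assumption holds, emitting a token of the type the generated
lexer expects would presumably avoid the exception, though I have not
verified that against the runtime.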

I'll start a new antlr-interest thread that focuses on the mechanism  
for handling include files.  I think the parser should see the tokens  
in the include statement, and that the tokens from the included file  
should appear after the tokens that represent the include statement  
itself.
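
The behavior described above can be sketched with a toy lexer.  This
is not ANTLR code: the class, the "#include:NAME" directive syntax,
the token type names and the file names in the usage note are all
made up for illustration.  The idea is that the include token goes
into a queue of pending tokens before the input is switched, so
switching the input cannot lose it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy lexer: words separated by whitespace, "#include:NAME" switches input.
class IncludeSketch {
    static final class Tok {
        final String type, text;
        Tok(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    private final Map<String, String> files;                 // name -> contents, stands in for the file system
    private final Deque<Iterator<String>> inputs = new ArrayDeque<>(); // include stack
    private final Deque<Tok> pending = new ArrayDeque<>();   // emitted but not yet returned

    IncludeSketch(String topLevel, Map<String, String> files) {
        this.files = files;
        push(topLevel);
    }

    private void push(String name) {
        String text = files.get(name);
        if (text == null) throw new IllegalArgumentException("no such file: " + name);
        inputs.push(Arrays.asList(text.trim().split("\\s+")).iterator());
    }

    Tok nextToken() {
        while (true) {
            if (!pending.isEmpty()) return pending.poll();   // drain queued tokens first
            if (inputs.isEmpty()) return new Tok("EOF", "");
            Iterator<String> in = inputs.peek();
            if (!in.hasNext()) { inputs.pop(); continue; }   // end of an include: resume the includer
            String word = in.next();
            if (word.isEmpty()) continue;                    // empty file yields one empty word
            if (word.startsWith("#include:")) {
                String name = word.substring("#include:".length());
                pending.add(new Tok("INCLUDE", name));       // queue the token before switching input
                push(name);
            } else {
                pending.add(new Tok("WORD", word));
            }
        }
    }

    static List<String> tokenize(String topLevel, Map<String, String> files) {
        IncludeSketch lexer = new IncludeSketch(topLevel, files);
        List<String> out = new ArrayList<>();
        for (Tok t = lexer.nextToken(); !t.type.equals("EOF"); t = lexer.nextToken()) {
            out.add(t.toString());
        }
        return out;
    }
}
```

With a top-level file "top.mof" containing "#include:qualifiers.mof
#include:classes.mof", tokenize yields INCLUDE(qualifiers.mof), then
the words of qualifiers.mof, then INCLUDE(classes.mof), then the words
of classes.mof.  In an ANTLR 3 lexer the analogous step would
presumably be to override emit(Token) and nextToken() so that emitted
tokens are queued in this way, but that mapping is an assumption and
is not tested here.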

Thanks for all your help!

-- Pete
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cimmof2java.g
Type: application/octet-stream
Size: 18536 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20080106/32f71f86/attachment-0001.obj 