[antlr-interest] RE: Combination of TokenStreamHiddenTokenFilter and TokenStreamSelector

Fri May 5 17:44:01 PDT 2006

Last week I posted this question to the list, but I haven't received a single reply?

Do we not have any subscribers to this list that have experience with using the TokenStreamHiddenTokenFilter together with the TokenStreamSelector?

I am in a dead-end here and all help will be greatly appreciated

Best regards,

Arni

________________________________

From: Arni J. Reginsson 
Sent: 27. apríl 2006 08:08
To: 'antlr-interest at antlr.org'
Subject: Combination of TokenStreamHiddenTokenFilter and TokenStreamSelector

Hello,

First the context that we are working in: 

        We are using antlr-2.7.6

        The C# version of the library

Then the solution we are building using ANTLR:

We have successfully implemented a Microsoft Pascal ==> C# translator that uses multiple lexers and the TokenStreamSelector to translate complicated Pascal programs (that use interfaces and include statements) to C#. We have already translated more than 200K of pascal lines to c#.

So far we have not found a sound solution to extend the translator to preserve pascal comments. But we have been extending the translator using the whitepaper "Preserving Whitespace During Translation <http://antlr.org/article/whitespace/index.html>" as a guideline, and here we step into a problem.

The problem:

On a change of lexers (forced by our PascalLexer uponEOF()) the TokenStreamHiddenTokenFilter method IToken nextToken() 

skips the last token of the file currently being parsed/lexed and returns as the next token the first token of next $include (pascal interface) statement.

The pascal file that we are translating begins with several $include statements liken:

(*$INCLUDE:'MIS_INTF.PAS'*)

(*$INCLUDE:'CIO_INTF.PAS'*)

...

The first $include (MIS_INTF.PAS) is lexed/parsed correctly upon the point where it reaches the end of the file:

...

        BEGIN

        END;

At this point the parser expects SEMI token as the last token in the stream, but the TokenStreamHiddenTokenFilter nextToken returns the first statement of CIO_INTF.PAS (i.e, the token INTERFACE) instead of SEMI.

The code that is giving us problems is (from the PascalParser.cs file):

public void interfacePart() //throws RecognitionException, TokenStreamException

{

            returnAST = null;

            ASTPair currentAST = new ASTPair();

            OpusAllt.MekkanisAST interfacePart_AST = null;

            interfaceHeading();

            astFactory.addASTChild(ref currentAST, (AST)returnAST);

            match(BEGIN);

            match(END);

            match(SEMI);

            interfacePart_AST = (OpusAllt.MekkanisAST)currentAST.root;

            returnAST = interfacePart_AST;

}

The match(SEMI) calls ends in (TokenStreamHiddenTokenFilter.cs file):

            override public IToken nextToken()

            {

< -- clip -- >

                  IHiddenStreamToken monitored = LA(1);

                  // point to hidden tokens found during last invocation

                  monitored.setHiddenBefore(lastHiddenToken);

                  lastHiddenToken = null;

                  // Look for hidden tokens, hook them into list emanating

                  // from the monitored tokens.

                  consume();

                  IHiddenStreamToken p = monitored;

                  // while hidden or discarded scarf tokens

                  while (hideMask.member(LA(1).Type) || discardMask.member(LA(1).Type))

                  {

                        if (hideMask.member(LA(1).Type))

                        {

                              // attach the hidden token to the monitored in a chain

                              // link forwards

                              p.setHiddenAfter(LA(1));

                              // link backwards

                              if (p != monitored)

                              {

                                    //hidden cannot point to monitored tokens

                                    LA(1).setHiddenBefore(p);

                              }

                              p = (lastHiddenToken = LA(1));

                        }

                        consume();

                  }

                  return monitored;

            }

When we reach the point after the < -- clip -- > statement the "monitored" variable and "nextMonitoredToken" contains {[";",<30>,line=51,col=4]} which is correct.

BUT after the first "consume ()" (before the while loop) method has been called "nextMonitoredToken" (LA(1)) is still {[";",<30>,line=51,col=4]} (again still correct) but the "monitored" token now contains {["INTERFACE",<31>,line=24,col=1]} (incorrect) which is returned as the next token.

I therefore think that the implementation of TokenStreamHiddenTokenFilter nextToken might not been correct with regards to the multilexer implementations like we are using. 

The code for uponEOF, which we have been successfully using for a while now is also shown below - in case there is some better way to switch back the lexers after we hit an EOF in an include file:

    public override void uponEOF() {

            PascalMain.currentInstance.currentFileName = PascalMain.currentInstance.oldtranslateFileName;

            if ( PascalMain.currentInstance.selector.getCurrentStream() != PascalMain.currentInstance.mainLexer ) {

                  // don't allow EOF until main lexer.  Force the

                  // selector to retry for another token.

                  PascalMain.currentInstance.selector.pop(); // return to old lexer/stream

                  PascalMain.currentInstance.selector.retry();

            }

            else {

                  Console.WriteLine ("Hit EOF of main file");

            }

      }

Is this a known problem? Do you have a workaround?

Best regards,

Arni Jon Reginsson

www.mekkanis.is

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20060506/1f394f48/attachment-0001.html