[antlr-interest] Combination of TokenStreamHiddenTokenFilter and TokenStreamSelector

Arni J. Reginsson ajr at mekkanis.com
Thu Apr 27 01:07:59 PDT 2006


Hello,

 

First, the context we are working in:

        We are using antlr-2.7.6

        The C# version of the library

 

Then the solution we are building using ANTLR:

We have successfully implemented a Microsoft Pascal ==> C# translator
that uses multiple lexers and a TokenStreamSelector to translate
complicated Pascal programs (ones that use interfaces and include
statements) to C#. We have already translated more than 200K lines of
Pascal to C#.
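
In case it helps to see the shape of it, the selector wiring is essentially
the standard ANTLR multi-lexer pattern; a minimal sketch (the file name and
the "main" stream key below are illustrative, our real setup lives in
PascalMain):

    // using antlr; using System.IO;
    TokenStreamSelector selector = new TokenStreamSelector();
    // the main lexer reads the top-level Pascal source file
    PascalLexer mainLexer = new PascalLexer(new StreamReader("MAIN.PAS"));
    selector.addInputStream(mainLexer, "main");
    selector.select("main");

Include files are handled by pushing a new lexer onto the selector when an
$INCLUDE directive is seen and popping it again at EOF (see the uponEOF()
code near the end of this mail).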

 

So far we have not found a sound way to extend the translator to
preserve Pascal comments. We have been working on this using the
whitepaper "Preserving Whitespace During Translation
<http://antlr.org/article/whitespace/index.html>" as a guideline, and
this is where we run into a problem.
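
Concretely, what we are attempting is to put the filter between the selector
and the parser, roughly as sketched below (simplified; the token type names
and the exact token-object class string are illustrative):

    // using antlr;
    // each lexer has to emit hidden-stream tokens so comments can be chained onto them
    mainLexer.setTokenObjectClass("antlr.CommonHiddenStreamToken");

    // the filter hides (rather than discards) the comment/whitespace token types,
    // and the parser reads its tokens from the filter
    TokenStreamHiddenTokenFilter filter = new TokenStreamHiddenTokenFilter(selector);
    filter.hide(PascalLexerTokenTypes.COMMENT);
    filter.hide(PascalLexerTokenTypes.WS);
    PascalParser parser = new PascalParser(filter);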

 

The problem:

When the lexers are switched (forced by our PascalLexer uponEOF()), the
TokenStreamHiddenTokenFilter method IToken nextToken() skips the last
token of the file currently being parsed/lexed and returns, as the next
token, the first token of the next $include (Pascal interface) file.

 

The Pascal file that we are translating begins with several $include
statements like:

(*$INCLUDE:'MIS_INTF.PAS'*)

(*$INCLUDE:'CIO_INTF.PAS'*)

...

 

The first $include (MIS_INTF.PAS) is lexed/parsed correctly up to the
point where it reaches the end of the file:

...

        BEGIN

        END;

 

At this point the parser expects a SEMI token as the last token in the
stream, but TokenStreamHiddenTokenFilter.nextToken() returns the first
token of CIO_INTF.PAS (i.e. the INTERFACE token) instead of SEMI.

 

The code that is giving us problems is (from the PascalParser.cs file):

 

    public void interfacePart() //throws RecognitionException, TokenStreamException
    {
        returnAST = null;
        ASTPair currentAST = new ASTPair();
        OpusAllt.MekkanisAST interfacePart_AST = null;

        interfaceHeading();
        astFactory.addASTChild(ref currentAST, (AST)returnAST);
        match(BEGIN);
        match(END);
        match(SEMI);
        interfacePart_AST = (OpusAllt.MekkanisAST)currentAST.root;
        returnAST = interfacePart_AST;
    }

 

The match(SEMI) call ends up in (TokenStreamHiddenTokenFilter.cs):

 

    override public IToken nextToken()
    {
        < -- clip -- >
        IHiddenStreamToken monitored = LA(1);
        // point to hidden tokens found during last invocation
        monitored.setHiddenBefore(lastHiddenToken);
        lastHiddenToken = null;

        // Look for hidden tokens, hook them into list emanating
        // from the monitored tokens.
        consume();
        IHiddenStreamToken p = monitored;
        // while hidden or discarded scarf tokens
        while (hideMask.member(LA(1).Type) || discardMask.member(LA(1).Type))
        {
            if (hideMask.member(LA(1).Type))
            {
                // attach the hidden token to the monitored in a chain
                // link forwards
                p.setHiddenAfter(LA(1));
                // link backwards
                if (p != monitored)
                {
                    // hidden cannot point to monitored tokens
                    LA(1).setHiddenBefore(p);
                }
                p = (lastHiddenToken = LA(1));
            }
            consume();
        }
        return monitored;
    }

 

When we reach the point just after the < -- clip -- > marker, both the
"monitored" variable and "nextMonitoredToken" contain
{[";",<30>,line=51,col=4]}, which is correct.

 

BUT after the first consume() (the one before the while loop) has been
called, "nextMonitoredToken" (LA(1)) is still {[";",<30>,line=51,col=4]}
(still correct), but the "monitored" token now contains
{["INTERFACE",<31>,line=24,col=1]} (incorrect), and that is what gets
returned as the next token.
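
For reference, the filter's lookahead machinery is tiny; paraphrased from the
runtime source, LA(1) and consume() are essentially:

    protected IHiddenStreamToken nextMonitoredToken;   // the single token of lookahead

    protected IHiddenStreamToken LA(int i)
    {
        return nextMonitoredToken;
    }

    protected void consume()
    {
        nextMonitoredToken = (IHiddenStreamToken) input.nextToken();
    }

Given that, we would expect consume() to replace only the buffered
nextMonitoredToken and to leave the already-fetched "monitored" reference
untouched.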

 

I therefore think that the implementation of
TokenStreamHiddenTokenFilter.nextToken() might not be correct with
regard to multi-lexer setups like the one we are using.

 

The code for uponEOF(), which we have been using successfully for a
while now, is shown below as well, in case there is a better way to
switch back to the previous lexer after we hit EOF in an include file:

 

    public override void uponEOF() {
        PascalMain.currentInstance.currentFileName =
            PascalMain.currentInstance.oldtranslateFileName;
        if ( PascalMain.currentInstance.selector.getCurrentStream()
             != PascalMain.currentInstance.mainLexer ) {
            // don't allow EOF until main lexer.  Force the
            // selector to retry for another token.
            PascalMain.currentInstance.selector.pop();   // return to old lexer/stream
            PascalMain.currentInstance.selector.retry();
        }
        else {
            Console.WriteLine("Hit EOF of main file");
        }
    }
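
For completeness, the forward switch happens in a PascalLexer action for the
$INCLUDE directive and follows the standard ANTLR include-file pattern,
roughly (simplified; includedFileName stands for the path extracted from the
directive):

    // inside the PascalLexer rule that matches (*$INCLUDE:'...'*)  -- simplified
    PascalLexer sublexer = new PascalLexer(new StreamReader(includedFileName));
    PascalMain.currentInstance.selector.push(sublexer);   // make the include file's lexer current
    PascalMain.currentInstance.selector.retry();          // force a fetch from the new lexer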

 

Is this a known problem? Do you have a workaround?

 

Best regards,

Arni Jon Reginsson

www.mekkanis.is