[antlr-interest] GUnit failing / Parser appending EOF to input tokens?

Tue Jul 10 07:15:08 PDT 2012

Hello All,
I have a fairly simple Parser and Lexer, generated without problems by
ANTLR v3.4, and more than 2,500 GUnit tests.

The Parser seems to work as expected, but it fails some GUnit tests
because the number of tokens doesn't match an expectation in the GUnit
code itself.

While debugging, I found that it fails at this point in
org.antlr.gunit.gUnitExecutor

(line numbers prefixed):
----------

378: /** Invalid input */
379: if ( tokens.index()!=tokens.size()-1 ) {
380:    //throw new InvalidInputException();
381:    ps2.print("Invalid input");
382: }

----------

This output to ps2 marks the test as failed even though it would
otherwise succeed.
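
To make the arithmetic of that check concrete, here is a minimal sketch
on plain integers. It is not GUnit's code: the class and method names
are made up for illustration, and it assumes the stream index rests on
the first EOF once parsing is done, which I have not confirmed.

```java
import java.util.Arrays;
import java.util.List;

class EofCheckSketch {
    // ANTLR 3 uses token type -1 for EOF (the <-1> in the printed tokens).
    static final int EOF = -1;

    // Same comparison as line 379 of gUnitExecutor, on plain ints:
    // the input is flagged unless the index sits on the last buffered token.
    public static boolean isInvalidInput(int index, int size) {
        return index != size - 1;
    }

    public static void main(String[] args) {
        // Token types only; 74 is the type shown for 'from' and 'Animal'.
        List<Integer> singleEof = Arrays.asList(74, 74, EOF);
        List<Integer> duplicateEof = Arrays.asList(74, 74, EOF, EOF);

        // ASSUMPTION: after parsing, the index points at the first EOF.
        int index = 2;

        System.out.println(isInvalidInput(index, singleEof.size()));    // false
        System.out.println(isInvalidInput(index, duplicateEof.size())); // true
    }
}
```

Under that assumption, one extra EOF in the buffer is enough to trip the
check, even though every real token was consumed.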

To debug the tokens of the failing test input "from Animal", I print
the Lexer output with this code snippet:

CommonTokenStream tokens = new CommonTokenStream( lexer );
System.out.println( tokens.getTokens() );

Which produces:
[[@0,0:3='from',<74>,1:0], [@1,5:10='Animal',<74>,1:5],
[@2,11:11='<EOF>',<-1>,1:11]]

*before* passing it through the generated Parser, but this different
output *after* parsing:

[[@0,0:3='from',<74>,1:0], [@1,5:10='Animal',<74>,1:5],
[@2,11:11='<EOF>',<-1>,1:11], [@3,11:11='<EOF>',<-1>,1:11]]

Note the duplication of the EOF token at the end.
The duplicate EOF is still there even if I invoke "tokens.reset();".
If I parse the same token stream multiple times, no further EOF tokens
are appended.
Currently I'm skipping whitespace, but it behaves the same when I send
whitespace to a hidden channel instead.

I would greatly appreciate some advice on why this could happen, and
on what I could look into:

1# is the check done by GUnit really suggesting a problem, or should
it be relaxed?

2# I tried to have my generated Lexer extend a custom class that
overrides getTokens() to artificially clean up the duplicates, but it
seems I can't define a superclass for the Lexer [1], or I couldn't
figure out the syntax.

3# Is the Parser somehow changing the input token stream? Would that
be expected? (and needed?)

4# Since GUnit directly creates the components and controls all the
wiring, I couldn't find a way to work around this: I can't "clean up"
the duplicate EOF or have it ignored at token count time.
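
For reference, this is roughly what I mean by relaxing the check (1#)
and cleaning up the duplicates (2#). It is only a sketch on plain lists
of token types: collapseTrailingEof and onlyEofRemains are names I made
up, not ANTLR or GUnit API, and I have no way to plug them into GUnit as
it stands.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrailingEofSketch {
    // ANTLR 3 uses token type -1 for EOF.
    static final int EOF = -1;

    // Returns a copy of the token types in which a run of trailing EOFs
    // is collapsed to a single EOF. The input list is left unchanged.
    public static List<Integer> collapseTrailingEof(List<Integer> types) {
        List<Integer> result = new ArrayList<>(types);
        int last = result.size() - 1;
        while (last > 0 && result.get(last) == EOF && result.get(last - 1) == EOF) {
            result.remove(last); // int argument, so this removes by index
            last--;
        }
        return result;
    }

    // Relaxed check: the input counts as fully consumed if nothing but
    // EOF tokens remains from the given index onward.
    public static boolean onlyEofRemains(List<Integer> types, int index) {
        for (int i = index; i < types.size(); i++) {
            if (types.get(i) != EOF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> afterParse = Arrays.asList(74, 74, EOF, EOF);

        System.out.println(collapseTrailingEof(afterParse)); // [74, 74, -1]
        System.out.println(onlyEofRemains(afterParse, 2));   // true
        System.out.println(onlyEofRemains(afterParse, 1));   // false
    }
}
```

Either approach would let the "from Animal" test pass while still
flagging input that leaves real tokens unconsumed.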

While I'd love some enlightenment about what might be happening, I
would also appreciate some practical advice on how to work around this,
as the Parser I have is otherwise working fine.
I would rather not delete otherwise valid tests.

TiA,
Sanne

1 - http://www.antlr.org/wiki/display/ANTLR3/add+superClass+to+lexer