[antlr-interest] NEWBIE ANTLRWorks question: Missing zeroes in input

Mon Mar 15 16:20:27 PDT 2010

I am learning ANTLR, so my answer might not be the very best solution, but here goes.

I think you are right that there is a bug in ANTLR. When I copy your grammar and run it with your test data, I see the parser receiving only one zero when several zeros appear together.

To me that means looking into the lexer code generated by ANTLR, and in GrammarLogLexer.java I find this:
    ...
    public void mTokens() throws RecognitionException {
        // ...GrammarLog.g:1:8: ( REG7_TIPO | REG9_TIPO | HEX_DIGIT )
        int alt1=3;
        int LA1_0 = input.LA(1);
        if ( (LA1_0=='0') ) {
            int LA1_1 = input.LA(2);
            if ( (LA1_1=='0') ) {
                int LA1_3 = input.LA(3);
                if ( (LA1_3=='0') ) {
                    int LA1_4 = input.LA(4);
                    if ( (LA1_4=='7') ) {
                        alt1=1;
                    }
                    else if ( (LA1_4=='9') ) {
                        alt1=2;
                    }
                    else {
                        NoViableAltException nvae =
                            new NoViableAltException("", 1, 4, input);
                        throw nvae;
                    }
                }
                else {
                    NoViableAltException nvae =
                        new NoViableAltException("", 1, 3, input);
                    throw nvae;
                }
            }
            else {
                alt1=3;}
        }
        ...
The way I read this code, once the lexer sees two zeros in a row it will either predict REG7_TIPO or REG9_TIPO or throw a NoViableAltException. I think the two else blocks that throw should select HEX_DIGIT (alt1=3) instead.
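As a rough illustration of that reading, here is a small stand-alone Java sketch that mirrors the decision in the excerpt. It does not use the ANTLR runtime. The class and method names (LookaheadSketch, la, predict, tokenize) are made up for this example, the branch for a first character other than '0' is elided in the excerpt and is assumed here to go to HEX_DIGIT, and the "skip one character after an error" recovery is an assumption chosen because it would explain the missing zeros.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the generated decision; not ANTLR code.
class LookaheadSketch {
    // Mimics input.LA(k): the k-th character ahead of pos, or '\uffff' past the end.
    static char la(String s, int pos, int k) {
        int i = pos + k - 1;
        return i < s.length() ? s.charAt(i) : '\uffff';
    }

    // Returns 1 (REG7_TIPO), 2 (REG9_TIPO) or 3 (HEX_DIGIT).
    // Throws where the excerpt throws NoViableAltException.
    static int predict(String s, int pos) {
        if (la(s, pos, 1) == '0') {
            if (la(s, pos, 2) == '0') {
                if (la(s, pos, 3) == '0') {
                    char c = la(s, pos, 4);
                    if (c == '7') return 1;
                    if (c == '9') return 2;
                    throw new IllegalStateException("no viable alternative at LA(4)");
                }
                throw new IllegalStateException("no viable alternative at LA(3)");
            }
            return 3;
        }
        return 3; // assumption: this part of the generated code is elided in the excerpt
    }

    // Assumed recovery: after an error, drop one character and try again.
    // Any single character is treated as a HEX_DIGIT to keep the model short.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            try {
                int alt = predict(s, pos);
                if (alt == 3) {
                    out.add(String.valueOf(s.charAt(pos)));
                    pos++;
                } else {
                    String literal = (alt == 1) ? "0007FFF8" : "0009FFF6";
                    if (!s.startsWith(literal, pos)) {
                        throw new IllegalStateException("literal mismatch");
                    }
                    out.add(alt == 1 ? "REG7_TIPO" : "REG9_TIPO");
                    pos += literal.length();
                }
            } catch (IllegalStateException e) {
                pos++; // the character at pos is lost
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("00A"));      // prints [0, A]
        System.out.println(tokenize("0007FFF8")); // prints [REG7_TIPO]
    }
}
```

Under these assumptions, "00A" loses its first zero: the decision at the first character throws, one character is skipped, and only the second zero comes out as a HEX_DIGIT.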

OK, my first attempt was to turn the two tokens into actual lexer rules in the grammar, placed after the existing HEX_DIGIT rule (and to comment out the token definitions near the top). That reorganizes the generated code slightly but does not change the behavior; it still has the same problem.

The next option is to put a gated semantic predicate on the two lexer rules (place these after the HEX_DIGIT rule). This forces the two tokens to appear only at the beginning of a line and turns off their generation later in the line. Someone with more experience might be able to give a better predicate than what I have placed here:
        REG7_TIPO    :        {$pos == 0}?=>    '0007FFF8'; 
        REG9_TIPO    :        {$pos == 0}?=>    '0009FFF6';

Well, that will make your test data work (although I had to add six more hex digits at the end; I didn't check whether your test had the correct string length). It does appear to pick up the zeros correctly.

I can see two potential problems with this. First, if your language can have plain hex digits at the beginning of a line, that would make my predicates invalid. Second, if the two tokens can appear at any other location within the "sentence", the predicates would disqualify them there. Since this is a scaled-down version of your grammar, I am not sure whether either of these is true.
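To make the effect of the gate and its two caveats concrete, here is another small stand-alone Java sketch. It is only a model of how I would expect the gated decision to behave, not the code ANTLR generates: the names GatedSketch, predict and tokenize are hypothetical, the input string stands for a single line so the string index plays the role of $pos, and the "skip one character after an error" recovery is again an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the gated decision; not ANTLR code.
class GatedSketch {
    static final String REG7 = "0007FFF8";
    static final String REG9 = "0009FFF6";

    // Returns 1 (REG7_TIPO), 2 (REG9_TIPO) or 3 (HEX_DIGIT).
    static int predict(String line, int pos) {
        boolean gate = (pos == 0); // models {$pos == 0}?=>
        if (gate && line.startsWith("00", pos)) {
            // At column 0 the two long tokens are still candidates,
            // so the original decision (and its error path) still applies.
            if (line.startsWith("0007", pos)) return 1;
            if (line.startsWith("0009", pos)) return 2;
            throw new IllegalStateException("no viable alternative");
        }
        return 3; // gate closed: only HEX_DIGIT is visible
    }

    // Assumed recovery: after an error, drop one character and try again.
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            try {
                int alt = predict(line, pos);
                if (alt == 3) {
                    out.add(String.valueOf(line.charAt(pos)));
                    pos++;
                } else {
                    String literal = (alt == 1) ? REG7 : REG9;
                    if (!line.startsWith(literal, pos)) {
                        throw new IllegalStateException("literal mismatch");
                    }
                    out.add(alt == 1 ? "REG7_TIPO" : "REG9_TIPO");
                    pos += literal.length();
                }
            } catch (IllegalStateException e) {
                pos++; // the character at pos is lost
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("0007FFF8001")); // prints [REG7_TIPO, 0, 0, 1]
        System.out.println(tokenize("00A1"));        // prints [0, A, 1]
        System.out.println(tokenize("A0007FFF8"));   // prints [A, 0, 0, 0, 7, F, F, F, 8]
    }
}
```

In this model the zeros after the leading token survive, a line that starts with plain hex digits "00..." still loses a zero (the first caveat), and a REG7_TIPO literal in the middle of a line comes out as individual hex digits (the second caveat).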

I also tried making REG7_TIPO and REG9_TIPO start with a non-hex character, and then everything worked right away without gated semantic predicates. There were also no problems with the tokens appearing in the middle of a line. This would actually be the best fix from a code perspective, although I am not sure it is possible for your input.

Hope this helps.
Wayne

PS. I think this might be exactly the same problem as in your other questions that also mention multiple zeros.