[antlr-interest] Lexer lookahead overoptimizes

shmuel siegel antlr at shmuelhome.mine.nu
Sat Apr 7 04:56:14 PDT 2007


Among other rules, I have these two lexer rules.
 SHIN:
    '\u00d7' '\u00a9' ('\u00d7' '\u0081')? ('\u00d7' '\u0082')?;
 TUF:
    '\u00d7' '\u00aa';

The code generated for "SHIN" correctly recognizes that the first optional 
group needs both of its characters to match before it is entered. The 
second optional group, however, is entered as soon as its first character 
matches. Therefore I get a recognition exception from the sequence 
'\u00d7' '\u00a9' '\u00d7' '\u00aa' (that is, SHIN followed by TUF).
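
To make the failure concrete, here is a minimal sketch that models the two 
optional groups outside of ANTLR. The class ShinLookaheadDemo and its 
methods are hypothetical names chosen for illustration. It is not generated 
code and uses no ANTLR classes. It only assumes the behavior described 
above: two characters of lookahead for the first group, one for the second.

```java
// Hypothetical stand-alone model of the SHIN rule; not ANTLR output.
public class ShinLookaheadDemo {
    private final String input;
    private int pos = 0;

    ShinLookaheadDemo(String input) { this.input = input; }

    // Character i positions ahead (1-based), or U+FFFF at end of input.
    private char la(int i) {
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : '\uFFFF';
    }

    // Consume one character or fail, like a lexer match() call.
    private void match(char expected) {
        if (la(1) != expected) {
            throw new IllegalStateException("mismatch at offset " + pos
                + ": expected U+" + Integer.toHexString(expected));
        }
        pos++;
    }

    // Optional pair, entered when the first `depth` lookahead characters agree.
    private void optionalPair(char first, char second, int depth) {
        boolean enter = la(1) == first && (depth < 2 || la(2) == second);
        if (enter) {
            match(first);
            match(second);
        }
    }

    // Returns the number of characters SHIN consumes, or throws on a mismatch.
    static int matchShin(String input) {
        ShinLookaheadDemo lx = new ShinLookaheadDemo(input);
        lx.match('\u00d7');
        lx.match('\u00a9');
        lx.optionalPair('\u00d7', '\u0081', 2); // first group: checks LA(1) and LA(2)
        lx.optionalPair('\u00d7', '\u0082', 1); // second group: checks LA(1) only
        return lx.pos;
    }

    public static void main(String[] args) {
        try {
            matchShin("\u00d7\u00a9\u00d7\u00aa"); // SHIN followed by TUF
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second group commits on the '\u00d7' that actually begins 
TUF, so the following match fails instead of SHIN ending after two 
characters.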

What I am saying will probably be clearer upon looking at the code 
produced for "SHIN". Note that for the second optional group (alt10) it 
only checks LA(1) for '\u00d7' and then insists on matching 
'\u00d7' '\u0082'.
    // $ANTLR start SHIN
    public final void mSHIN() throws RecognitionException {
        try {
            int _type = SHIN;
            // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:2: ( '\\u00d7' '\\u00a9' ( '\\u00d7' '\\u0081' )? ( '\\u00d7' '\\u0082' )? )
            // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:2: '\\u00d7' '\\u00a9' ( '\\u00d7' '\\u0081' )? ( '\\u00d7' '\\u0082' )?
            {
            match('\u00D7');
            match('\u00A9');
            // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:20: ( '\\u00d7' '\\u0081' )?
            int alt9=2;
            int LA9_0 = input.LA(1);

            if ( (LA9_0=='\u00D7') ) {
                int LA9_1 = input.LA(2);

                if ( (LA9_1=='\u0081') ) {
                    alt9=1;
                }
            }
            switch (alt9) {
                case 1 :
                    // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:21: '\\u00d7' '\\u0081'
                    {
                    match('\u00D7');
                    match('\u0081');

                    }
                    break;

            }

            // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:41: ( '\\u00d7' '\\u0082' )?
            int alt10=2;
            int LA10_0 = input.LA(1);

            if ( (LA10_0=='\u00D7') ) {
                alt10=1;
            }
            switch (alt10) {
                case 1 :
                    // E:\\downloads\\Eclipse\\learning\\Tamei\\grammar\\Miqroh.g:154:42: '\\u00d7' '\\u0082'
                    {
                    match('\u00D7');
                    match('\u0082');

                    }
                    break;

            }


            }

            this.type = _type;
        }
        finally {
        }
    }
    // $ANTLR end SHIN
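
For comparison, the following is a sketch of the prediction one would expect 
for the second optional group if it used two characters of lookahead, the 
way the alt9 decision in the listing does. Alt10Prediction and predict are 
hypothetical names. This is a hand-written illustration, not what ANTLR 
emits.

```java
// Hypothetical helper, not ANTLR output: predicts the alternative for the
// second optional group using two characters of lookahead.
public class Alt10Prediction {
    // Returns 1 to enter the group, 2 to skip it (same convention as alt9/alt10).
    static int predict(String input, int index) {
        int alt10 = 2;
        if (index + 1 < input.length()
                && input.charAt(index) == '\u00D7'
                && input.charAt(index + 1) == '\u0082') {
            alt10 = 1;
        }
        return alt10;
    }

    public static void main(String[] args) {
        // Input remaining after the first two characters of SHIN.
        System.out.println(predict("\u00d7\u00aa", 0)); // start of TUF
        System.out.println(predict("\u00d7\u0082", 0)); // the optional pair
    }
}
```

With this check the group is skipped when the next two characters are the 
start of TUF, and entered only when both characters of the pair are present.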



