[antlr-interest] ANTLR seems to be incorrectly generating a lexer

Andrew Haley aph at redhat.com
Thu Mar 18 10:42:48 PDT 2010


Consider this very simple grammar to recognize strings with no embedded '"'.
ANTLR seems to be generating an incorrect lexer for StringPart.

grammar small;

defaults	
    : StringPart EOF
    ;
	
StringPart
    :    ( ~ NonStringChars) *
    ;
    
fragment
NonStringChars
    :    '"'
    ;

Look inside smallLexer.java, and you will find:

    // $ANTLR start "StringPart"
    public final void mStringPart() throws RecognitionException {
        try {
            int _type = StringPart;
            int _channel = DEFAULT_TOKEN_CHANNEL;
            // /home/aph/ceylon/small.g:8:5: ( (~ NonStringChars )* )
            // /home/aph/ceylon/small.g:8:10: (~ NonStringChars )*
            {
            // /home/aph/ceylon/small.g:8:10: (~ NonStringChars )*
            loop1:
            do {
                int alt1=2;
                int LA1_0 = input.LA(1);

                if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' && LA1_0<='\uFFFF')) ) {
                    alt1=1;
                }


                switch (alt1) {
            	case 1 :
            	    // /home/aph/ceylon/small.g:8:12: ~ NonStringChars
            	    {

// ********************************************** Here's the bug:
            	    if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0004')||(input.LA(1)>='\u0006' && input.LA(1)<='\uFFFF') ) {
            	        input.consume();
// **************************************************************
            	    }
            	    else {
            	        MismatchedSetException mse = new MismatchedSetException(null,input);

What on Earth is
 
             input.LA(1)<='\u0004')||(input.LA(1)>='\u0006'

supposed to do?  It clearly excludes control character 5 (and, unlike
the loop test just above it, does not exclude '"'), but why?  If
I change the grammar for StringPart to

StringPart
    :    ( ~ '"') *
    ;
    
I get

            	    if ( (input.LA(1)>='\u0000' && input.LA(1)<='!')||(input.LA(1)>='#' && input.LA(1)<='\uFFFF') ) {
            	        input.consume();

which is right, I think.  So, replacing NonStringChars with '"' in the
grammar fixes the problem.
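
To make the difference between the two generated tests concrete, here is a minimal standalone sketch that compares them over the whole 16-bit range. The two conditions are copied from the generated code quoted above; the class name SetCheck and the method names viaFragment and viaLiteral are made up for illustration and are not part of ANTLR or of smallLexer.java.

```java
public class SetCheck {
    // Condition copied from the lexer generated for ( ~ NonStringChars )*
    static boolean viaFragment(int c) {
        return (c >= '\u0000' && c <= '\u0004') || (c >= '\u0006' && c <= '\uFFFF');
    }

    // Condition copied from the lexer generated for ( ~ '"' )*
    static boolean viaLiteral(int c) {
        return (c >= '\u0000' && c <= '!') || (c >= '#' && c <= '\uFFFF');
    }

    public static void main(String[] args) {
        // Report every character on which the two generated tests disagree.
        for (int c = 0; c <= 0xFFFF; c++) {
            if (viaFragment(c) != viaLiteral(c)) {
                System.out.printf("U+%04X: fragment=%b literal=%b%n",
                                  c, viaFragment(c), viaLiteral(c));
            }
        }
    }
}
```

The two tests disagree on exactly two characters: the fragment version rejects control character 5 and accepts '"', while the literal version does the opposite.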

This is all very strange.  It seems that the parser generator is
inlining NonStringChars but getting it wrong.

This is ANTLR 3.2, by the way.

Andrew.

