[antlr-interest] My parser hangs in an infinite loop on certain inputs

Matt Lowry mclowry at cs.adelaide.edu.au
Mon Sep 8 22:50:40 PDT 2003


Hi folks. I'm having some trouble with my grammer concering "multiset 
formation" constructs.
The language I'm doing a parser for supports expressions like:

{* foo , 3:bar *} 

and

{* foo ,.. bar *}


These denote a multiset formed from foo and 3 instances of bar, and a 
multiset formed from all sets in the range from foo to bar respectively.
My spec for these constructs is as follows:

multiset_formation
   :   D_MS_OPEN!
       content: multiset_formation_content
       D_MS_CLOSE!
       { #multiset_formation = #( [MULTISET_FORMATION, "multiset_form"], #content ); };

multiset_formation_content
   :   (expression D_RANGE)=> multiset_range_content
   |   optional_multiset_expr_list;

multiset_range_content
   : expression D_RANGE expression;

optional_multiset_expr_list
   : ( multiset_expr ( D_COMMA! multiset_expr )* )?;

multiset_expr
   : (expression D_COLON)=> quantified_multiset_expr
   | expression;

quantified_multiset_expr
   :   occur:expression D_COLON! val:expression
       { #quantified_multiset_expr = #([QUANT_MSET_EXPR, "quant_mset_expr"],#occur,#val); }; // ACK! BLECH!


Where D_MS_OPEN="{*",  D_MS_CLOSE="*}",. D_RANGE=",..", D_COMMA=',', and 
D_COLON=':'.

The problem is that the parser generated from this enters an infinite 
loop and fails to terminate if I feed it a simple input like:
{* foo , 3:bar *}

However other valid inputs such as
{* foo ,.. bar *}
or
{* *}
are accepted as they should be.

The infinite loop seems to be occuring as a result of the translated AST 
action in the "quantified_multiset_expr" rule. If I remove this line; i.e.

quantified_multiset_expr
   :  occur:expression D_COLON! val:expression;


Now suddenly everything works fine. The rules aren't producing an AST of 
the form I want though :(.

So, my question to the list:
Is this behaviour because me do bad in the grammer and haven't realised 
it, or because ANTLR has a bug?
Or because my installation of ANTLR is bodged?
(I've attached to full grammer and test input so people can try to 
replicate this.)

Also, perhaps someone could suggest an alternative to achieve what I want.
If the parser sees:
{* foo , 3:bar *}
then I want the parser to generate a tree something like:

multiset_contents -> expression_list -> expression -> foo
                                     -> quantified_expression -> 3
                                                              -> bar


If I can achieve this without an AST action, and at the same time avoid 
tickling an ANTLR bug, then could someone please tell me.

Cheers muchly!

-- 
---------------------------------------------------------------
"He never sat down to program without a crowbar close at hand."
  -- Stanislaw Lem
---------------------------------------------------------------


 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 

-------------- next part --------------


class Muck7Par extends Parser;
options {
    buildAST = true;
    k = 1;
}


tokens {

    EXPRESSION;
    PARENTHESIZED_EXPRESSION;
    SET_FORMATION;
    MULTISET_FORMATION;
    QUANT_MSET_EXPR;

    UNARY_PLUS;
    UNARY_MINUS;
    UNARY_HASH;

}



design_file:
        ( design_unit )* EOF!;

design_unit:
        expression D_SEMICOLON!;



expression
    : expr:precedence_0_expression { #expression = #( [EXPRESSION,"expr"] , #expr ); }
    ;

precedence_0_expression
    : precedence_1_expression
        (   ( D_REFLECT^ | D_EQUAL_GR_GR^ )
            right_operand:precedence_1_expression 
        )*
    ;

precedence_1_expression
    : left_operand:precedence_2_expression 
        (   D_EQUIV^
            right_operand:precedence_2_expression 
        )*
    ;


precedence_2_expression
    : left_operand:precedence_3_expression 
        (   ( D_IMPLIES^ | D_IMPLIED_BY^ | K_IMPLIES^ )
            right_operand:precedence_3_expression 
        )*
    ;


precedence_3_expression
    : left_operand:precedence_4_expression 
        (   ( K_OR^ | K_NOR^ | K_XOR^ | K_XNOR^ | K_MAX^ )
            right_operand:precedence_4_expression
        )*
    ;


precedence_4_expression
    : left_operand:precedence_5_expression 
        (   ( K_AND^ | K_NAND^ | K_MIN^ )
            right_operand:precedence_5_expression 
        )*
    ;


precedence_5_expression
    : left_operand:precedence_6_expression 
        (   ( D_EQUAL^ | D_NOT_EQUAL^ | D_LESS^ | D_GREATER^ | D_GRE_EQUAL^ | D_EQUAL_LESS^ | K_IN^ )
            right_operand:precedence_6_expression 
        )*
    ;


precedence_6_expression
    : left_operand:precedence_7_expression 
        (   K_SUB^
            right_operand:precedence_7_expression 
        )*
    ;


precedence_7_expression
    : left_operand:precedence_8_expression 
        (   ( D_PLUS^ | D_MINUS^ | D_AMPERSAND^ | D_AMPER_AMPER^ | D_VBAR_VBAR^ )
            right_operand:precedence_8_expression 
        )*
    ;


precedence_8_expression
    : left_operand:precedence_9_expression 
        (   ( D_ASTERISK^ | D_SOLIDUS^ | K_DIV^ | K_MOD^ | K_REM^ )
            right_operand:precedence_9_expression 
        )*
    ;


precedence_9_expression
    : left_operand:precedence_10_expression 
        (   ( D_CIRCUMFLEX^ | D_HASH^ )
            right_operand:precedence_10_expression 
        )?
    ;


precedence_10_expression
    : ( K_NOT^
       | D_PLUS^     { #D_PLUS.setType( UNARY_PLUS ); }
       | D_MINUS^    {#D_MINUS.setType( UNARY_MINUS );}
       | D_PERCENT^ 
       | D_HASH^     { #D_HASH.setType( UNARY_HASH ); }
       | D_TILDE^ 
      )
      unary_operand:precedence_10_expression 
    | pre_11_expr:precedence_11_expression
    ;



precedence_11_expression
    : (name D_AT)=> at_operation
    | (name D_APOSTROPHE)=> tick_operation
    | precedence_12_expression
    ;

at_operation
    : left_operand:name D_AT! right_operand:precedence_12_expression
    ;

tick_operation
    : operand:name D_APOSTROPHE!
    ;


precedence_12_expression
    : expression_primary
    ;
        

expression_primary
    : parenthesized_expression
    | literal
    | name
    | set_formation
    | multiset_formation
    ;


parenthesized_expression
    :   D_L_PAREN!
        expr: expression
        D_R_PAREN!
        { #parenthesized_expression = #( [PARENTHESIZED_EXPRESSION, "(expr)"] , #expr ); }
    ;


literal
    : ( STRING_LITERAL
       | CHARACTER_LITERAL
       | UNDEF_LITERAL_CHAR
       | UNDEF_LITERAL_STRING
       | INFINITY_LITERAL
       | DECIMAL_LITERAL
       | BASED_LITERAL
       | BINARY_BITVECTOR
       | OCTAL_BITVECTOR
       | HEXADECIMAL_BITVECTOR
      )
    ;


name
    : label ( D_PERIOD! label )*
    ;

label
    : SIMPLE_LABEL
    | operator_interp_label
    ;

operator_interp_label
    : OL_PERCENT  | OL_AMPERSAND  | OL_CIRCUMFLEX
    | OL_AT       | OL_APOSTROPHE | OL_BIN_HASH    | OL_UNA_HASH    | OL_TILDE
    | OL_ASTERISK | OL_SOLIDUS    | OL_BIN_PLUS    | OL_UNA_PLUS    | OL_BIN_MINUS | OL_UNA_MINUS
    | OL_LESS     | OL_EQUAL      | OL_GREATER     | OL_EQUIV       | OL_GRE_EQUAL | OL_EQUAL_LESS | OL_NOT_EQUAL
    | OL_IMPLIES  | OL_IMPLIED_BY | OL_AMPER_AMPER | OL_COLON_COLON | OL_VBAR_VBAR | OL_REFLECT    | OL_EQUAL_GR_GR
    | OLK_IMPLIES | OLK_OR        | OLK_NOR        | OLK_XOR        | OLK_XNOR     | OLK_MAX
    | OLK_AND     | OLK_NAND      | OLK_MIN        | OLK_IN         | OLK_SUB      | OLK_DIV
    | OLK_MOD     | OLK_REM       | OLK_NOT
    ;


set_formation
    :   D_L_BRACE!
        content: set_formation_content
        D_R_BRACE!
        { #set_formation = #( [SET_FORMATION, "set_form"] , #content ); }
    ;
        
set_formation_content
    :   (expression D_RANGE)=> (expression D_RANGE expression )
    |   optional_expr_list
    ;

optional_expr_list
    : ( expression ( D_COMMA! expression )* )?
    ;




multiset_formation
    :   D_MS_OPEN!
        content: multiset_formation_content
        D_MS_CLOSE!
        { #multiset_formation = #( [MULTISET_FORMATION, "multiset_form"], #content ); }
    ;
        
multiset_formation_content
    :   (expression D_RANGE)=> multiset_range_content
    |   optional_multiset_expr_list
    ;

multiset_range_content
    : expression D_RANGE expression
    ;

optional_multiset_expr_list
    : ( multiset_expr ( D_COMMA! multiset_expr )* )?
    ;

multiset_expr
    : (expression D_COLON)=> quantified_multiset_expr
    | expression
    ;

quantified_multiset_expr
    : occur:expression D_COLON! val:expression
        { #quantified_multiset_expr = #([QUANT_MSET_EXPR, "quant_mset_expr"],#occur,#val); } // ACK! BLECH!
        // would seem I need to eliminate the above AST action to avoid tickling ANTLR bug (???)
    ;






class Muck7Lex extends Lexer;
options {
    charVocabulary = '\u0003'..'\uFFFE';  // disallow -1,0,1,2 as character values ...
    // charVocabulary = '\u0003'..'\u00FF';  // disallow -1,0,1,2 as character values ...
    k = 6; // need at least 6-char lookahead to distinguish between long and short unicode literals.
    // caseSensitive=true;
    // caseSensitiveLiterals=false;
}


tokens {
    K_IMPLIES = "implies";
    K_OR      = "or";
    K_NOR     = "nor";
    K_XOR     = "xor";
    K_XNOR    = "xnor";
    K_MAX     = "max";
    K_AND     = "and";
    K_NAND    = "nand";
    K_MIN     = "min";
    K_IN      = "in";
    K_SUB     = "sub";
    K_DIV     = "div";
    K_MOD     = "mod";
    K_REM     = "rem";
    K_NOT     = "not";

    OL_PERCENT     = "%__";
    OL_AMPERSAND   = "__&__";
    OL_CIRCUMFLEX  = "__^__";

    OL_AT          = "__ at __";
    OL_APOSTROPHE  = "__'__";
    OL_BIN_HASH    = "__#__";
    OL_UNA_HASH    = "#__";
    OL_TILDE       = "~__";

    OL_ASTERISK    = "__*__";
    OL_SOLIDUS     = "__/__";
    OL_BIN_PLUS    = "__+__";
    OL_UNA_PLUS    = "+__";
    OL_BIN_MINUS   = "__-__";
    OL_UNA_MINUS   = "-__";

    OL_LESS        = "__<__";
    OL_EQUAL       = "__=__";
    OL_GREATER     = "__>__";
    OL_EQUIV       = "__==__";
    OL_GRE_EQUAL   = "__>=__";
    OL_EQUAL_LESS  = "__=<__";
    OL_NOT_EQUAL   = "__/=__";

    OL_IMPLIES     = "__=>__";
    OL_IMPLIED_BY  = "__<=__";

    OL_AMPER_AMPER = "__&&__";
    OL_COLON_COLON = "__::__";
    OL_VBAR_VBAR   = "__||__";

    OL_REFLECT     = "__>>>__";
    OL_EQUAL_GR_GR = "__=>>__";

    OLK_IMPLIES    = "__implies__";
    OLK_OR         = "__or__";
    OLK_NOR        = "__nor__";
    OLK_XOR        = "__xor__";
    OLK_XNOR       = "__xnor__";
    OLK_MAX        = "__max__";
    OLK_AND        = "__and__";
    OLK_NAND       = "__nand__";
    OLK_MIN        = "__min__";
    OLK_IN         = "__in__";
    OLK_SUB        = "__sub__";
    OLK_DIV        = "__div__";
    OLK_MOD        = "__mod__";
    OLK_REM        = "__rem__";
    OLK_NOT        = "not__";

}


protected GRAPHIC_CHARACTER: '\u0003'..'\u00FF';
// protected GRAPHIC_CHARACTER: '\u0003'..'\uFFFE';


protected SHORT_UNICODE_SPECIFIER
    : ('u'! | 'U'!) '+'! ( (BASED_DIGIT)? BASED_DIGIT )? BASED_DIGIT BASED_DIGIT BASED_DIGIT BASED_DIGIT;

protected LONG_UNICODE_SPECIFIER
    : ('u'! | 'U'!) '-'! BASED_DIGIT BASED_DIGIT BASED_DIGIT BASED_DIGIT 
                         BASED_DIGIT BASED_DIGIT BASED_DIGIT BASED_DIGIT;

CHARACTER_LITERAL
    : '\''! ( GRAPHIC_CHARACTER | SHORT_UNICODE_SPECIFIER | LONG_UNICODE_SPECIFIER ) '\''!;

STRING_LITERAL
    : '"'! ( {LA(2)=='"'}? "\"\"" | ~'"' )* '"'! ;



protected BINARY_DIGIT
    : '0'..'1';

protected OCTAL_DIGIT
    : '0'..'7';

protected DECIMAL_DIGIT
    : '0'..'9';

protected BASED_DIGIT 
    : '0'..'9' | 'A'..'F' | 'a'..'f';


BINARY_BITVECTOR
    : ( 'b' | 'B' ) '"' (BINARY_DIGIT)+ '"';

OCTAL_BITVECTOR
    : ( 'o' | 'O' ) '"' (OCTAL_DIGIT)+ '"';

HEXADECIMAL_BITVECTOR
    : ( 'x' | 'X' ) '"' (BASED_DIGIT)+ '"';


protected DECIMAL_DIGITS
    : ( DECIMAL_DIGIT )+;

protected BASED_DIGITS
    : ( BASED_DIGIT )+;

protected EXPONENT
    : ( 'e' | 'E' ) ( '+' | '-' )? DECIMAL_DIGITS;


DECIMAL_LITERAL
    : DECIMAL_DIGITS ( '.' DECIMAL_DIGITS )? ( EXPONENT )?;

BASED_LITERAL
    : DECIMAL_DIGIT '\\' BASED_DIGITS ( '.' BASED_DIGITS )? ( '\\' EXPONENT )?;



SIMPLE_LABEL
    : LABEL_START_CHAR ( LABEL_CONNECT_CHAR | LABEL_EXTEND_CHAR )* ;

protected LABEL_START_CHAR
    : 'a'..'z' | 'A'..'Z' ;

protected LABEL_CONNECT_CHAR
    : '_';

protected LABEL_EXTEND_CHAR
    : LABEL_START_CHAR | DECIMAL_DIGIT;




// NB inclusion of # and ~, exclusion of and "
/*
protected DELIMITER
    : D_PERCENT     | D_AMPERSAND   | D_CIRCUMFLEX | D_VERT_BAR
    | D_AT          | D_APOSTROPHE  | D_HASH       | D_TILDE
    | D_ASTERISK    | D_SOLIDUS     | D_PLUS       | D_MINUS
    | D_COMMA       | D_LOW_LINE    | D_PERIOD     | D_COLON      | D_SEMICOLON
    | D_L_PAREN     | D_R_PAREN     | D_L_SBRACKET | D_R_SBRACKET | D_L_BRACE   | D_R_BRACE
    | D_LESS        | D_EQUAL       | D_GREATER    | D_EQUIV      | D_GRE_EQUAL | D_EQUAL_LESS | D_NOT_EQUAL
    | D_IMPLIES     | D_IMPLIED_BY  | D_MS_OPEN    | D_MS_CLOSE
    | D_AMPER_AMPER | D_COLON_COLON | D_VBAR_VBAR  | D_GOES_TO
    | D_FN_START    | D_FN_END      | D_RANGE      | D_REFLECT    | D_EQUAL_GR_GR
    ;
*/

D_PERCENT:     '%';
D_AMPERSAND:   '&';
D_CIRCUMFLEX:  '^';
D_VERT_BAR:    '|';

D_AT:          '@';
D_APOSTROPHE:  '\'';
D_HASH:        '#';
D_TILDE:       '~';

D_ASTERISK:    '*';
D_SOLIDUS:     '/';
D_PLUS:        '+';
D_MINUS:       '-';

D_COMMA:       ',';
D_LOW_LINE:    '_';
D_PERIOD:      '.';
D_COLON:       ':';
D_SEMICOLON:   ';';

D_L_PAREN:     '(';
D_R_PAREN:     ')';
D_L_SBRACKET:  '[';
D_R_SBRACKET:  ']';
D_L_BRACE:     '{';
D_R_BRACE:     '}';

D_LESS:        '<';
D_EQUAL:       '=';
D_GREATER:     '>';
D_EQUIV:       "==";
D_GRE_EQUAL:   ">=";
D_EQUAL_LESS:  "=<";
D_NOT_EQUAL:   "/=";

D_IMPLIES:     "=>";
D_IMPLIED_BY:  "<=";

D_MS_OPEN:     "{*";
D_MS_CLOSE:    "*}";

D_AMPER_AMPER: "&&";
D_COLON_COLON: "::";
D_VBAR_VBAR:   "||";
D_GOES_TO:     "->";

D_FN_START:    "<*";
D_FN_END:      "*>";
D_RANGE:       ",..";
D_REFLECT:     ">>>";
D_EQUAL_GR_GR: "=>>";



OL_PERCENT:     "%__";
OL_AMPERSAND:   "__&__";
OL_CIRCUMFLEX:  "__^__";

OL_AT:          "__ at __";
OL_APOSTROPHE:  "__'__";
OL_BIN_HASH:    "__#__";
OL_UNA_HASH:    "#__";
OL_TILDE:       "~__";

OL_ASTERISK:    "__*__";
OL_SOLIDUS:     "__/__";
OL_BIN_PLUS:    "__+__";
OL_UNA_PLUS:    "+__";
OL_BIN_MINUS:   "__-__";
OL_UNA_MINUS:   "-__";

OL_LESS:        "__<__";
OL_EQUAL:       "__=__";
OL_GREATER:     "__>__";
OL_EQUIV:       "__==__";
OL_GRE_EQUAL:   "__>=__";
OL_EQUAL_LESS:  "__=<__";
OL_NOT_EQUAL:   "__/=__";

OL_IMPLIES:     "__=>__";
OL_IMPLIED_BY:  "__<=__";

OL_AMPER_AMPER: "__&&__";
OL_COLON_COLON: "__::__";
OL_VBAR_VBAR:   "__||__";

OL_REFLECT:     "__>>>__";
OL_EQUAL_GR_GR: "__=>>__";

OLK_IMPLIES:    "__implies__";
OLK_OR:         "__or__";
OLK_NOR:        "__nor__";
OLK_XOR:        "__xor__";
OLK_XNOR:       "__xnor__";
OLK_MAX:        "__max__";
OLK_AND:        "__and__";
OLK_NAND:       "__nand__";
OLK_MIN:        "__min__";
OLK_IN:         "__in__";
OLK_SUB:        "__sub__";
OLK_DIV:        "__div__";
OLK_MOD:        "__mod__";
OLK_REM:        "__rem__";



UNDEF_LITERAL_STRING: "_|_";
UNDEF_LITERAL_CHAR: '\u22A5';

protected UNDEFINED_LITERAL: UNDEF_LITERAL_STRING | UNDEF_LITERAL_CHAR;

INFINITY_LITERAL: '\u221E';




COMMENT
    : ( COMMENT_LINE | COMMENT_BLOCK ) { $setType(Token.SKIP); }
    ;

protected COMMENT_LINE
    : "//" (~'\n')* LOCAL_NEWLINE;
// NOTE really this should be "//" (~LOCAL_NEWLINE)* LOCAL_NEWLINE; but ANTLR won't support that.

protected COMMENT_BLOCK
    : "/*" 
        ( options {greedy=false;}: ( ~'\n' | LOCAL_NEWLINE )
        )*
        "*/"
    ;




SEPERATOR
    : ( UNICODE_SEPERATOR
        | LOCAL_NEWLINE
        | '\u0009' | '\u000B' | '\u000C' | '\r' // HozTab, VerTab, FF, CR
        )
        { $setType(Token.SKIP); }
    ;

protected UNICODE_SEPERATOR
    : '\u0020' | '\u00A0' | '\u1680' | '\u180E' | '\u2000'..'\u200A' 
    | '\u200B' | '\u202F' | '\u205F' | '\u3000' // Zs
    | '\u2028' // Zl
    | '\u2029' // Zp
    ;

// The draft sez cause of a new line is implementation-defined; I'm defining it as the "LINE FEED" character
//  (note that CR characters are still "seperators", so DOS-style line breaking appears as
//  <SEPERATOR><NEWLINE> and MAC-style as <NEWLINE><SEPERATOR>
protected LOCAL_NEWLINE
    : '\n'  { newline();};


-------------- next part --------------
{**};
{* foo ,.. bar *};
{* foo , 3:bar *};

//foo # { a ,.. b };
//foo * { bar, b, bing };
//{};
//10 + 20 * 30;
//10 + 50 * (-100);
//10 + 50 * -100;
//foo1 - foo2 + foo3;
//"foo" # #"bar" and "ning";
//#2 # #3 and (10 + 20 + 30);
//"*****" + 'U+0000';
//2 / ( 10 + 50 ) >>> "*****" + 'U+0000';
//"foo" # # "bar" + "ning";


More information about the antlr-interest mailing list