[antlr-interest] Can anyone help with a basic grammar problem in Antlr 3?

Fri Oct 14 05:30:05 PDT 2011

Hi Ross,

Just a quick response right now (it's bedtime here)...

This is sounding like something best handled in the AST rather than
trying to bung it all in the initial parse. I would look at making the
grammar lenient enough to accept your naked method calls and then
disambiguate on a subsequent pass when you have a symbol table hooked
up.
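
A rough sketch of what I mean by that second pass, written in plain Java
rather than against ANTLR's tree classes. Node, Kind and resolve are made-up
names for illustration only, and the rule it applies (an identifier counts as
a variable only once an assignment to it has been seen) is an assumption
borrowed from what you describe doing in your other projects:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the tree the parser would build; none of these
// names come from ANTLR or from the BasicLang grammar.
class ResolvePass {
    enum Kind { ASSIGN, IDENTIFIER, VAR_REF, METHOD_CALL, LITERAL }

    static final class Node {
        Kind kind;
        final String text;
        final List<Node> children = new ArrayList<Node>();

        Node(Kind kind, String text, Node... kids) {
            this.kind = kind;
            this.text = text;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Walk the tree in source order, retagging each bare IDENTIFIER as either
    // a variable reference or a paren-less method call.
    static void resolve(Node node, Set<String> assigned) {
        switch (node.kind) {
            case ASSIGN: {
                Node target = node.children.get(0);
                Node value = node.children.get(1);
                // The right-hand side only sees names assigned earlier.
                resolve(value, assigned);
                target.kind = Kind.VAR_REF;
                assigned.add(target.text);
                break;
            }
            case IDENTIFIER:
                node.kind = assigned.contains(node.text)
                        ? Kind.VAR_REF
                        : Kind.METHOD_CALL;
                break;
            default:
                for (Node child : node.children) {
                    resolve(child, assigned);
                }
        }
    }

    public static void main(String[] args) {
        Set<String> assigned = new HashSet<String>();
        // a = foo   (foo has never been assigned)
        Node first = new Node(Kind.ASSIGN, "=",
                new Node(Kind.IDENTIFIER, "a"), new Node(Kind.IDENTIFIER, "foo"));
        // b = a     (a was assigned by the statement above)
        Node second = new Node(Kind.ASSIGN, "=",
                new Node(Kind.IDENTIFIER, "b"), new Node(Kind.IDENTIFIER, "a"));
        resolve(first, assigned);
        resolve(second, assigned);
        System.out.println(first.children.get(1).kind);   // METHOD_CALL
        System.out.println(second.children.get(1).kind);  // VAR_REF
    }
}
```

The parser stays simple because it only ever emits IDENTIFIER; the decision
about what each one means is deferred until the set of assigned names is
available.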

I'll have a proper look tomorrow, and perhaps others here might
suggest a better approach in the meantime.

Michael

On 14 October 2011 23:11, Ross Bamford <roscoml at gmail.com> wrote:
> Hi Michael,
> I've tried adding IDENTIFIER into the atom rule, and that solved the
> assignment expression issues I was having, but unfortunately it broke the
> method call parsing completely - the issue seemed to stem from the parser
> not being able to differentiate between a function call, and a plain
> identifier - cue many and varied MismatchedSetExceptions :(. After much
> debugging with ANTLRWorks (what a great tool by the way!!) the only thing
> I've found that fixes this is to make parens mandatory on a method call.
> Having done that after making the other changes you suggested I've now got
> my tests passing (except obviously the ones that call methods without
> parens). I've also made some other changes based on another of your messages
> I found in the archives, to make input such as "foo(1 2)" throw an exception
> rather than just printing a warning and ignoring the "2" - so thanks again!
> :)
> I would still really like to have optional parens on method calls, which I
> know is difficult... I have a little experience with parsing Ruby, for
> example, and I know there's an ambiguous case such as:
> a = foo
> where it's unclear whether foo is a var or a method. In other projects I've managed
> to handle that at runtime by treating foo as a var only if an assignment
> statement such as "foo = 1" has been seen before, and I seem to recall I
> simply parsed all such cases as a bare IDENTIFIER to be resolved later, but
> I have no idea how to make this work in my current Antlr parser.
> Here is my grammar as it stands now:
> /* ************* GRAMMAR **************** */
> grammar BasicLang;
> options {
>     output=AST;
>     ASTLabelType=CommonTree;
>     backtrack=true;
>     memoize=true;
> }
> tokens {
>   ASSIGN;
>   METHOD_CALL;
>   SELF;
> }
> @parser::members {
>   /* throw exceptions rather than silently failing... */
> protected void mismatch(IntStream input, int ttype, BitSet follow)
>   throws RecognitionException
> {
>   throw new MismatchedTokenException(ttype, input);
> }
> @Override
> public Object recoverFromMismatchedSet(IntStream input, RecognitionException
> e, BitSet follow)
>   throws RecognitionException
> {
>   throw e;
> }
> @Override
> protected Object recoverFromMismatchedToken(IntStream input, int
> ttype, BitSet follow) throws RecognitionException {
>     if (ttype == RPAREN) {
>         throw new UnwantedTokenException(); // ("Invalid input in argument list");
>     }
>     return super.recoverFromMismatchedToken(input, ttype, follow);
> }
> }
> @rulecatch {
> // throw exceptions rather than silently failing...
> catch (RecognitionException e) {
>   throw e;
> }
> }
> start_rule
>   :   script
>   ;
> script
>   :   statement+
>   |   EOF!
>   ;
> statement
>   :   expr terminator!
>   ;
>
> expr
>   :   assign_expr
>   |   math_expr
>   |   meth_call_expr
>   ;
>
> meth_call_expr
>   :   (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER?
> func_call_expr)
>   |   (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL STRING_LITERAL?
> func_call_expr)
>   ;
>
> fragment
> func_call_expr
>   :   IDENTIFIER^ argument_list
>   ;
>
> fragment
> argument_list
>   :   LPAREN! (expr (COMMA! expr)*)? RPAREN!
>   ;
>
> assign_expr
>   :   IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
>   ;
> math_expr
>   :   mult_expr ((ADD^|SUB^) mult_expr)*
>   ;
> mult_expr
>   :   pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
>   ;
>
> pow_expr
>   :   unary_expr ((POW^) unary_expr)*
>   ;
>
> unary_expr
>   :   NOT? atom
>   ;
> atom
>   :     literal
>   |     IDENTIFIER
>   |     LPAREN! expr RPAREN!
>   ;
>
> literal
>   :     HEX_LITERAL
>   |     DECIMAL_LITERAL
>   |     OCTAL_LITERAL
>   |     FLOATING_POINT_LITERAL
> //  |     REGEXP_LITERAL
>   |     STRING_LITERAL
>   ;
>
> terminator
>   :     TERMINATOR
>   |     EOF
>   ;
> POW :   '^' ;
> MOD :   '%' ;
> ADD :   '+' ;
> SUB :   '-' ;
> DIV :   '/' ;
> MUL :   '*' ;
> NOT :   '!' ;
> ASSIGN
>     :   '='
>     ;
>
> LPAREN
>     :   '('
>     ;
>
> RPAREN
>     :   ')'
>     ;
>
> COMMA
>     :   ','
>     ;
>
> DOT :   '.' ;
> CHARACTER_LITERAL
>     :   '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
>     ;
> STRING_LITERAL
>     :  '"' ( EscapeSequence | ~('\\'|'"') )* '"'
>     ;
> /*
> REGEXP_LITERAL
>     :  '/' ( EscapeSequence | ~('\\'|'"') )* '/'
>     ;
> */
> HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;
> DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;
> OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;
> fragment
> HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
> fragment
> IntegerTypeSuffix
>   : ('l'|'L')
>   | ('u'|'U')  ('l'|'L')?
>   ;
> FLOATING_POINT_LITERAL
>     :   ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
>     |   '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
>     |   ('0'..'9')+ Exponent? FloatTypeSuffix?
>   ;
> fragment
> Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
> fragment
> FloatTypeSuffix : ('f'|'F'|'d'|'D') ;
> fragment
> EscapeSequence
>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
>     |   OctalEscape
>     ;
> fragment
> OctalEscape
>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7')
>     ;
> fragment
> UnicodeEscape
>     :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
>     ;
> COMMENT
>     :   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>     ;
> LINE_COMMENT
>     : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>     ;
>
> IDENTIFIER
>   : ID_LETTER (ID_LETTER|'0'..'9')*
>   ;
>
> fragment
> ID_LETTER
>   : '$'
>   | 'A'..'Z'
>   | 'a'..'z'
>   | '_'
>   ;
> TERMINATOR
>   : '\r'? '\n'
>   | ';'
>   ;
> WS  :  (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
>     |  '...' '\r'? '\n'  {$channel=HIDDEN;}
>     ;
> /* ************* END ************************ */
>
>
> On Fri, Oct 14, 2011 at 2:12 AM, Michael
> Bedward <michael.bedward at gmail.com> wrote:
>>
>> Ah yes, it's getting stuck on the "b" because you haven't told it that
>> identifiers are atoms...
>>
>> atom
>>  :     literal
>>  |     IDENTIFIER
>>  |     LPAREN! expr RPAREN!
>>  ;
>>
>> Michael
>>
>>
>> On 14 October 2011 11:42, Ross Bamford <roscoml at gmail.com> wrote:
>> > Hi Michael,
>> > Thanks for the response! And thanks for being kind about my basic
>> > grammar :)
>> > I tried reordering the alternatives in expr as you suggested, and am a
>> > bit closer now than I was before! It's definitely parsing
>> > a = 1 + (b = 2) fine, but I'm still seeing NoViableAltExceptions with,
>> > for example, "a=b+(c=2)". Looking at the debugger step by step it seems
>> > to still be trying to grab "b+" as a token, rather than seeing the "b"
>> > then the "+", which is why I tried adding IDENTIFIER to the "atom" rule
>> > previously. I tried adding it again after making the change you
>> > suggested but it still caused a lot of problems in other places.
>> > Thanks,
>> > Ross
>> >
>> > On Fri, Oct 14, 2011 at 1:04 AM, Michael Bedward
>> > <michael.bedward at gmail.com>
>> > wrote:
>> >>
>> >> Hi Ross,
>> >>
>> >> For a bit of a newbie that's a nice grammar - much neater than any
>> >> of mine :)
>> >>
>> >> If you rearrange your expr rule so that the assign_expr is the first
>> >> alternative...
>> >>
>> >> expr
>> >>  :   assign_expr
>> >>  |   math_expr
>> >>  |   meth_call_expr
>> >>  ;
>> >>
>> >> ...I think that the grammar should be able to parse things like
>> >> a = 1 + (b = 2)
>> >>
>> >> Michael
>> >>
>> >>
>> >> On 14 October 2011 10:38, Ross Bamford <roscoml at gmail.com> wrote:
>> >> > Hi Guys,
>> >> >
>> >> > I'm a bit of an Antlr newbie - I've successfully created and used
>> >> > Antlr 2 grammars in the past but mostly by trial and error, and
>> >> > occasionally random hacking until it "worked"... I've recently
>> >> > become involved in a project that requires a very simple scripting
>> >> > language, and have decided to use Antlr 3 for this, but I'm getting
>> >> > stuck quite early on - I think I have a fundamental problem in my
>> >> > grammar but after much hacking at it and trying various ideas I got
>> >> > from Google, I'm still hitting a bit of a brick wall.
>> >> >
>> >> > Basically I'm at the point where I have mathematical expressions and
>> >> > various literal types implemented, and am adding in function and
>> >> > method call handling - I want to be able to call methods with or
>> >> > without an explicit receiver, and in my language parentheses are
>> >> > optional (I know that complicates matters a bit but it's what I need
>> >> > for this project). I've written the grammar so far against a set of
>> >> > functional tests, and all is well with most of my syntax. Here is my
>> >> > grammar:
>> >> >
>> >> > /* ********* GRAMMAR *********** */
>> >> > grammar BasicLang;
>> >> >
>> >> > options {
>> >> >    output=AST;
>> >> >    ASTLabelType=CommonTree;
>> >> >    backtrack=true;
>> >> >    memoize=true;
>> >> > }
>> >> >
>> >> > tokens {
>> >> >  ASSIGN;
>> >> >  METHOD_CALL;
>> >> >  SELF;
>> >> > }
>> >> >
>> >> > @parser::members {
>> >> >  /* throw exceptions rather than silently failing... */
>> >> > protected void mismatch(IntStream input, int ttype, BitSet follow)
>> >> >  throws RecognitionException
>> >> > {
>> >> >  throw new MismatchedTokenException(ttype, input);
>> >> > }
>> >> >  public Object recoverFromMismatchedSet(IntStream input,
>> >> > RecognitionException e, BitSet follow)
>> >> >  throws RecognitionException
>> >> > {
>> >> >  throw e;
>> >> > }
>> >> > }
>> >> >
>> >> > @rulecatch {
>> >> > // throw exceptions rather than silently failing...
>> >> > catch (RecognitionException e) {
>> >> >  throw e;
>> >> > }
>> >> > }
>> >> >
>> >> > start_rule
>> >> >  :   script
>> >> >  ;
>> >> >
>> >> > script
>> >> >  :   statement*
>> >> >  ;
>> >> >
>> >> > statement
>> >> >  :   expr terminator!
>> >> >  ;
>> >> >
>> >> > expr
>> >> >  :   math_expr
>> >> >  |   assign_expr
>> >> >  |   meth_call_expr
>> >> >  ;
>> >> >
>> >> > meth_call_expr
>> >> >  :   (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER?
>> >> > func_call_expr)
>> >> >  |   (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL
>> >> > STRING_LITERAL?
>> >> > func_call_expr)
>> >> >  ;
>> >> >
>> >> > fragment
>> >> > func_call_expr
>> >> >  :   IDENTIFIER^ argument_list
>> >> >  ;
>> >> >
>> >> > fragment
>> >> > argument_list
>> >> >  :   LPAREN!? (expr (COMMA! expr)*)? RPAREN!?
>> >> >  ;
>> >> >
>> >> > assign_expr
>> >> >  :   IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
>> >> >  ;
>> >> >
>> >> > math_expr
>> >> >  :   mult_expr ((ADD^|SUB^) mult_expr)*
>> >> >  ;
>> >> >
>> >> > mult_expr
>> >> >  :   pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
>> >> >  ;
>> >> >
>> >> > pow_expr
>> >> >  :   unary_expr ((POW^) unary_expr)*
>> >> >  ;
>> >> >
>> >> > unary_expr
>> >> >  :   NOT? atom
>> >> >  ;
>> >> >
>> >> > atom
>> >> >  :     literal
>> >> >  |     LPAREN! expr RPAREN!
>> >> >  ;
>> >> >
>> >> > literal
>> >> >  :     HEX_LITERAL
>> >> >  |     DECIMAL_LITERAL
>> >> >  |     OCTAL_LITERAL
>> >> >  |     FLOATING_POINT_LITERAL
>> >> > //  |     REGEXP_LITERAL
>> >> >  |     STRING_LITERAL
>> >> >  ;
>> >> >
>> >> > terminator
>> >> >  :     TERMINATOR
>> >> >  |     EOF
>> >> >  ;
>> >> >
>> >> > POW :   '^' ;
>> >> > MOD :   '%' ;
>> >> > ADD :   '+' ;
>> >> > SUB :   '-' ;
>> >> > DIV :   '/' ;
>> >> > MUL :   '*' ;
>> >> > NOT :   '!' ;
>> >> >
>> >> > ASSIGN
>> >> >    :   '='
>> >> >    ;
>> >> >
>> >> > LPAREN
>> >> >    :   '('
>> >> >    ;
>> >> >
>> >> > RPAREN
>> >> >    :   ')'
>> >> >    ;
>> >> >
>> >> > COMMA
>> >> >    :   ','
>> >> >    ;
>> >> >
>> >> > DOT :   '.' ;
>> >> >
>> >> > CHARACTER_LITERAL
>> >> >    :   '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
>> >> >    ;
>> >> >
>> >> > STRING_LITERAL
>> >> >    :  '"' ( EscapeSequence | ~('\\'|'"') )* '"'
>> >> >    ;
>> >> >
>> >> > /*
>> >> > REGEXP_LITERAL
>> >> >    :  '/' ( EscapeSequence | ~('\\'|'"') )* '/'
>> >> >    ;
>> >> > */
>> >> >
>> >> > HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;
>> >> >
>> >> > DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;
>> >> >
>> >> > OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;
>> >> >
>> >> > fragment
>> >> > HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
>> >> >
>> >> > fragment
>> >> > IntegerTypeSuffix
>> >> >  : ('l'|'L')
>> >> >  | ('u'|'U')  ('l'|'L')?
>> >> >  ;
>> >> >
>> >> > FLOATING_POINT_LITERAL
>> >> >    :   ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
>> >> >    |   '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
>> >> >    |   ('0'..'9')+ Exponent? FloatTypeSuffix?
>> >> >  ;
>> >> >
>> >> > fragment
>> >> > Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
>> >> >
>> >> > fragment
>> >> > FloatTypeSuffix : ('f'|'F'|'d'|'D') ;
>> >> >
>> >> > fragment
>> >> > EscapeSequence
>> >> >    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
>> >> >    |   OctalEscape
>> >> >    ;
>> >> >
>> >> > fragment
>> >> > OctalEscape
>> >> >    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>> >> >    |   '\\' ('0'..'7') ('0'..'7')
>> >> >    |   '\\' ('0'..'7')
>> >> >    ;
>> >> >
>> >> > fragment
>> >> > UnicodeEscape
>> >> >    :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
>> >> >    ;
>> >> > COMMENT
>> >> >    :   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>> >> >    ;
>> >> >
>> >> > LINE_COMMENT
>> >> >    : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>> >> >    ;
>> >> >
>> >> > IDENTIFIER
>> >> >  : ID_LETTER (ID_LETTER|'0'..'9')*
>> >> >  ;
>> >> >
>> >> > fragment
>> >> > ID_LETTER
>> >> >  : '$'
>> >> >  | 'A'..'Z'
>> >> >  | 'a'..'z'
>> >> >  | '_'
>> >> >  ;
>> >> >
>> >> > TERMINATOR
>> >> >  : '\r'? '\n'
>> >> >  | ';'
>> >> >  ;
>> >> >
>> >> > WS  :  (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
>> >> >    |  '...' '\r'? '\n'  {$channel=HIDDEN;}
>> >> >    ;
>> >> >
>> >> > /* *************** END *************** */
>> >> >
>> >> > With this grammar, my tests so far pass, and I'm building trees for
>> >> > simple arithmetic operations and the like, including ones involving
>> >> > variables (e.g. a+1), and method calls are working as I expect,
>> >> > including when passing method call results as args to another method
>> >> > call. But I cannot get input such as "a=b+(c=1)" to parse at all -
>> >> > debugging in ANTLRWorks shows me that the problem occurs when the
>> >> > parser sees the "b+", at which point it throws a
>> >> > NoViableAltException.
>> >> >
>> >> > I guessed this was because the parser doesn't see the identifier as
>> >> > an atom, so tries to parse it with the + symbol. So, I tried adding
>> >> > IDENTIFIER as an alternative to the atom rule - but that just broke
>> >> > the parser completely and many of my tests failed with a
>> >> > MismatchedSetException.
>> >> >
>> >> > I've been playing with this for a few days now but no matter what I
>> >> > do, even when I get the type of syntax I mentioned above (the assign
>> >> > statement) working, I invariably break something else (or more
>> >> > often, everything! :( ). I'm really hoping someone out there will
>> >> > take pity on me and give me some insight into what I'm doing wrong.
>> >> >
>> >> > Thanks in advance!
>> >> > --
>> >> > Ross Bamford - roscoml at gmail.com
>> >> >
>> >> >
>> >
>> >
>>
>