[antlr-interest] Can anyone help with a basic grammar problem in Antlr 3?

Ross Bamford roscoml at gmail.com
Fri Oct 14 05:11:11 PDT 2011


Hi Michael,

I've tried adding IDENTIFIER into the atom rule, and that solved the
assignment expression issues I was having, but unfortunately it broke the
method call parsing completely. The issue seemed to stem from the parser
not being able to differentiate between a function call and a plain
identifier - cue many and varied MismatchedSetExceptions :(. After much
debugging with ANTLRWorks (what a great tool by the way!!) the only thing
I've found that fixes this is to make parens mandatory on a method call.
Having done that, and having made the other changes you suggested, I've now
got my tests passing (except obviously the ones that call methods without
parens). I've also made some other changes based on another of your messages
I found in the archives, to make input such as "foo(1 2)" throw an exception
rather than just printing a warning and ignoring the "2" - so thanks again!
:)
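
To illustrate why mandatory parens remove the ambiguity described above, here
is a minimal sketch in plain Java (no ANTLR involved). It assumes tokens are
simple strings; the class name CallLookahead and its methods are hypothetical
and are not part of the grammar below. With a required "(", one token of
lookahead after an identifier is enough to tell a call from a plain identifier.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one token of lookahead separates calls from plain identifiers. */
final class CallLookahead {
    enum Kind { METHOD_CALL, PLAIN_IDENTIFIER, OTHER }

    /** Tokens are plain strings here; a real ANTLR parser would compare token types instead. */
    static Kind classify(List<String> tokens, int index) {
        if (index >= tokens.size() || !isIdentifier(tokens.get(index))) {
            return Kind.OTHER;
        }
        // With mandatory parens, an identifier is a call only if "(" follows it.
        boolean followedByParen =
            index + 1 < tokens.size() && tokens.get(index + 1).equals("(");
        return followedByParen ? Kind.METHOD_CALL : Kind.PLAIN_IDENTIFIER;
    }

    /** Mirrors the IDENTIFIER lexer rule: a letter, '$' or '_', then letters, digits, '$' or '_'. */
    static boolean isIdentifier(String text) {
        return text.matches("[A-Za-z_$][A-Za-z0-9_$]*");
    }

    public static void main(String[] args) {
        // prints METHOD_CALL
        System.out.println(classify(Arrays.asList("foo", "(", "1", ")"), 0));
        // prints PLAIN_IDENTIFIER
        System.out.println(classify(Arrays.asList("a", "=", "foo"), 2));
    }
}
```

Once the parens become optional, the second case ("a = foo") can no longer be
decided this way, which is the problem discussed next.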

I would still really like to have optional parens on method calls, which I
know is difficult... I have a little experience with parsing Ruby, for
example, and I know there's an ambiguous case such as:

a = foo

where it isn't clear whether foo should be treated as a var or a method. In
other projects I've managed to handle that at runtime by treating foo as a
var only if an assignment statement such as "foo = 1" has been seen before,
and I seem to recall I simply parsed all such cases as a bare IDENTIFIER to
be resolved later, but I have no idea how to make this work in my current
Antlr parser.
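
The runtime approach described above can be sketched roughly as follows. This
is only an illustration in plain Java, independent of ANTLR: the class
BareIdentifierResolver and its method names are hypothetical, and it assumes
the parser has already emitted a bare IDENTIFIER node whose meaning is decided
while walking the tree.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides at runtime whether a bare identifier is a variable or a call. */
final class BareIdentifierResolver {
    enum Kind { VARIABLE, METHOD_CALL }

    // Names that have appeared on the left-hand side of an assignment so far.
    private final Set<String> assigned = new HashSet<>();

    /** Record that an assignment such as "foo = 1" has been evaluated. */
    void recordAssignment(String name) {
        assigned.add(name);
    }

    /** A bare name is a variable only if it was assigned earlier; otherwise it is a zero-argument call. */
    Kind resolve(String name) {
        return assigned.contains(name) ? Kind.VARIABLE : Kind.METHOD_CALL;
    }

    public static void main(String[] args) {
        BareIdentifierResolver resolver = new BareIdentifierResolver();
        System.out.println(resolver.resolve("foo")); // prints METHOD_CALL
        resolver.recordAssignment("foo");            // simulates "foo = 1"
        System.out.println(resolver.resolve("foo")); // prints VARIABLE
    }
}
```

The design choice here is to keep the grammar simple (a bare IDENTIFIER is
always accepted) and to push the var-or-method decision into whatever code
evaluates the AST.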

Here is my grammar as it stands now:

/* ************* GRAMMAR **************** */
grammar BasicLang;

options {
    output=AST;
    ASTLabelType=CommonTree;
    backtrack=true;
    memoize=true;
}

tokens {
  ASSIGN;
  METHOD_CALL;
  SELF;
}

@parser::members {
  /* throw exceptions rather than silently failing... */
  protected void mismatch(IntStream input, int ttype, BitSet follow)
    throws RecognitionException
  {
    throw new MismatchedTokenException(ttype, input);
  }

  @Override
  public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)
    throws RecognitionException
  {
    throw e;
  }

  @Override
  protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow)
    throws RecognitionException
  {
    if (ttype == RPAREN) {
      throw new UnwantedTokenException(); // ("Invalid input in argument list");
    }
    return super.recoverFromMismatchedToken(input, ttype, follow);
  }
}

@rulecatch {
// throw exceptions rather than silently failing...
catch (RecognitionException e) {
  throw e;
}
}

start_rule
  :   script
  ;

script
  :   statement+
  |   EOF!
  ;

statement
  :   expr terminator!
  ;

expr
  :   assign_expr
  |   math_expr
  |   meth_call_expr
  ;

meth_call_expr
  :   (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER? func_call_expr)
  |   (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL STRING_LITERAL? func_call_expr)
  ;

fragment
func_call_expr
  :   IDENTIFIER^ argument_list
  ;

fragment
argument_list
  :   LPAREN! (expr (COMMA! expr)*)? RPAREN!
  ;

assign_expr
  :   IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
  ;

math_expr
  :   mult_expr ((ADD^|SUB^) mult_expr)*
  ;

mult_expr
  :   pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
  ;

pow_expr
  :   unary_expr ((POW^) unary_expr)*
  ;

unary_expr
  :   NOT? atom
  ;

atom
  :     literal
  |     IDENTIFIER
  |     LPAREN! expr RPAREN!
  ;

literal
  :     HEX_LITERAL
  |     DECIMAL_LITERAL
  |     OCTAL_LITERAL
  |     FLOATING_POINT_LITERAL
//  |     REGEXP_LITERAL
  |     STRING_LITERAL
  ;

terminator
  :     TERMINATOR
  |     EOF
  ;

POW :   '^' ;
MOD :   '%' ;
ADD :   '+' ;
SUB :   '-' ;
DIV :   '/' ;
MUL :   '*' ;
NOT :   '!' ;

ASSIGN
    :   '='
    ;

LPAREN
    :   '('
    ;

RPAREN
    :   ')'
    ;

COMMA
    :   ','
    ;

DOT :   '.' ;

CHARACTER_LITERAL
    :   '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
    ;

STRING_LITERAL
    :  '"' ( EscapeSequence | ~('\\'|'"') )* '"'
    ;

/*
REGEXP_LITERAL
    :  '/' ( EscapeSequence | ~('\\'|'"') )* '/'
    ;
*/

HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;

DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;

OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;

fragment
HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
IntegerTypeSuffix
  : ('l'|'L')
  | ('u'|'U')  ('l'|'L')?
  ;

FLOATING_POINT_LITERAL
    :   ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
    |   '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
    |   ('0'..'9')+ Exponent? FloatTypeSuffix?
  ;

fragment
Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

fragment
FloatTypeSuffix : ('f'|'F'|'d'|'D') ;

fragment
EscapeSequence
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
    |   OctalEscape
    ;

fragment
OctalEscape
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UnicodeEscape
    :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
    ;
COMMENT
    :   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

LINE_COMMENT
    : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

IDENTIFIER
  : ID_LETTER (ID_LETTER|'0'..'9')*
  ;

fragment
ID_LETTER
  : '$'
  | 'A'..'Z'
  | 'a'..'z'
  | '_'
  ;

TERMINATOR
  : '\r'? '\n'
  | ';'
  ;

WS  :  (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
    |  '...' '\r'? '\n'  {$channel=HIDDEN;}
    ;

/* ************* END ************************ */


On Fri, Oct 14, 2011 at 2:12 AM, Michael Bedward <michael.bedward at gmail.com> wrote:

> Ah yes, it's getting stuck on the "b" because you haven't told it that
> identifiers are atoms...
>
> atom
>  :     literal
>  |     IDENTIFIER
>  |     LPAREN! expr RPAREN!
>  ;
>
> Michael
>
>
> On 14 October 2011 11:42, Ross Bamford <roscoml at gmail.com> wrote:
> > Hi Michael,
> > Thanks for the response! And thanks for being kind about my basic grammar :)
> > I tried reordering the alternatives in expr as you suggested, and am a bit
> > closer now than I was before! It's definitely parsing a = 1 + (b = 2) fine,
> > but I'm still seeing NoViableAltExceptions with, for example "a=b+(c=2)".
> > Looking at the debugger step by step it seems to still be trying to grab
> > "b+" as a token, rather than seeing the "b" then the "+", which is why I
> > tried adding IDENTIFIER to the "atom" rule previously. I tried adding it
> > again after making the change you suggested but it still caused a lot of
> > problems in other places.
> > Thanks,
> > Ross
> >
> > On Fri, Oct 14, 2011 at 1:04 AM, Michael Bedward <michael.bedward at gmail.com> wrote:
> >>
> >> Hi Ross,
> >>
> >> For a bit of a newbie that's a nice grammar - much neater than any of mine :)
> >>
> >> If you rearrange your expr rule so that the assign_expr is the first
> >> alternative...
> >>
> >> expr
> >>  :   assign_expr
> >>  |   math_expr
> >>  |   meth_call_expr
> >>  ;
> >>
> >> ...I think that the grammar should be able to parse things like
> >> a = 1 + (b = 2)
> >>
> >> Michael
> >>
> >>
> >> On 14 October 2011 10:38, Ross Bamford <roscoml at gmail.com> wrote:
> >> > Hi Guys,
> >> >
> >> > I'm a bit of an Antlr newbie - I've successfully created and used Antlr 2
> >> > grammars in the past but mostly by trial and error, and occasionally random
> >> > hacking until it "worked"... I've recently become involved in a project that
> >> > requires a very simple scripting language, and have decided to use Antlr 3
> >> > for this, but I'm getting stuck quite early on - I think I have a
> >> > fundamental problem in my grammar but after much hacking at it and trying
> >> > various ideas I got from Google, I'm still hitting a bit of a brick wall.
> >> >
> >> > Basically I'm at the point where I have mathematical expressions and various
> >> > literal types implemented, and am adding in function and method call
> >> > handling - I want to be able to call methods with or without an explicit
> >> > receiver, and in my language parentheses are optional (I know that
> >> > complicates matters a bit but it's what I need for this project). I've
> >> > written the grammar so far against a set of functional tests, and all is
> >> > well with most of my syntax. Here is my grammar:
> >> >
> >> > /* ********* GRAMMAR *********** */
> >> > grammar BasicLang;
> >> >
> >> > options {
> >> >    output=AST;
> >> >    ASTLabelType=CommonTree;
> >> >    backtrack=true;
> >> >    memoize=true;
> >> > }
> >> >
> >> > tokens {
> >> >  ASSIGN;
> >> >  METHOD_CALL;
> >> >  SELF;
> >> > }
> >> >
> >> > @parser::members {
> >> >  /* throw exceptions rather than silently failing... */
> >> > protected void mismatch(IntStream input, int ttype, BitSet follow)
> >> >  throws RecognitionException
> >> > {
> >> >  throw new MismatchedTokenException(ttype, input);
> >> > }
> >> >  public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)
> >> >  throws RecognitionException
> >> > {
> >> >  throw e;
> >> > }
> >> > }
> >> >
> >> > @rulecatch {
> >> > // throw exceptions rather than silently failing...
> >> > catch (RecognitionException e) {
> >> >  throw e;
> >> > }
> >> > }
> >> >
> >> > start_rule
> >> >  :   script
> >> >  ;
> >> >
> >> > script
> >> >  :   statement*
> >> >  ;
> >> >
> >> > statement
> >> >  :   expr terminator!
> >> >  ;
> >> >
> >> > expr
> >> >  :   math_expr
> >> >  |   assign_expr
> >> >  |   meth_call_expr
> >> >  ;
> >> >
> >> > meth_call_expr
> >> >  :   (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER? func_call_expr)
> >> >  |   (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL STRING_LITERAL? func_call_expr)
> >> >  ;
> >> >
> >> > fragment
> >> > func_call_expr
> >> >  :   IDENTIFIER^ argument_list
> >> >  ;
> >> >
> >> > fragment
> >> > argument_list
> >> >  :   LPAREN!? (expr (COMMA! expr)*)? RPAREN!?
> >> >  ;
> >> >
> >> > assign_expr
> >> >  :   IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
> >> >  ;
> >> >
> >> > math_expr
> >> >  :   mult_expr ((ADD^|SUB^) mult_expr)*
> >> >  ;
> >> >
> >> > mult_expr
> >> >  :   pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
> >> >  ;
> >> >
> >> > pow_expr
> >> >  :   unary_expr ((POW^) unary_expr)*
> >> >  ;
> >> >
> >> > unary_expr
> >> >  :   NOT? atom
> >> >  ;
> >> >
> >> > atom
> >> >  :     literal
> >> >  |     LPAREN! expr RPAREN!
> >> >  ;
> >> >
> >> > literal
> >> >  :     HEX_LITERAL
> >> >  |     DECIMAL_LITERAL
> >> >  |     OCTAL_LITERAL
> >> >  |     FLOATING_POINT_LITERAL
> >> > //  |     REGEXP_LITERAL
> >> >  |     STRING_LITERAL
> >> >  ;
> >> >
> >> > terminator
> >> >  :     TERMINATOR
> >> >  |     EOF
> >> >  ;
> >> >
> >> > POW :   '^' ;
> >> > MOD :   '%' ;
> >> > ADD :   '+' ;
> >> > SUB :   '-' ;
> >> > DIV :   '/' ;
> >> > MUL :   '*' ;
> >> > NOT :   '!' ;
> >> >
> >> > ASSIGN
> >> >    :   '='
> >> >    ;
> >> >
> >> > LPAREN
> >> >    :   '('
> >> >    ;
> >> >
> >> > RPAREN
> >> >    :   ')'
> >> >    ;
> >> >
> >> > COMMA
> >> >    :   ','
> >> >    ;
> >> >
> >> > DOT :   '.' ;
> >> >
> >> > CHARACTER_LITERAL
> >> >    :   '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
> >> >    ;
> >> >
> >> > STRING_LITERAL
> >> >    :  '"' ( EscapeSequence | ~('\\'|'"') )* '"'
> >> >    ;
> >> >
> >> > /*
> >> > REGEXP_LITERAL
> >> >    :  '/' ( EscapeSequence | ~('\\'|'"') )* '/'
> >> >    ;
> >> > */
> >> >
> >> > HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;
> >> >
> >> > DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;
> >> >
> >> > OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;
> >> >
> >> > fragment
> >> > HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
> >> >
> >> > fragment
> >> > IntegerTypeSuffix
> >> >  : ('l'|'L')
> >> >  | ('u'|'U')  ('l'|'L')?
> >> >  ;
> >> >
> >> > FLOATING_POINT_LITERAL
> >> >    :   ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
> >> >    |   '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
> >> >    |   ('0'..'9')+ Exponent? FloatTypeSuffix?
> >> >  ;
> >> >
> >> > fragment
> >> > Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
> >> >
> >> > fragment
> >> > FloatTypeSuffix : ('f'|'F'|'d'|'D') ;
> >> >
> >> > fragment
> >> > EscapeSequence
> >> >    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
> >> >    |   OctalEscape
> >> >    ;
> >> >
> >> > fragment
> >> > OctalEscape
> >> >    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
> >> >    |   '\\' ('0'..'7') ('0'..'7')
> >> >    |   '\\' ('0'..'7')
> >> >    ;
> >> >
> >> > fragment
> >> > UnicodeEscape
> >> >    :   '\\' 'u' HexDigit HexDigit HexDigit HexDigit
> >> >    ;
> >> > COMMENT
> >> >    :   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
> >> >    ;
> >> >
> >> > LINE_COMMENT
> >> >    : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
> >> >    ;
> >> >
> >> > IDENTIFIER
> >> >  : ID_LETTER (ID_LETTER|'0'..'9')*
> >> >  ;
> >> >
> >> > fragment
> >> > ID_LETTER
> >> >  : '$'
> >> >  | 'A'..'Z'
> >> >  | 'a'..'z'
> >> >  | '_'
> >> >  ;
> >> >
> >> > TERMINATOR
> >> >  : '\r'? '\n'
> >> >  | ';'
> >> >  ;
> >> >
> >> > WS  :  (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
> >> >    |  '...' '\r'? '\n'  {$channel=HIDDEN;}
> >> >    ;
> >> >
> >> > /* *************** END *************** */
> >> >
> >> > With this grammar, my tests so far pass, and I'm building trees for simple
> >> > arithmetic operations and the like, including involving variables (e.g. a+1
> >> > and the like), and method calls are working as I expect, including when
> >> > passing method call results as args to another method call. But I cannot get
> >> > input such as "a=b+(c=1)" to parse at all - debugging in AntlrWorks shows me
> >> > that the problem occurs when the parser sees the "b+", when it throws a
> >> > NoViableAlt exception.
> >> >
> >> > I guessed this was because the parser doesn't see the identifier as an atom,
> >> > so tries to parse it with the + symbol. So, I tried adding IDENTIFIER as an
> >> > alternative to the atom rule - but that just broke the parser completely and
> >> > many of my tests failed with an exception - MismatchedSetException.
> >> >
> >> > I've been playing with this for a few days now but no matter what I do, even
> >> > when I get the type of syntax I mentioned above (the assign statement)
> >> > working, I invariably break something else (or more often, everything! :( ).
> >> > I'm really hoping someone out there will take pity on me and give me some
> >> > insight into what I'm doing wrong.
> >> >
> >> > Thanks in advance!
> >> > --
> >> > Ross Bamford - roscoml at gmail.com